🚨BREAKING: New Python library for agentic data processing and ETL with AI
Introducing DocETL.
Here's what you need to know:
1. What is DocETL?
It's a tool for creating and executing data processing pipelines, especially suited for complex document processing tasks.
It offers:
- An interactive UI playground
- A Python package for running production pipelines
2. DocWrangler
DocWrangler, the interactive playground, helps you iteratively develop your pipeline:
- Experiment with different prompts and see results in real time
- Build your pipeline step by step
- Export your finalized pipeline configuration for production use
K-means is one of the most powerful algorithms for data scientists.
But it's confusing for beginners. Let's fix that:
1. What is K-means?
K-means is a popular unsupervised machine learning algorithm used for clustering. It's a core algorithm for customer segmentation, inventory categorization, market segmentation, and even anomaly detection.
2. Unsupervised:
K-means is an unsupervised algorithm that is used on data with no labels or predefined outcomes. The goal is not to predict a target output, but to explore the structure of the data by identifying patterns, clusters, or relationships within the dataset.
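To make the "no labels" idea concrete, here is a minimal pure-Python sketch of the standard K-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points). The function name `kmeans` and its parameters (`iterations`, `init`, `seed`) are illustrative choices, not part of any library; in practice you would likely reach for an existing implementation.

```python
import random


def kmeans(points, k, iterations=100, init=None, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups.

    `init` optionally supplies the starting centroids; otherwise k points
    are sampled at random. Returns (centroids, labels).
    """
    rng = random.Random(seed)
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = []

    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        labels = []
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            idx = dists.index(min(dists))
            labels.append(idx)
            clusters[idx].append(p)

        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for idx, members in enumerate(clusters):
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[idx])

        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids

    return centroids, labels


# Two visually obvious groups, with no labels supplied.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(data, k=2, init=[(0, 0), (10, 10)])
```

Note that the algorithm never sees a target column: the group structure comes entirely from distances between points.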
A new paper shows how you can predict real purchase intent without asking people.
It reaches ~90% of human test–retest reliability.
Here's what's inside the 28-page paper:
1. Problem with direct Likert from LLMs:
When you ask LLMs to output 1–5 ratings directly, the distributions are too narrow/skewed and don’t look like human survey data, limiting usefulness for concept testing.
2. The fix:
Have the LLM write a short free-text purchase-intent statement, then map that text onto a 5-point Likert score using embedding cosine similarity to predefined anchor sentences (i.e., semantic matching instead of raw numbers).
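A rough sketch of that mapping step, under loud assumptions: the anchor sentences below are made up for illustration (they are not the paper's), and `embed` is a toy bag-of-words stand-in for a real sentence-embedding model. Turning the similarities into a distribution by simple normalization is also an assumption, not necessarily what the paper does.

```python
import math
import re
from collections import Counter

# Hypothetical anchor sentences, one per Likert point (not from the paper).
ANCHORS = {
    1: "I would definitely not buy this product",
    2: "I would probably not buy this product",
    3: "I am unsure whether I would buy this product",
    4: "I would probably buy this product",
    5: "I would definitely buy this product",
}


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def likert_from_text(statement):
    """Map a free-text statement to (closest rating, distribution over 1-5)."""
    vec = embed(statement)
    sims = {r: cosine(vec, embed(anchor)) for r, anchor in ANCHORS.items()}
    total = sum(sims.values())
    if total:
        # Assumed normalization: similarities rescaled to sum to 1.
        dist = {r: s / total for r, s in sims.items()}
    else:
        # No overlap with any anchor: fall back to a uniform distribution.
        dist = {r: 1 / len(sims) for r in sims}
    best = max(sims, key=sims.get)
    return best, dist


rating, distribution = likert_from_text("I am unsure whether this is for me")
```

A bag-of-words vector cannot tell "definitely buy" from "definitely not buy" very well, which is exactly why a real pipeline would swap `embed` for a proper sentence-embedding model.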