Using LLMs to classify buying intent: what actually works
Classifying buying intent with LLMs is mostly about evidence quotation, temperature pinning, and not asking the model dumb questions. Here's what worked for us.

Asking an LLM "is this a sales lead?" is the wrong question. We tried it. We got a model that classifies 80% of Reddit posts as leads because it has no idea what a lead actually is and is trained to be helpful.
The right approach is decomposition: relevance first with embeddings, intent second with a generative model, evidence quotation forced before the verdict, and a tight JSON schema with no creative slack. We've shipped this on hundreds of customer profiles. Here's what actually held up in production.
The LLM is not your classifier. The prompt is your classifier. The LLM is the conveyor belt.
Two-stage beats one-stage every time.
The first instinct is to throw the whole post at one big LLM call and ask "is this a buyer?" Don't. The cost-precision math is bad and the failure modes are obscure.
The shape that works:
Stage 1: Relevance filter
- Embed the post (small model, e.g. text-embedding-3-small)
- Cosine sim against the user's ICP profile embedding
- If sim < 0.62, drop. Skip stage 2 entirely.
- Cost: ~$0.00002 per post
Stage 2: Intent classifier
- Single Claude Sonnet 4.6 call, temperature 0
- Forces evidence quotation before verdict
- Returns structured JSON
- Cost: ~$0.002 per post
Stage 1 drops about 85% of posts before they ever hit stage 2. That's a 100x cost reduction and a 50x latency reduction with no precision loss. We've measured this on multiple customer profiles and it's never been close.
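The stage-1 gate is small enough to show in full. This is a minimal sketch, not our production code: it assumes you've already fetched embeddings (e.g. from text-embedding-3-small) and just applies the cosine cutoff.

```python
import numpy as np

SIM_THRESHOLD = 0.62  # stage-1 cutoff from the pipeline description

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def passes_relevance(post_emb, icp_emb, threshold=SIM_THRESHOLD):
    """Stage 1: only posts clearing the threshold go on to the LLM call."""
    return cosine_sim(post_emb, icp_emb) >= threshold
```

Everything below the threshold never costs you a generative call, which is where the 100x comes from.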
The prompt that actually works.
Here's the production shape, lightly reformatted for readability.
You are a buying-intent classifier. The user sells the following product:
PRODUCT: {one_line_product_description}
ICP: {ideal_customer_profile_summary}
COMPETITORS: {comma_separated_competitor_list}
Below is a post from {platform}. Read it carefully and answer in structured JSON.
POST:
"""
{post_title}
{post_body}
"""
Step 1: In one sentence, what is the OP asking or describing?
Step 2: Extract any of the following if present, quoting exact spans:
- budget mention (currency or per-month figure)
- deadline language (time-bound decision phrase)
- tools they have already evaluated
- competitor name + frustration verb
- specific use case description
Step 3: Classify the OP's intent as exactly one of:
- buyer_now (deciding within 30 days, has current pain)
- buyer_soon (researching, 30-90 days out)
- tire_kicker (curious, no urgency or pain)
- venting (complaining, no purchase intent)
- off_topic (does not match the product)
Step 4: Return ONLY this JSON, no prose:
{
"intent": "<class>",
"confidence": <0.0-1.0>,
"evidence": ["<exact quoted span>", ...],
"summary": "<one sentence>"
}
If you cannot quote at least one evidence span for a non-off_topic verdict,
return off_topic with confidence 0.5.
Five things in that prompt are doing the work.
1. Product, ICP, and competitor list inline. Without this, the model classifies in the abstract and says "buyer" for everything that mentions a tool. With it, the model has a yardstick.
2. Numbered steps. This is chain-of-thought without the chain-of-thought handwave. It forces the model to extract before it judges. Skip the steps and precision drops.
3. Quoted evidence spans. The model has to point at the exact words that justify the verdict. In our evals, roughly 1 in 12 verdicts would otherwise rest on fabricated evidence; forced quotation catches them, because the model can't quote what isn't there.
4. The forced fallback. "If you cannot quote at least one evidence span for a non-off_topic verdict, return off_topic." This single line cut our false-buyer rate by about 30% in the first eval pass. Models love to classify; they need permission to abstain.
5. JSON-only output. No markdown, no preamble. The downstream pipeline parses this. Any prose breaks the parser.
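Point 5 only holds if you validate hard on your side. Here's a minimal validator for that contract (names are ours, not any real library): anything that fails the JSON parse, the intent enum, the confidence range, or the verbatim-evidence check collapses to the abstain verdict from the prompt's fallback rule.

```python
import json

VALID_INTENTS = {"buyer_now", "buyer_soon", "tire_kicker", "venting", "off_topic"}
FALLBACK = {"intent": "off_topic", "confidence": 0.5, "evidence": [], "summary": ""}

def parse_verdict(raw: str, post_text: str) -> dict:
    """Validate the model's JSON; any violation collapses to the abstain verdict."""
    try:
        v = json.loads(raw)
    except json.JSONDecodeError:
        return dict(FALLBACK)
    if v.get("intent") not in VALID_INTENTS:
        return dict(FALLBACK)
    try:
        conf = float(v.get("confidence", -1))
    except (TypeError, ValueError):
        return dict(FALLBACK)
    if not 0.0 <= conf <= 1.0:
        return dict(FALLBACK)
    # Evidence spans must appear in the post verbatim -- this is the
    # hallucination check from point 3, enforced in code, not just in the prompt.
    if v["intent"] != "off_topic":
        spans = v.get("evidence", [])
        if not spans or not all(s in post_text for s in spans):
            return dict(FALLBACK)
    return v
```

The verbatim check is the important line: a model that invents a "$5K budget" span fails `s in post_text` and gets demoted to off_topic automatically.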
Temperature 0, always. There is no debate.
This is a classifier. Creativity is a bug. Set temperature to 0 and never touch it again. We've watched teams set temp to 0.3 because "it sounds nicer" and then wonder why the same post gets two different verdicts on retry.
If you need diversity for some reason — you don't, for this — get it from prompt variation, not sampling.
What breaks (and how we found out).
Keyword-based filters as a primary signal. Early on we shipped a keyword-trigger pre-filter ("only run the LLM if 'crm' or 'pipeline' appears"). Looked smart. Was dumb. Buyers say "we're losing track of deals in spreadsheets." That post never matches "crm" but is a textbook CRM buyer. Embedding-based relevance catches it. Keyword filters miss it.
Asking "is this a lead?" We tried this in week one. The model said yes to 80% of inputs because it's trained to be helpful and "lead" is fuzzy. Switching to a 5-class enum with explicit definitions cut the false positive rate to roughly 18% on the same dataset.
No evidence requirement. Without forced quotation, the model invents budget figures and deadlines that aren't in the post. Operators report "wait, the post never said $5K" — and they're right, the model hallucinated it. Forced quotation kills this.
Single shared prompt across all customers. A SaaS prompt does not work for a real estate agent. Each customer's product, ICP, and competitor list change the meaning of the same Reddit post. We inject those per-customer.
The four false-positive traps we filter for.
Once you have an intent classifier that mostly works, the residual errors fall into four buckets. We named and filtered each.
The venter. "I hate Salesforce, it's so expensive, why does enterprise software suck." High emotional valence, no purchase intent. The classifier catches most of these via the venting class but the borderline cases need a second filter on emotional valence words (hate, suck, garbage, terrible) without action verbs (looking, evaluating, switching).
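The second-pass vent filter is a plain word-set test. The word lists below are the examples from this section, not an exhaustive lexicon; tune both lists per category.

```python
import re

VALENCE = {"hate", "suck", "sucks", "garbage", "terrible"}
ACTION = {"looking", "evaluating", "switching", "comparing"}

def looks_like_pure_vent(text: str) -> bool:
    """Borderline-venting check: emotional valence words with no action verbs."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return bool(words & VALENCE) and not (words & ACTION)
```

A post that hates Salesforce *and* says "looking at alternatives" passes through to the buyer classes; a post that only hates Salesforce gets held back.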
The dev evaluating. A developer at a competitor checking out the space. Posts in r/programming asking "what do people think of Tool X" — they're not buying, they're benchmarking. Filter on account history; if the OP's recent posts are all dev-tool questions, weight intent down.
The student/researcher. "Hi, I'm doing my MBA capstone on CRM tools, can anyone share thoughts?" Easy to spot from the prompt phrasing. We added "academic_research" as a sub-class of off_topic and the model picks it up reliably.
The competitor doing recon. Sock puppet account, 3 weeks old, asking suspiciously specific questions about your product. Account-age signals in the post anatomy piece catch most of these before the LLM ever sees them.
How we evaluate without a labeled dataset.
Most operators don't have 5,000 labeled examples to train on. Neither did we. We built the eval out of operator feedback.
Every signal in the dashboard has a thumbs up / thumbs down. The thumbs become labels. After 200 labels per customer profile, we have enough signal to compute:
- Precision at top decile (what % of the highest-scored posts were actually buyers)
- Recall on flagged false negatives (what % of "good ones we missed" did the system catch on next iteration)
- Drift over time (are the same prompts performing the same way 60 days in)
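The first of those metrics is simple enough to compute inline. This sketch assumes each classified post is a (model score, operator thumbs-up) pair, which is exactly what the dashboard feedback gives you.

```python
def precision_at_top_decile(scored: list[tuple[float, bool]]) -> float:
    """scored: (model_score, operator_thumbs_up) pairs.
    Precision over the top 10% of posts ranked by model score."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    k = max(1, len(ranked) // 10)  # top decile, at least one post
    top = ranked[:k]
    return sum(1 for _, thumbs_up in top if thumbs_up) / k
```

With 200 labels per profile, the top decile is 20 posts, which is enough to see movement between 30-day retunes.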
We retune prompts and thresholds every 30 days based on this. The single best diagnostic is reading the bottom 10 false positives and the bottom 10 false negatives — the patterns jump out and you can update the prompt or the score weights to address them.
Cost math at scale.
For a single customer monitoring 12 subreddits + 6 HN keyword sets, we see roughly:
- 800 raw posts/comments per day
- 120 surviving the embedding relevance filter (15%)
- 22 classified as some form of buyer (18% of survivors)
- 8 in the buyer_now or buyer_soon class
The cost: $0.016 in embeddings + $0.24 in Sonnet calls = roughly $0.26 per day per customer, or $8 a month. That's the all-in classifier compute cost. The rest of the unit economics is enrichment and infra.
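The daily figure is straightforward to reproduce; per-call prices here are the approximate ones from the stage breakdown above, not quoted vendor pricing.

```python
EMBED_COST = 0.00002   # ~$ per embedded post (stage 1)
LLM_COST = 0.002       # ~$ per classified post (stage 2)

raw_posts = 800        # posts/comments per day
survivors = 120        # past the relevance filter (15%)

daily = raw_posts * EMBED_COST + survivors * LLM_COST
monthly = daily * 30
print(f"${daily:.3f}/day, ${monthly:.2f}/month")
```

Note that stage 2 dominates: the embeddings are a rounding error, which is why the relevance filter's 85% drop rate matters so much.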
The full enrichment pipeline that takes the buyer_now posts and converts them to verified emails is in the enrichment workflows piece. And the AI-vs-human reply tradeoff is laid out in the AI reply generator dilemma.
When this stops working.
Three honest failure modes:
Non-English posts. Sonnet handles Spanish and Portuguese reasonably; everything else needs a per-language prompt and per-language ICP. We have customers running monolingual setups in 4 languages. Don't try to run a multilingual classifier with one prompt.
Bleeding-edge categories. If your category is so new that buyers don't have shared vocabulary yet, the classifier struggles because the relevant posts don't look like buyer posts. We work around this by switching the relevance step from category embeddings to pain-language embeddings ("losing track of...", "drowning in...", "spending hours on...").
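The pain-language workaround is a one-line change to the relevance step: score against the max similarity to any pain probe instead of a single category embedding. Sketch below; the probe phrases are the examples from this section, and the embedding call itself (same model as stage 1) is omitted.

```python
import numpy as np

PAIN_PROBES = [
    "losing track of deals in spreadsheets",
    "drowning in manual follow-ups",
    "spending hours on data entry",
]

def cosine_sim(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pain_relevance(post_emb, probe_embs):
    """Relevance as max similarity to ANY pain probe,
    instead of similarity to one category embedding."""
    return max(cosine_sim(post_emb, p) for p in probe_embs)
```

Max (not mean) is deliberate: a post only needs to sound like one kind of pain to be worth a stage-2 look.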
Rapid product pivots. If your customer changes their ICP every 6 weeks, the classifier never stabilizes because the eval data goes stale. We've learned to ask new customers to commit to an ICP for 60 days before we'll let them retune.
The full pipeline this slots into is laid out in the Reddit lead gen playbook.
FAQ
- Which model do we run for the intent classifier?
- Claude Sonnet 4.6 at temperature 0 with chain-of-thought before the verdict. We also tested GPT-5 and Gemini 2.5 — Sonnet wins on precision for our shape of prompt, mostly because it's better at refusing to classify when evidence is thin. The other models will happily mark a venting post as a buyer if you let them.
- Why not use embeddings alone for intent?
- Embeddings are excellent for relevance — does this post talk about CRMs at all — but they can't distinguish a buyer from a competitor doing recon. Both have nearly identical embeddings. You need a generative model to read the intent, not just match the topic.
- How do we evaluate the classifier without a labeled dataset?
- Operator feedback. Every signal in the dashboard has a thumbs up/down. We treat the thumbs as the eval set and re-tune the prompt + threshold every 30 days. After 200 labels per customer the precision is stable enough to stop fiddling. Before that, expect some drift.
- What's the latency budget for an intent classifier in production?
- Under 2 seconds end-to-end is the bar. Posts come in via cron, get embedded (~80ms), get classified (~1.2s for Sonnet on a typical post), and get pushed to the dashboard. If your classifier takes 6 seconds you'll back up your queue during HN's morning rush.
- Can a smaller model handle this if we're cost-sensitive?
- Haiku-class models will work for the relevance step but consistently underperform on the intent step in our evals — they miss the venting-vs-buying distinction roughly twice as often as Sonnet. The math usually favors paying 5x more per call for half the false positives, especially when each false positive is operator time wasted on a non-buyer.