How Shadow Inbox was built with OpenClaw: an AI agent factory case study
We built Shadow Inbox with OpenClaw, an AI agent factory. Here's the agent role split, the 1-week MVP, the surprises, and what we'd do differently.

We built the first working version of Shadow Inbox in a week. The production hardening took four. Most of the application code in that initial week was generated by agents inside OpenClaw — a research agent, a scrape agent, a classifier agent, and an ops agent — coordinated through a typed message bus. We did the architecture, the security, the prompts, and most of the production debugging.
This is the build log. What worked, what broke, what we'd do differently. No abstractions. The numbers and stories below are from our own commit history.
Agents wrote most of the boilerplate. Humans wrote everything that mattered. The trick was knowing which was which on day one.
The four agents and how the work split.
OpenClaw lets you define agents as roles with explicit inputs, outputs, and tool access. We split Shadow Inbox into four:
research-agent
role: read API docs, summarize endpoints, propose data models
tools: WebFetch, file_write (docs only), no code execution
scrape-agent
role: write fetchers for Reddit and HN APIs, rate-limit, dedupe
tools: code execution, file_write (src/scrapers), test runner
classifier-agent
role: design and iterate the relevance + intent prompts
tools: LLM eval harness, file_write (src/classifiers, evals)
ops-agent
role: schema migrations, CRUD scaffolds, queue config
tools: code execution, file_write (src/db, src/queue), migrations

The orchestrator (a thin OpenClaw wrapper we wrote) routes tasks to agents, and humans review every PR before merge. Critically, no agent had write access outside its own subtree, and no agent could touch the auth, billing, or secret-handling code at any point.
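The post doesn't show OpenClaw's actual config format, so here is a hedged sketch of what the role split above might look like as typed definitions. The `AgentRole`/`writeScope` names and the `canWrite` perimeter check are our assumptions, not OpenClaw's real API.

```typescript
// Hypothetical role definitions mirroring the four-agent split above.
// OpenClaw's real config format may differ; names here are assumptions.
type Tool = "WebFetch" | "code_exec" | "file_write" | "test_runner" | "llm_eval";

interface AgentRole {
  name: string;
  role: string;         // one-line contract for the agent
  tools: Tool[];
  writeScope: string[]; // subtrees the agent may write to; nothing else
}

const roles: AgentRole[] = [
  { name: "research-agent",
    role: "read API docs, summarize endpoints, propose data models",
    tools: ["WebFetch", "file_write"], writeScope: ["docs/"] },
  { name: "scrape-agent",
    role: "write fetchers for Reddit and HN APIs, rate-limit, dedupe",
    tools: ["code_exec", "file_write", "test_runner"], writeScope: ["src/scrapers/"] },
  { name: "classifier-agent",
    role: "design and iterate the relevance + intent prompts",
    tools: ["llm_eval", "file_write"], writeScope: ["src/classifiers/", "evals/"] },
  { name: "ops-agent",
    role: "schema migrations, CRUD scaffolds, queue config",
    tools: ["code_exec", "file_write"], writeScope: ["src/db/", "src/queue/"] },
];

// Enforce the security perimeter: no agent writes outside its own
// subtree, and nobody touches auth, billing, or secret handling.
const forbidden = ["src/auth/", "src/billing/", "src/secrets/"];

function canWrite(agent: AgentRole, path: string): boolean {
  if (forbidden.some((p) => path.startsWith(p))) return false;
  return agent.writeScope.some((p) => path.startsWith(p));
}
```

The point of making this a data structure rather than convention is that the orchestrator can reject an out-of-scope write before a PR ever exists.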
Week 1, day by day.
Day 1. We wrote the architecture doc by hand. Pipeline shape, agent role boundaries, security perimeter. Then we kicked off the research-agent against the Reddit and HN API docs and let it produce structured summaries.
Day 2. The scrape-agent wrote first-pass fetchers for both platforms. Rate limiting was wrong (it tried to use a token bucket library that didn't exist — a classic hallucination). We fixed it by hand and added a "preferred libraries" file the agent had to consult.
Day 3. The ops-agent built the Postgres schema and the BullMQ queue config. This was the cleanest agent output of the week — schema migrations are well-specified, well-documented territory. Almost no rewrites.
Day 4. The classifier-agent ran its first prompt iteration against 50 manually-labeled signals we'd hand-curated overnight. First precision was 41%. By end of day, after the agent ran 8 prompt variants and we picked the best two, precision was 67%.
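For context, the day-4 loop reduces to a single metric: precision over the hand-labeled set, computed per prompt variant. This is a generic sketch of that math; `LabeledSignal` and `rankVariants` are illustrative names, not Shadow Inbox's actual harness.

```typescript
// Precision over a hand-labeled signal set, the metric the day-4
// prompt-iteration loop optimizes. Field names are assumptions.
interface LabeledSignal {
  text: string;
  label: boolean;      // human ground truth: real buying signal?
  predicted: boolean;  // classifier output under one prompt variant
}

function precision(signals: LabeledSignal[]): number {
  const tp = signals.filter((s) => s.predicted && s.label).length;
  const fp = signals.filter((s) => s.predicted && !s.label).length;
  return tp + fp === 0 ? 0 : tp / (tp + fp);
}

// Rank prompt variants by precision so humans only review the top few.
function rankVariants(runs: Map<string, LabeledSignal[]>): string[] {
  return [...runs.entries()]
    .sort((a, b) => precision(b[1]) - precision(a[1]))
    .map(([name]) => name);
}
```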
Day 5. Glue and dashboard. The dashboard front-end was hand-written by us in Next.js because the agents kept producing UI that looked like 2014 Bootstrap. We're slower at UI but the result is shippable.
Day 6. End-to-end test. First real Reddit signal hit the dashboard at 4:42pm. We celebrated and immediately found three bugs.
Day 7. Bug fixing. Mostly the bugs that don't show up until you point a system at the actual internet — UTF-8 in usernames, posts that get deleted between scrape and classify, rate-limit backoff that wasn't actually exponential the way we thought.
The four-week production hardening.
The MVP worked. Production-grade did not exist. Weeks two through five were us fixing what agents couldn't see and what we hadn't anticipated.
Week 2: queue resilience. The BullMQ setup the ops-agent shipped was fine for happy-path. Real production needs dead-letter queues, retry-with-backoff schedules per error type, idempotency keys, and observability. We rewrote about 40% of the queue code.
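A sketch of what "retry-with-backoff per error type plus idempotency keys" can look like. The object shape follows BullMQ's job options (`attempts`, `backoff`, `jobId`), but the error taxonomy and the specific numbers are illustrative, not Shadow Inbox's actual values.

```typescript
// Per-error-type retry policy in the shape of BullMQ job options.
// The error kinds and numbers below are illustrative assumptions.
type ErrorKind = "network" | "rate_limit" | "validation";

interface RetryPolicy {
  attempts: number;                                          // bounded, never infinite
  backoff: { type: "exponential" | "fixed"; delay: number }; // delay in ms
}

const policies: Record<ErrorKind, RetryPolicy> = {
  network:    { attempts: 5, backoff: { type: "exponential", delay: 1_000 } },
  rate_limit: { attempts: 3, backoff: { type: "fixed",       delay: 60_000 } },
  validation: { attempts: 1, backoff: { type: "fixed",       delay: 0 } }, // bad input: don't retry
};

// Idempotency: derive a stable job id from the signal, so re-enqueueing
// the same signal is deduped by the queue instead of processed twice.
function jobOptions(kind: ErrorKind, signalId: string) {
  return { ...policies[kind], jobId: `classify:${signalId}` };
}
```

With BullMQ these options spread straight into `queue.add("classify", data, jobOptions("network", signal.id))`; jobs that exhaust their attempts end up in the failed set, where a listener can move them to a dead-letter queue.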
Week 3: classifier eval harness. We needed a real eval system, not a 50-signal hand-labeled set. We built a labeling UI for ourselves, ingested 2,000 signals over a week, and re-tuned the classifier prompts against the larger set. Precision climbed from 67% to 84%.
Week 4: secrets, auth, billing. Hand-written. We didn't let agents anywhere near these. The Stripe integration, the per-user API key vault, the OAuth flow for connecting a Reddit account — all human. About 1,800 lines of careful code we wouldn't trust to any agent today.
Week 5: monitoring, alerts, runbooks. Sentry, structured logs, PagerDuty, and the runbook docs that actually get used at 2am. Agents helped draft the runbook templates; humans wrote the actual procedures.
Bugs that surprised us.
Three failure modes we did not anticipate.
The hallucinated API endpoint. Mentioned in the FAQ. The classifier-agent invented /api/v3/intent_signals for Reddit, built a client for it, wrote tests for the client, and the tests passed because the agent also wrote the mocks. We caught it in staging when nothing returned data. The fix: forbid agents from writing both the integration code and the corresponding tests in the same PR. Tests had to come from a different agent or a human.
The infinite loop in retry logic. The ops-agent wrote a queue retry handler that, when a job failed with a network error, retried it. The retry also went through the same network path. The same network was down. The retry failed. The retry-of-retry was scheduled. After 6 hours we had 4.2 million queued retries of 12 actual jobs. The fix: bounded retries, jittered backoff, and a circuit breaker around the network dependency.
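The three fixes above — bounded attempts, jittered backoff, a circuit breaker — can be sketched as follows. Constants are illustrative; this is the general pattern, not the exact code that shipped.

```typescript
// Bounded retries, full-jitter backoff, and a minimal circuit breaker.
// All thresholds here are illustrative assumptions.
const MAX_ATTEMPTS = 5; // retries stop here no matter what

// "Full jitter": a random delay in [0, min(cap, base * 2^attempt)),
// so a thousand failed jobs don't all retry in the same second.
function backoffMs(attempt: number, baseMs = 500, capMs = 60_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}

// Circuit breaker around the flaky network dependency: after `threshold`
// consecutive failures, stop calling entirely for `cooldownMs`.
class CircuitBreaker {
  private failures = 0;
  private openUntil = 0;

  constructor(private threshold = 3, private cooldownMs = 30_000) {}

  canCall(now = Date.now()): boolean {
    return now >= this.openUntil;
  }

  recordSuccess(): void {
    this.failures = 0;
  }

  recordFailure(now = Date.now()): void {
    this.failures += 1;
    if (this.failures >= this.threshold) {
      this.openUntil = now + this.cooldownMs;
      this.failures = 0; // reset for the next probe after cooldown
    }
  }
}
```

The breaker is what prevents the retry-of-retry cascade: once the network dependency is known-down, new retries are refused instead of queued.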
Context overflow in the classifier-agent. When iterating prompts, the agent kept appending notes about each variant to its own context. After 30 iterations the context was 180K tokens and the agent was confused, contradicting itself, and proposing prompts it had already rejected. The fix: explicit context resets between iterations and a separate "memory" agent that maintained a curated history of what had been tried.
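A rough sketch of the curated-memory fix: each iteration starts from a clean context plus a short briefing, instead of 180K tokens of accumulated notes. `VariantResult`, the verdicts, and the briefing format are all assumptions for illustration.

```typescript
// Hypothetical curated memory for the prompt-iteration loop. The agent
// gets this short briefing each round, not its raw accumulated context.
interface VariantResult {
  prompt: string;
  precision: number;
  verdict: "kept" | "rejected";
}

class CuratedMemory {
  private history: VariantResult[] = [];

  record(result: VariantResult): void {
    this.history.push(result);
  }

  // What the classifier-agent sees at the start of each iteration:
  // the top performers plus an explicit do-not-repeat list.
  briefing(topK = 3): string {
    const kept = [...this.history]
      .filter((r) => r.verdict === "kept")
      .sort((a, b) => b.precision - a.precision)
      .slice(0, topK);
    const rejected = this.history.filter((r) => r.verdict === "rejected");
    return [
      `Best so far: ${kept.map((r) => `${r.prompt} (${r.precision})`).join("; ") || "none"}`,
      `Already rejected (do not re-propose): ${rejected.map((r) => r.prompt).join("; ") || "none"}`,
    ].join("\n");
  }
}
```

The do-not-repeat list is the part that stopped the agent from re-proposing rejected prompts; the bounded top-K is what keeps the context from growing without limit.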
What worked surprisingly well.
Prompt evaluation at scale. The classifier-agent ran thousands of prompt variants against the held-out signal set. It surfaced two prompt rewrites that lifted precision by about 11 points combined. We never would have tested those variants by hand. The full design of those prompts is in our LLM intent classification piece.
API doc digestion. The research-agent reading the Reddit and HN API docs and producing typed TypeScript schemas was a 4-hour task done in 20 minutes. It was wrong in a few places — the Algolia HN API has undocumented quirks — but the human review caught those.
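To make "typed TypeScript schemas" concrete, here is the kind of output that task produces. The field names follow the Algolia HN search API's `hits` objects as we understand them, but treat this as a sketch, not the research-agent's actual output.

```typescript
// Illustrative typed schema for an Algolia HN search hit, plus a
// narrowing parser. Field set is a best-effort reading of the API.
interface HNHit {
  objectID: string;         // numeric id delivered as a string
  title: string | null;     // null for comments
  url: string | null;
  author: string;
  points: number | null;
  num_comments: number | null;
  created_at_i: number;     // unix seconds
}

// Narrow untyped JSON into the schema, rejecting malformed hits
// instead of letting them flow into the classifier.
function parseHit(raw: unknown): HNHit | null {
  if (typeof raw !== "object" || raw === null) return null;
  const r = raw as Record<string, unknown>;
  if (typeof r.objectID !== "string" || typeof r.author !== "string") return null;
  return {
    objectID: r.objectID,
    title: typeof r.title === "string" ? r.title : null,
    url: typeof r.url === "string" ? r.url : null,
    author: r.author,
    points: typeof r.points === "number" ? r.points : null,
    num_comments: typeof r.num_comments === "number" ? r.num_comments : null,
    created_at_i: typeof r.created_at_i === "number" ? r.created_at_i : 0,
  };
}
```

The human-review step mentioned above is exactly where the undocumented quirks get caught: the types the agent proposes are the artifact you review, not the fetch code.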
Schema migrations. Postgres schemas are a well-specified domain. The ops-agent shipped roughly 30 migrations across the build with one defect that required a manual rollback. The error rate on agent-written migrations was lower than ours had historically been.
What we'd do differently.
Start with a tighter blast radius. We let agents touch infra code (Vercel config, environment variables) in week one. That was a mistake. We'd restrict agents to application logic only on the next build until we had a stable test harness.
Build the eval harness on day one. We hand-labeled 50 signals on day 4 and then again 2,000 in week 3. We should have built the labeling UI on day 1 and started accumulating labels from day 2. The classifier improvements would have come weeks earlier.
Separate code-writing and test-writing from the start. We learned this the hard way with the hallucinated API. We'd hardcode role separation from PR #1 next time.
Keep humans in the prompt loop. The classifier agent generated dozens of prompt variants but the good prompts came from humans inspecting the agent's output and seeing what we'd never have written. Agents are great at exploring; humans are better at noticing what's actually clever.
Where this stops working.
Three places we'd hesitate to use the OpenClaw approach.
Domains where the spec is fuzzy. Agents need a clear input and output contract. UI design, brand voice, and "is this a good user experience" are not specifiable enough yet. We did all of those by hand.
Code with security surface. Auth, billing, secret handling, anything that touches PII. The cost of a hallucinated bug here is much higher than the time saved. Hand-write it.
Domains where the failure mode is silent. If an agent ships code that looks right and runs without error but produces subtly wrong results (off-by-one in an analytics aggregation, for example), you'll find out in a customer email three months later. We ship agent code only where the failure mode is loud — exception, test fail, or visible output corruption.
The build-vs-buy math for sales intelligence specifically — should you build something like Shadow Inbox or pay for it — is broken down in our build vs buy piece. And the system architecture itself is detailed in the Reddit lead gen playbook and the enrichment workflows piece.
Where this leaves us.
Shadow Inbox today is mostly human-maintained code with agent-written test scaffolds, schema migrations, and prompt evals. The MVP-week ratio of 60% agent code dropped to roughly 18% in production because the gnarly parts ate the codebase. That's fine. The agent factory got us to working software in a week instead of a month, and gave us a stable enough foundation to harden by hand.
If you're building a complex multi-agent product, OpenClaw or something like it will save you weeks. Just keep the security perimeter human-only and don't let any agent grade its own homework.
● FAQ
- What is OpenClaw, exactly?
- OpenClaw is an AI agent factory — a framework for spinning up specialized agents (research, scrape, classify, ops) that share a common orchestrator and a typed message bus. We used it to coordinate four agents across the Shadow Inbox build instead of writing a monolithic backend. Think of it as Kubernetes for agents but with much more opinionated message contracts.
- How much of the codebase did agents write vs humans?
- Roughly 60 percent of the initial MVP code was agent-generated and human-reviewed. By the production hardening phase, that ratio inverted — humans wrote the gnarly parts (auth, billing, the queue retry logic) and agents handled the boilerplate (CRUD, schema migrations, test scaffolds). The agent contribution was largest where the task was repetitive and the spec was tight.
- What was the worst agent failure mode?
- Hallucinated APIs. The classifier agent at one point invented a Reddit endpoint that doesn't exist (/api/v3/intent_signals) and wrote a perfectly plausible client for it. The integration tests passed because we'd let the agent write the mocks too. We caught it in staging when nothing returned data. The fix was to forbid agents from writing both the integration code and the corresponding tests in the same PR.
- Did agents help with the LLM classifier prompts themselves?
- Yes — and this was the most underrated win. We had an evaluation agent run thousands of prompt variants against a held-out signal set and report precision/recall. It surfaced two prompt rewrites that lifted precision by about 11 points combined. We never would have tested those variants by hand.
- Would we use OpenClaw again for the next product?
- Yes, with a tighter scope. The biggest mistake was letting agents touch infrastructure code in week one. We'd start with agents only on application logic, keep humans in the infra and security loop end-to-end, and expand the agent surface area only after we had a stable test harness.
Three more from the log.

Building vs buying sales intelligence in 2026: the actual build cost
Honest cost breakdown of building sales intelligence in-house vs buying. Scraping infra, LLM APIs, classifier engineering, enrichment, maintenance — actual numbers.
Mar 19, 2026 · 9 min
The AI reply generator dilemma: fast and cheap vs personalized and slow
AI reply generators face a dilemma: fast and cheap is templated-with-extra-steps, slow and personalized is barely better. The third mode no one is building.
Feb 12, 2026 · 13 min
Using LLMs to classify buying intent: what actually works
Classifying buying intent with LLMs is mostly about evidence quotation, temperature pinning, and not asking the model dumb questions. Here's what worked for us.
Dec 11, 2025 · 7 min