Your mini AI Engineer
Bandito connects to your LLM traces and turns them into improvements. The trace logging space is commoditizing — Bandito is the intelligence layer that sits on top.
$ bandito loop run --project my-agent
~ Cost up 23% · Latency up 40%
────────────────────────────────────────
Observing...
1,247 traces · 12,847 spans · last 14 days
────────────────────────────────────────
Judging...
1,247 traces · pass rate: 87%
~ 13% failed · context window errors spiking
────────────────────────────────────────
Analyzing...
gpt-4o 320 calls · 89% quality · $0.028/trace
gpt-4o-mini 927 calls · 82% quality · $0.002/trace
· 60% of traces could use mini at 7% quality loss
· retrieval spans adding 400ms avg
────────────────────────────────────────
Recommendations:
1. Route simple queries to gpt-4o-mini
~60% of traces · save ~$8.20/week
2. Add assertions for max_tokens handling
3. Create eval set from error traces
Run: bandito improve replay --project my-agent

Traces → Insights → Improvements
Bandito runs the same workflow a good engineer runs on Monday morning. It checks your traces, evaluates quality, finds patterns, and recommends fixes.
Each step produces data the next step consumes. No step is useful without the prior step's output.
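The data dependency between steps can be pictured as a simple pipeline. This is an illustrative sketch only; the function and field names (`observe`, `judge`, `analyze`, `recommend`, `trace_id`, `ok`) are invented here, not Bandito's API.

```python
# Illustrative pipeline: each stage consumes the previous stage's output.
# All names are hypothetical, not Bandito's actual API.

def observe(raw_spans):
    """Group raw spans into traces by trace_id."""
    traces = {}
    for span in raw_spans:
        traces.setdefault(span["trace_id"], []).append(span)
    return traces

def judge(traces):
    """Score each trace pass/fail against a rubric (stubbed here)."""
    return {tid: all(s.get("ok", True) for s in spans)
            for tid, spans in traces.items()}

def analyze(verdicts):
    """Turn per-trace verdicts into aggregates the recommender can act on."""
    total = len(verdicts)
    passed = sum(verdicts.values())
    return {"pass_rate": passed / total if total else 0.0}

def recommend(stats):
    """Recommendations only make sense given the analysis output."""
    if stats["pass_rate"] < 0.9:
        return ["create eval set from failing traces"]
    return []

spans = [
    {"trace_id": "a", "ok": True},
    {"trace_id": "b", "ok": False},
]
stats = analyze(judge(observe(spans)))
print(recommend(stats))  # pass rate 0.5, so a recommendation fires
```

Skipping a stage breaks the chain: `recommend` has nothing to act on without `analyze`, and `analyze` has nothing to aggregate without `judge`.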
The loop gets faster each cycle
First cycle: manual grading, rubric writing, calibration iteration. Slow but educational.
Second cycle: the rubric is already calibrated. A judge run on new traces is one command. Analysis is instant.
Ongoing: judge runs automatically. Regressions caught before users notice.
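The ongoing cycle amounts to an automated regression gate on the judged pass rate. A minimal sketch, assuming pass rate is the tracked metric; `regression_gate` and its tolerance are illustrative, not Bandito's actual check:

```python
# Hypothetical regression gate: compare the latest judged pass rate
# against a stored baseline and fail loudly before users notice.

def regression_gate(baseline: float, current: float, tolerance: float = 0.02) -> bool:
    """Return True if the current pass rate is within tolerance of the baseline."""
    return current >= baseline - tolerance

assert regression_gate(0.87, 0.86)      # small dip: within tolerance, passes
assert not regression_gate(0.87, 0.80)  # real regression: gate fails
```

Wired into CI or a deploy hook, a check like this is what turns "judge runs automatically" into "regressions caught before users notice."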
From demo to production in one workflow
# Option 1: Add 5 lines
with bandito.trace("my-bot") as t:
    result = agent.run(query)
    t.done(result)
# Option 2: Use existing Langfuse traces
export BANDITO_STORAGE_BACKEND=langfuse

bandito observe traces --project my-bot
bandito tui
bandito judge run --project my-bot
bandito analyze tradeoffs --project my-bot
bandito improve replay --project my-bot
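For intuition, a context manager like Option 1's can be pictured roughly like this. This is a mental-model sketch under stated assumptions, not Bandito's implementation; `Handle` and `TRACES` are invented names standing in for a real storage backend.

```python
# Rough mental model of a tracing context manager (not the real library).
import time
from contextlib import contextmanager

TRACES = []  # stand-in for a real storage backend

@contextmanager
def trace(project: str):
    record = {"project": project, "start": time.time()}

    class Handle:
        def done(self, result):
            record["result"] = result

    try:
        yield Handle()
    finally:
        # Duration is captured even if the wrapped code raises.
        record["duration"] = time.time() - record["start"]
        TRACES.append(record)

with trace("my-bot") as t:
    t.done("answer")
```

The point of the `with` form is that every run, including failed ones, produces a record: the `finally` block flushes the trace no matter how the agent call exits.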
Ship LLM apps with confidence
Bandito is for developers shipping LLM-powered products to production. It is not another dashboard to check; it acts like an engineer who watches the dashboard, understands what's wrong, tests a fix, and ships it.
- Catch regressions — a judge run on every deployment catches issues the golden dataset missed
- Optimize cost and quality — find the right model/prompt trade-off for your specific use case
- Know when things break — diagnostics alert you to drift before users complain
- Test offline — replay against datasets before deploying, no guesswork
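Offline replay, the last point above, reduces to a small loop: run the current agent against a saved dataset and judge each output before anything ships. A hedged sketch; the `replay` function, the toy agent, and the exact-match judge are all illustrative:

```python
# Illustrative offline replay: agent + dataset + judge, no live traffic.
# Names and the exact-match judge are invented for this sketch.

def replay(agent, dataset, judge):
    results = []
    for case in dataset:
        output = agent(case["input"])
        results.append({"input": case["input"],
                        "passed": judge(output, case["expected"])})
    return results

dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]
agent = lambda q: {"2+2": "4", "capital of France": "Paris"}[q]
judge = lambda out, exp: out == exp

results = replay(agent, dataset, judge)
print(sum(r["passed"] for r in results), "of", len(results), "passed")
```

Swapping in a candidate prompt or model for `agent` and rerunning against the same dataset is what makes the comparison guesswork-free.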