Bandito
Now in early development

Your mini AI Engineer

Bandito connects to your LLM traces and turns them into improvements. The trace logging space is commoditizing — Bandito is the intelligence layer that sits on top.

$ bandito loop run --project my-agent

  ~ Cost up 23% · Latency up 40%
  
  ────────────────────────────────────────
  
  Observing...
    1,247 traces · 12,847 spans · last 14 days
  
  ────────────────────────────────────────
  
  Judging...
    1,247 traces · pass rate: 87%
    ~ 12% failed · context window errors spiking
  
  ────────────────────────────────────────
  
  Analyzing...
    gpt-4o  320 calls · 89% quality · $0.028/trace
    gpt-4o-mini  927 calls · 82% quality · $0.002/trace
    · 60% of traces could use mini at 7% quality loss
    · retrieval spans adding 400ms avg
  
  ────────────────────────────────────────
  
  Recommendations:
    1. Route simple queries to gpt-4o-mini
       ~60% of traces · save ~$8.20/week
    2. Add assertions for max_tokens handling
    3. Create eval set from error traces
  
  Run: bandito improve replay --project my-agent
The Loop

Traces → Insights → Improvements

Bandito automates the workflow a good engineer runs on Monday morning: it checks your traces, evaluates quality, finds patterns, and recommends fixes.

Instrument · 5 lines of code
Observe · CLI + TUI
Grade · Keyboard-driven
Judge · LLM-powered
Analyze · Actionable insights
Improve · Test with confidence

Each step produces data the next step consumes. No step is useful without the prior step's output.
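The chain above can be sketched as a plain data pipeline in which each stage consumes only the prior stage's output. Everything in this sketch (the function names, the record shapes, the stand-in heuristic judge) is illustrative, not Bandito's actual API.

```python
# Illustrative sketch of the loop as a pipeline; not Bandito's real API.

def observe(traces):
    """Collect raw traces into records the grading step can consume."""
    return [{"id": t["id"], "output": t["output"]} for t in traces]

def grade(records, human_labels):
    """Attach the handful of human grades that calibrate the judge."""
    return [{**r, "human": human_labels.get(r["id"])} for r in records]

def judge(records):
    """Score every record; here, a stand-in heuristic for an LLM judge."""
    return [{**r, "score": 0.0 if "error" in r["output"] else 1.0}
            for r in records]

def analyze(records):
    """Turn scores into an aggregate the improve step can act on."""
    passed = sum(r["score"] for r in records)
    return {"pass_rate": passed / len(records), "n": len(records)}

traces = [
    {"id": "t1", "output": "hello"},
    {"id": "t2", "output": "context window error"},
]
# Each stage consumes only the prior stage's output.
report = analyze(judge(grade(observe(traces), {"t1": 1.0})))
```

Removing any stage breaks the ones after it, which is the point: the judge is only trustworthy because grading calibrated it, and analysis is only actionable because the judge scored everything.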

The Flywheel

The loop gets faster each cycle

First cycle: manual grading, rubric writing, calibration iteration. Slow but educational.

Second cycle: the rubric is calibrated. A judge run on new traces is one command. Analysis is instant.

Ongoing: judge runs automatically. Regressions caught before users notice.

More traces → better sampling
15 human grades become 500 judge scores. Every cycle compounds the signal.
Better grades → better judge
Judge disagrees on edge cases? Add examples to rubric. Calibration improves.
Richer data → confident ships
Offline replay shows the change is safe. Deploy. Watch. Loop restarts.
How It Works

From demo to production in one workflow

Instrument in minutes
Add 5 lines to your LLM app, or connect to an existing trace provider like Langfuse. Framework mappers auto-extract spans from Pydantic AI, Anthropic, OpenAI.
# Option 1: Add 5 lines
import bandito

with bandito.trace("my-bot") as t:
    result = agent.run(query)
    t.done(result)

# Option 2: Use existing Langfuse traces (shell)
export BANDITO_STORAGE_BACKEND=langfuse
Observe, grade, judge
See what's happening. Grade with a keystroke. Scale with LLM-as-judge. Find where quality is low and why.
bandito observe traces --project my-bot
bandito tui
bandito judge run --project my-bot
Improve and ship
Test changes offline before deploying. Catch regressions before users see them. Loop restarts automatically.
bandito analyze tradeoffs --project my-bot
bandito improve replay --project my-bot
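Conceptually, an offline replay runs a saved eval set through both the current and the candidate configuration and compares pass rates before anything ships. The sketch below illustrates that idea with hypothetical stand-ins (`run_config`, the lambdas, the 5% tolerance); it is not what `bandito improve replay` executes internally.

```python
# Conceptual sketch of an offline replay; all names here are hypothetical.

def replay(dataset, run_config, passes):
    """Replay each saved input through a config and return the pass rate."""
    results = [passes(run_config(item)) for item in dataset]
    return sum(results) / len(results)

dataset = ["short question", "long multi-step task", "simple lookup"]
baseline = lambda q: {"ok": True}                     # current config
candidate = lambda q: {"ok": "multi-step" not in q}   # cheaper model, one miss
passes = lambda out: out["ok"]

base_rate = replay(dataset, baseline, passes)
cand_rate = replay(dataset, candidate, passes)
safe_to_ship = cand_rate >= base_rate - 0.05  # tolerate a small quality drop
```

Because the comparison happens against recorded traces, a bad candidate fails here instead of in front of users.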
Built for Developers

Ship LLM apps with confidence

Bandito is for developers shipping LLM-powered products to production. It is not another dashboard to check; it is an engineer that watches the dashboard, understands what's wrong, tests a fix, and ships it.

  • Catch regressions — judge runs on every deployment catch issues the golden dataset missed
  • Optimize cost and quality — find the right model/prompt trade-off for your specific use case
  • Know when things break — diagnostics alert you to drift before users complain
  • Test offline — replay against datasets before deploying, no guesswork
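The cost/quality trade-off above often reduces to routing: send simple queries to a cheap model and hard ones to a strong model, as in recommendation 1 of the sample output. This sketch uses an assumed word-count heuristic; a real router would use whatever complexity signal your traces support.

```python
# Illustrative model router; the heuristic and threshold are assumptions.
CHEAP, STRONG = "gpt-4o-mini", "gpt-4o"

def pick_model(query: str) -> str:
    """Send short, single-step queries to the cheap model."""
    looks_simple = len(query.split()) < 30 and "\n" not in query
    return CHEAP if looks_simple else STRONG

model = pick_model("what's our refund policy?")  # → "gpt-4o-mini"
```

Replaying a router like this against recorded traces is how you verify the "save ~$8.20/week" claim before it goes live.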

Get started in minutes

pip install bandito. Instrument your app. Let Bandito do the rest.

pip install bandito