Component Selection
Langfuse: Production traces, datasets, and comparisons
Promptfoo: Prompt and model regression tests
DeepEval: Metrics and judge-based evaluation
CI / Dashboard: Release gates and trend monitoring
The most common failure is a regression that goes unnoticed. The evaluation loop closes that gap by connecting traces, bad cases, datasets, and release gates.
From “feels good” to testable release criteria
Collect calls, tool use, retrieval, and failure paths.
Turn bad cases and high-value requests into reproducible samples.
Evaluate accuracy, faithfulness, format, robustness, and safety.
Block obvious regressions in PR and release workflows.
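The gating step can be sketched in a few lines. This is a minimal illustration, not the API of Langfuse, Promptfoo, or DeepEval: `MetricResult`, `find_regressions`, `gate`, and the 0.02 tolerance are hypothetical names and values chosen for the example, and the scores are made up. It assumes each metric has already been scored on the same dataset for the current release (baseline) and the proposed change (candidate).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricResult:
    """One metric scored on the same dataset for two versions (hypothetical type)."""
    name: str         # e.g. "faithfulness"
    baseline: float   # score of the current release
    candidate: float  # score of the proposed change


def find_regressions(results, tolerance=0.02):
    """Return the metrics whose candidate score dropped by more than `tolerance`."""
    return [r for r in results if r.baseline - r.candidate > tolerance]


def gate(results, tolerance=0.02):
    """Return an exit code: 0 to allow the release, 1 to block it."""
    regressions = find_regressions(results, tolerance)
    for r in regressions:
        print(f"REGRESSION {r.name}: {r.baseline:.2f} -> {r.candidate:.2f}")
    return 1 if regressions else 0


# Illustrative numbers only, not measured results.
example_results = [
    MetricResult("accuracy", baseline=0.91, candidate=0.92),
    MetricResult("faithfulness", baseline=0.88, candidate=0.79),
    MetricResult("format", baseline=0.99, candidate=0.99),
]
```

In a PR or release workflow, a script could end with `sys.exit(gate(results))` so that a non-zero exit code fails the job. The tolerance is a design choice: too tight and noise from judge-based metrics blocks releases, too loose and real regressions slip through.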