Evaluations Overview
Evaluations define what "success" looks like for your AI application. They're the criteria Empromptu uses to measure performance and guide optimization.
What you'll learn ⏱️ 5 minutes
How to create effective evaluation criteria
When to use manual vs automatic evaluation creation
How to manage and organize your evaluations
Best practices for writing evaluation criteria
How evaluations impact optimization results
What Are Evaluations?
Evaluations are specific, measurable criteria that define good performance for your AI application. Instead of subjective judgment, evaluations provide objective benchmarks for optimization.
Example: Review Summarizer Evaluations
"Correct Sequence": Information appears in logical order that reflects the structure of the input
"Accurate Details": All extracted details match what was present in the original text
"Complete Summary": The summary captures all key points without omitting important information
"Appropriate Length": Summary length matches the specified requirements
Each evaluation gets scored individually, contributing to your overall accuracy metrics.
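For illustration, the evaluations above can be thought of as simple name/criteria pairs. The sketch below uses a hypothetical Python structure for thinking through your own criteria; it is not Empromptu's internal format.

```python
# Hypothetical representation of the review-summarizer evaluations above.
# Empromptu's actual storage format may differ; this is only for illustration.
review_summarizer_evaluations = [
    {"name": "Correct Sequence",
     "criteria": "Information appears in logical order that reflects the structure of the input"},
    {"name": "Accurate Details",
     "criteria": "All extracted details match what was present in the original text"},
    {"name": "Complete Summary",
     "criteria": "The summary captures all key points without omitting important information"},
    {"name": "Appropriate Length",
     "criteria": "Summary length matches the specified requirements"},
]
```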
Creating Evaluations
You have two options for creating evaluations:
Automatic Generation
Best for: Getting started quickly with proven evaluation criteria.
How it works:
Click "Add Evaluation"
Select "Generate Automatically"
Empromptu analyzes your task and creates relevant evaluations
Review and activate the generated criteria
Benefits:
Fast setup
Proven criteria based on similar use cases
Good starting point for further customization
Manual Creation
Best for: Specific requirements or fine-tuned control over success criteria.
Process:
Click "Add Evaluation"
Select "Create Manually"
Write your evaluation name and criteria
Test with sample inputs
Activate when satisfied
Benefits:
Complete control over criteria
Task-specific requirements
Custom business logic
Writing Effective Evaluation Criteria
Be Specific and Measurable
❌ Poor: "Output should be good"
✅ Good: "Summary should include all product features mentioned in the review"
❌ Poor: "Response should be helpful"
✅ Good: "Response should provide at least 2 actionable solutions to the customer's problem"
Focus on Observable Outcomes
Your criteria should describe what you can clearly identify in the output (a quick spot-check sketch follows these lists):
Format requirements:
"Information appears in logical sequence"
"Response includes exactly 3 bullet points"
"Output contains no more than 150 words"
Content requirements:
"All bugs mentioned in input appear in the output"
"Summary preserves technical terminology from original text"
"Response addresses the specific question asked"
Quality requirements:
"Extracted details match what was in the original text"
"No hallucinated information appears in the response"
"Tone remains professional and helpful"
Examples by Use Case
Data Extraction Applications
"Complete Extraction": "All contact information present in the document appears in the structured output"
"Accurate Formatting": "Phone numbers follow (XXX) XXX-XXXX format"
"No Duplication": "Each piece of information appears only once in the output"
Customer Support Applications
"Question Recognition": "Response demonstrates understanding of the customer's specific issue"
"Solution Provided": "Response includes at least one actionable step to resolve the problem"
"Appropriate Escalation": "Complex technical issues are escalated to human agents"
Content Generation Applications
"Brand Voice": "Content matches the company's established tone and style"
"Factual Accuracy": "All claims in the content can be verified from provided sources"
"Target Length": "Content falls within specified word count requirements"
Managing Your Evaluations
Evaluation Status
Each evaluation can be Active or Inactive:
Active evaluations:
Used in optimization scoring
Contribute to overall accuracy metrics
Guide automatic optimization decisions
Inactive evaluations:
Saved for future use
Don't affect current scoring
Can be reactivated when needed
Organizing Evaluations
Best practices:
Group related criteria: Keep similar evaluations together
Use clear naming: Make evaluation purposes obvious from the name
Regular review: Deactivate outdated criteria, add new ones as needs evolve
Test thoroughly: Verify evaluations work as expected before activating
Evaluation Actions
For each evaluation, you can:
Activate/Deactivate: Toggle whether it's used in scoring
Modify: Edit criteria and descriptions
Delete: Remove evaluations you no longer need
Duplicate: Create variations of existing criteria
How Evaluations Impact Optimization
Scoring Process
When your AI application processes an input:
Output generated: Your AI creates a response
Evaluations applied: Each active evaluation scores the output
Individual scores calculated: Each evaluation gets a 0-10 score
Overall score computed: Average of all evaluation scores (sketched after these steps)
Results logged: Scores and reasoning saved to Event Log
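A minimal sketch of the arithmetic behind steps 3-4, assuming three active evaluations; the scoring itself happens inside Empromptu.

```python
# Each active evaluation produces a 0-10 score; the overall score is their average.
def overall_score(evaluation_scores: dict[str, float]) -> float:
    return sum(evaluation_scores.values()) / len(evaluation_scores)

scores = {"Correct Sequence": 8, "Accurate Details": 6, "Complete Summary": 9}
print(round(overall_score(scores), 2))  # 7.67, the value recorded in the Event Log
```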
Optimization Guidance
Evaluations guide both automatic and manual optimization:
Automatic optimization:
Focuses on improving the lowest-scoring evaluations
Creates Prompt Family variations to handle different criteria
Prioritizes changes that improve overall evaluation performance
Manual optimization:
Shows which specific evaluations need attention
Helps you target optimization efforts effectively
Provides clear metrics for measuring improvement
Score Interpretation
Individual evaluation scores follow the same 0-10 scale (a small band-mapping helper is sketched after this list):
0-3: Evaluation criteria not met
4-6: Partially meets criteria
7-8: Meets criteria well
9-10: Exceeds criteria expectations
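If you review raw scores in the Event Log, a small helper like the one below maps a score to the bands above. It is purely illustrative, not an Empromptu API.

```python
# Map a 0-10 evaluation score to the interpretation bands above.
def interpret_score(score: float) -> str:
    if score <= 3:
        return "Criteria not met"
    if score <= 6:
        return "Partially meets criteria"
    if score <= 8:
        return "Meets criteria well"
    return "Exceeds criteria expectations"
```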
Common Evaluation Patterns
Accuracy-Focused Evaluations
Purpose: Ensure factual correctness and completeness
"Extracted Complete Bug Set": "All bugs mentioned also appear in the output""Accurate Details": "All extracted details were present and correct in the original text" "No Hallucination": "Output contains no information not found in the input"Format-Focused Evaluations
Purpose: Ensure consistent structure and presentation
"Correct Sequence": "Information appears in logical order""Proper Structure": "Output follows the specified template format""Length Requirements": "Response length falls within specified range"Quality-Focused Evaluations
Purpose: Measure overall usefulness and appropriateness
"Addresses Question": "Response directly answers what was asked""Professional Tone": "Language is appropriate for business communication""Actionable Content": "Provides specific steps user can take"Best Practices
Start Simple
Begin with 3-5 core evaluations covering your most important requirements
Test with sample inputs to ensure they work as expected
Run initial optimization to see how they perform
Add more specific criteria as you identify areas for improvement
Balance Coverage and Focus
Cover key areas: Accuracy, format, and quality
Avoid redundancy: Don't create overlapping evaluations
Prioritize impact: Focus on criteria that matter most to your users
Iterate Based on Results
Monitor evaluation performance in your Event Log
Identify consistently low-scoring criteria for revision
Add new evaluations when you discover edge cases
Remove or modify evaluations that aren't providing value
Test Thoroughly
Use diverse inputs when testing evaluation criteria
Check edge cases to ensure evaluations work reliably
Verify scoring matches your expectations
Get feedback from team members on evaluation clarity
Troubleshooting Evaluations
Low Scores Across All Evaluations
Possible causes:
Evaluations too strict
Prompts need optimization
Input quality issues
Solutions:
Review and adjust evaluation criteria
Run prompt optimization
Improve input examples
Inconsistent Evaluation Results
Possible causes:
Vague or subjective criteria
Evaluations testing multiple things at once
Solutions:
Make criteria more specific and measurable
Split complex evaluations into separate criteria (see the example below)
Add examples of what success looks like
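For example, a single evaluation that mixes accuracy and formatting can be split into two narrower criteria. The sketch below reuses the hypothetical name/criteria structure from earlier on this page.

```python
# One vague, multi-part evaluation split into two specific ones
# (same hypothetical structure as earlier; not Empromptu's internal format).
before = {
    "name": "Good Summary",
    "criteria": "Summary is accurate and well formatted",
}

after = [
    {"name": "Accurate Details",
     "criteria": "All extracted details match what was present in the original text"},
    {"name": "Proper Structure",
     "criteria": "Output follows the specified template format"},
]
```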
High Scores But Poor Real-World Performance
Possible causes:
Evaluations don't match real use cases
Missing important criteria
Solutions:
Review actual end-user inputs and outputs
Add evaluations based on real-world requirements
Test with more diverse input examples
Next Steps
Now that you understand evaluations:
Set up Prompt Optimization: Use your evaluations to improve performance
Learn about Edge Case Detection: Find inputs that score poorly on your evaluations
Monitor End User Inputs: See how your evaluations perform with real data
Understand Model Optimization: Test which AI models perform best on your evaluations