Evaluations Overview

Evaluations define what "success" looks like for your AI application. They're the criteria Empromptu uses to measure performance and guide optimization.

What you'll learn (⏱️ 5 minutes)

  • How to create effective evaluation criteria

  • When to use manual vs automatic evaluation creation

  • How to manage and organize your evaluations

  • Best practices for writing evaluation criteria

  • How evaluations impact optimization results

What Are Evaluations?

Evaluations are specific, measurable criteria that define good performance for your AI application. Instead of subjective judgment, evaluations provide objective benchmarks for optimization.

Example: Review Summarizer Evaluations

  • "Correct Sequence": Information appears in logical order that reflects the structure of the input

  • "Accurate Details": All extracted details match what was present in the original text

  • "Complete Summary": The summary captures all key points without omitting important information

  • "Appropriate Length": Summary length matches the specified requirements

Each evaluation gets scored individually, contributing to your overall accuracy metrics.
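
If it helps to picture evaluations as data, here is a minimal sketch in Python (a hypothetical structure for illustration, not Empromptu's internal schema) of the review-summarizer criteria above as name-plus-criteria pairs:

  from dataclasses import dataclass

  @dataclass
  class Evaluation:
      # One success criterion: a short name plus a specific, measurable description.
      name: str
      criteria: str
      active: bool = True

  # The review-summarizer example above, expressed as data.
  review_summarizer_evals = [
      Evaluation("Correct Sequence", "Information appears in logical order reflecting the input structure"),
      Evaluation("Accurate Details", "All extracted details match what was present in the original text"),
      Evaluation("Complete Summary", "All key points are captured without omitting important information"),
      Evaluation("Appropriate Length", "Summary length matches the specified requirements"),
  ]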

Creating Evaluations

You have two options for creating evaluations:

Automatic Generation

Best for: Getting started quickly with proven evaluation criteria.

How it works:

  1. Click "Add Evaluation"

  2. Select "Generate Automatically"

  3. Empromptu analyzes your task and creates relevant evaluations

  4. Review and activate the generated criteria

Benefits:

  • Fast setup

  • Proven criteria based on similar use cases

  • Good starting point for further customization

Manual Creation

Best for: Specific requirements or fine-tuned control over success criteria.

Process:

  1. Click "Add Evaluation"

  2. Select "Create Manually"

  3. Write your evaluation name and criteria

  4. Test with sample inputs

  5. Activate when satisfied

Benefits:

  • Complete control over criteria

  • Task-specific requirements

  • Custom business logic

Writing Effective Evaluation Criteria

Be Specific and Measurable

Poor: "Output should be good" ✅ Good: "Summary should include all product features mentioned in the review"

Poor: "Response should be helpful" ✅ Good: "Response should provide at least 2 actionable solutions to the customer's problem"

Focus on Observable Outcomes

Your criteria should describe what you can clearly identify in the output (a short sketch of such checks follows these lists):

Format requirements:

  • "Information appears in logical sequence"

  • "Response includes exactly 3 bullet points"

  • "Output contains no more than 150 words"

Content requirements:

  • "All bugs mentioned in input appear in the output"

  • "Summary preserves technical terminology from original text"

  • "Response addresses the specific question asked"

Quality requirements:

  • "Extracted details match what was in the original text"

  • "No hallucinated information appears in the response"

  • "Tone remains professional and helpful"

Examples by Use Case

Data Extraction Applications

  • "Complete Extraction": "All contact information present in the document appears in the structured output"

  • "Accurate Formatting": "Phone numbers follow (XXX) XXX-XXXX format"

  • "No Duplication": "Each piece of information appears only once in the output"

Customer Support Applications

  • "Question Recognition": "Response demonstrates understanding of the customer's specific issue"

  • "Solution Provided": "Response includes at least one actionable step to resolve the problem"

  • "Appropriate Escalation": "Complex technical issues are escalated to human agents"

Content Generation Applications

  • "Brand Voice": "Content matches the company's established tone and style"

  • "Factual Accuracy": "All claims in the content can be verified from provided sources"

  • "Target Length": "Content falls within specified word count requirements"

Managing Your Evaluations

Evaluation Status

Each evaluation can be Active or Inactive:

Active evaluations:

  • Used in optimization scoring

  • Contribute to overall accuracy metrics

  • Guide automatic optimization decisions

Inactive evaluations:

  • Saved for future use

  • Don't affect current scoring

  • Can be reactivated when needed

Organizing Evaluations

Best practices:

  • Group related criteria: Keep similar evaluations together

  • Use clear naming: Make evaluation purposes obvious from the name

  • Regular review: Deactivate outdated criteria, add new ones as needs evolve

  • Test thoroughly: Verify evaluations work as expected before activating

Evaluation Actions

For each evaluation, you can:

  • Activate/Deactivate: Toggle whether it's used in scoring

  • Modify: Edit criteria and descriptions

  • Delete: Remove evaluations you no longer need

  • Duplicate: Create variations of existing criteria

How Evaluations Impact Optimization

Scoring Process

When your AI application processes an input (a simplified sketch of the score roll-up follows these steps):

  1. Output generated: Your AI creates a response

  2. Evaluations applied: Each active evaluation scores the output

  3. Individual scores calculated: Each evaluation gets a 0-10 score

  4. Overall score computed: Average of all evaluation scores

  5. Results logged: Scores and reasoning saved to Event Log
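
Conceptually, the roll-up in steps 3 and 4 is just an average over the active evaluations. A simplified sketch (hypothetical names, not Empromptu's actual scoring code):

  def overall_score(evaluation_scores: dict[str, float]) -> float:
      # Average the 0-10 scores from all active evaluations into one overall score.
      if not evaluation_scores:
          raise ValueError("At least one active evaluation is required")
      return sum(evaluation_scores.values()) / len(evaluation_scores)

  # Example: scores one output might receive from four active evaluations.
  scores = {"Correct Sequence": 8, "Accurate Details": 6, "Complete Summary": 9, "Appropriate Length": 7}
  print(overall_score(scores))  # 7.5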

Optimization Guidance

Evaluations guide both automatic and manual optimization:

Automatic optimization:

  • Focuses on improving the lowest-scoring evaluations

  • Creates Prompt Family variations to handle different criteria

  • Prioritizes changes that improve overall evaluation performance

Manual optimization:

  • Shows which specific evaluations need attention

  • Helps you target optimization efforts effectively

  • Provides clear metrics for measuring improvement

Score Interpretation

Individual evaluation scores follow the same 0-10 scale:

  • 0-3: Evaluation criteria not met

  • 4-6: Partially meets criteria

  • 7-8: Meets criteria well

  • 9-10: Exceeds criteria expectations
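
When reviewing scores in the Event Log, the same bands apply to every evaluation. A tiny helper (illustrative only) that maps a numeric score to the bands above:

  def interpret_score(score: float) -> str:
      # Map a 0-10 evaluation score to the interpretation bands above.
      if score <= 3:
          return "Criteria not met"
      if score <= 6:
          return "Partially meets criteria"
      if score <= 8:
          return "Meets criteria well"
      return "Exceeds criteria expectations"

  print(interpret_score(7))  # Meets criteria well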

Common Evaluation Patterns

Accuracy-Focused Evaluations

Purpose: Ensure factual correctness and completeness

"Extracted Complete Bug Set": "All bugs mentioned also appear in the output""Accurate Details": "All extracted details were present and correct in the original text"  "No Hallucination": "Output contains no information not found in the input"

Format-Focused Evaluations

Purpose: Ensure consistent structure and presentation

"Correct Sequence": "Information appears in logical order""Proper Structure": "Output follows the specified template format""Length Requirements": "Response length falls within specified range"

Quality-Focused Evaluations

Purpose: Measure overall usefulness and appropriateness

"Addresses Question": "Response directly answers what was asked""Professional Tone": "Language is appropriate for business communication""Actionable Content": "Provides specific steps user can take"

Best Practices

Start Simple

  1. Begin with 3-5 core evaluations covering your most important requirements

  2. Test with sample inputs to ensure they work as expected

  3. Run initial optimization to see how they perform

  4. Add more specific criteria as you identify areas for improvement

Balance Coverage and Focus

  • Cover key areas: Accuracy, format, and quality

  • Avoid redundancy: Don't create overlapping evaluations

  • Prioritize impact: Focus on criteria that matter most to your users

Iterate Based on Results

  • Monitor evaluation performance in your Event Log

  • Identify consistently low-scoring criteria for revision

  • Add new evaluations when you discover edge cases

  • Remove or modify evaluations that aren't providing value

Test Thoroughly

  • Use diverse inputs when testing evaluation criteria

  • Check edge cases to ensure evaluations work reliably

  • Verify scoring matches your expectations

  • Get feedback from team members on evaluation clarity

Troubleshooting Evaluations

Low Scores Across All Evaluations

Possible causes:

  • Evaluations too strict

  • Prompts need optimization

  • Input quality issues

Solutions:

  • Review and adjust evaluation criteria

  • Run prompt optimization

  • Improve input examples

Inconsistent Evaluation Results

Possible causes:

  • Vague or subjective criteria

  • Evaluations testing multiple things at once

Solutions:

  • Make criteria more specific and measurable

  • Split complex evaluations into separate criteria

  • Add examples of what success looks like

High Scores But Poor Real-World Performance

Possible causes:

  • Evaluations don't match real use cases

  • Missing important criteria

Solutions:

  • Review actual end-user inputs and outputs

  • Add evaluations based on real-world requirements

  • Test with more diverse input examples

Next Steps

Now that you understand evaluations:

  • Set up Prompt Optimization: Use your evaluations to improve performance

  • Learn about Edge Case Detection: Find inputs that score poorly on your evaluations

  • Monitor End User Inputs: See how your evaluations perform with real data

  • Understand Model Optimization: Test which AI models perform best on your evaluations
