Evaluations Overview

Evaluations define what "success" looks like for your AI application. They're the criteria Empromptu uses to measure performance and guide optimization.

What you'll learn (⏱️ 5 minutes)

  • How to create effective evaluation criteria

  • When to use manual vs automatic evaluation creation

  • How to manage and organize your evaluations

  • Best practices for writing evaluation criteria

  • How evaluations impact optimization results

What Are Evaluations?

Evaluations are specific, measurable criteria that define good performance for your AI application. Instead of subjective judgment, evaluations provide objective benchmarks for optimization.

Example: Review Summarizer Evaluations

  • "Correct Sequence": Information appears in logical order that reflects the structure of the input

  • "Accurate Details": All extracted details match what was present in the original text

  • "Complete Summary": The summary captures all key points without omitting important information

  • "Appropriate Length": Summary length matches the specified requirements

Each evaluation gets scored individually, contributing to your overall accuracy metrics.
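
If it helps to picture evaluations as data, here is a minimal sketch in Python (a hypothetical structure for illustration, not Empromptu's internal schema) of the review-summarizer criteria above as name-plus-criteria pairs:

  from dataclasses import dataclass

  @dataclass
  class Evaluation:
      # One success criterion: a short name plus a specific, measurable description.
      name: str
      criteria: str
      active: bool = True

  # The review-summarizer example above, expressed as data.
  review_summarizer_evals = [
      Evaluation("Correct Sequence", "Information appears in logical order reflecting the input structure"),
      Evaluation("Accurate Details", "All extracted details match what was present in the original text"),
      Evaluation("Complete Summary", "All key points are captured without omitting important information"),
      Evaluation("Appropriate Length", "Summary length matches the specified requirements"),
  ]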

Creating Evaluations

You have two options for creating evaluations:

Automatic Generation

Best for: Getting started quickly with proven evaluation criteria.

How it works:

  1. Click "Add Evaluation"

  2. Select "Generate Automatically"

  3. Empromptu analyzes your task and creates relevant evaluations

  4. Review and activate the generated criteria

Benefits:

  • Fast setup

  • Proven criteria based on similar use cases

  • Good starting point for further customization

Manual Creation

Best for: Specific requirements or fine-tuned control over success criteria.

Process:

  1. Click "Add Evaluation"

  2. Select "Create Manually"

  3. Write your evaluation name and criteria

  4. Test with sample inputs

  5. Activate when satisfied

Benefits:

  • Complete control over criteria

  • Task-specific requirements

  • Custom business logic

Writing Effective Evaluation Criteria

Be Specific and Measurable

Poor: "Output should be good" ✅ Good: "Summary should include all product features mentioned in the review"

Poor: "Response should be helpful" ✅ Good: "Response should provide at least 2 actionable solutions to the customer's problem"

Focus on Observable Outcomes

Your criteria should describe what you can clearly identify in the output (a short sketch of such checks follows these lists):

Format requirements:

  • "Information appears in logical sequence"

  • "Response includes exactly 3 bullet points"

  • "Output contains no more than 150 words"

Content requirements:

  • "All bugs mentioned in input appear in the output"

  • "Summary preserves technical terminology from original text"

  • "Response addresses the specific question asked"

Quality requirements:

  • "Extracted details match what was in the original text"

  • "No hallucinated information appears in the response"

  • "Tone remains professional and helpful"

Examples by Use Case

Data Extraction Applications

  • "Complete Extraction": "All contact information present in the document appears in the structured output"

  • "Accurate Formatting": "Phone numbers follow (XXX) XXX-XXXX format"

  • "No Duplication": "Each piece of information appears only once in the output"

Customer Support Applications

  • "Question Recognition": "Response demonstrates understanding of the customer's specific issue"

  • "Solution Provided": "Response includes at least one actionable step to resolve the problem"

  • "Appropriate Escalation": "Complex technical issues are escalated to human agents"

Content Generation Applications

  • "Brand Voice": "Content matches the company's established tone and style"

  • "Factual Accuracy": "All claims in the content can be verified from provided sources"

  • "Target Length": "Content falls within specified word count requirements"

Managing Your Evaluations

Evaluation Status

Each evaluation can be Active or Inactive:

Active evaluations:

  • Used in optimization scoring

  • Contribute to overall accuracy metrics

  • Guide automatic optimization decisions

Inactive evaluations:

  • Saved for future use

  • Don't affect current scoring

  • Can be reactivated when needed

Organizing Evaluations

Best practices:

  • Group related criteria: Keep similar evaluations together

  • Use clear naming: Make evaluation purposes obvious from the name

  • Regular review: Deactivate outdated criteria, add new ones as needs evolve

  • Test thoroughly: Verify evaluations work as expected before activating

Evaluation Actions

For each evaluation, you can:

  • Activate/Deactivate: Toggle whether it's used in scoring

  • Modify: Edit criteria and descriptions

  • Delete: Remove evaluations you no longer need

  • Duplicate: Create variations of existing criteria

How Evaluations Impact Optimization

Scoring Process

When your AI application processes an input (a simplified sketch of the score roll-up follows these steps):

  1. Output generated: Your AI creates a response

  2. Evaluations applied: Each active evaluation scores the output

  3. Individual scores calculated: Each evaluation gets a 0-10 score

  4. Overall score computed: Average of all evaluation scores

  5. Results logged: Scores and reasoning saved to Event Log
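
Conceptually, the roll-up in steps 3 and 4 is just an average over the active evaluations. A simplified sketch (hypothetical names, not Empromptu's actual scoring code):

  def overall_score(evaluation_scores: dict[str, float]) -> float:
      # Average the 0-10 scores from all active evaluations into one overall score.
      if not evaluation_scores:
          raise ValueError("At least one active evaluation is required")
      return sum(evaluation_scores.values()) / len(evaluation_scores)

  # Example: scores one output might receive from four active evaluations.
  scores = {"Correct Sequence": 8, "Accurate Details": 6, "Complete Summary": 9, "Appropriate Length": 7}
  print(overall_score(scores))  # 7.5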

Optimization Guidance

Evaluations guide both automatic and manual optimization:

Automatic optimization:

  • Focuses on improving the lowest-scoring evaluations

  • Creates Prompt Family variations to handle different criteria

  • Prioritizes changes that improve overall evaluation performance

Manual optimization:

  • Shows which specific evaluations need attention

  • Helps you target optimization efforts effectively

  • Provides clear metrics for measuring improvement

Score Interpretation

Individual evaluation scores follow the same 0-10 scale:

  • 0-3: Evaluation criteria not met

  • 4-6: Partially meets criteria

  • 7-8: Meets criteria well

  • 9-10: Exceeds criteria expectations
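
When reviewing scores in the Event Log, the same bands apply to every evaluation. A tiny helper (illustrative only) that maps a numeric score to the bands above:

  def interpret_score(score: float) -> str:
      # Map a 0-10 evaluation score to the interpretation bands above.
      if score <= 3:
          return "Criteria not met"
      if score <= 6:
          return "Partially meets criteria"
      if score <= 8:
          return "Meets criteria well"
      return "Exceeds criteria expectations"

  print(interpret_score(7))  # Meets criteria well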

Common Evaluation Patterns

Accuracy-Focused Evaluations

Purpose: Ensure factual correctness and completeness

"Extracted Complete Bug Set": "All bugs mentioned also appear in the output""Accurate Details": "All extracted details were present and correct in the original text"  "No Hallucination": "Output contains no information not found in the input"

Format-Focused Evaluations

Purpose: Ensure consistent structure and presentation

"Correct Sequence": "Information appears in logical order""Proper Structure": "Output follows the specified template format""Length Requirements": "Response length falls within specified range"

Quality-Focused Evaluations

Purpose: Measure overall usefulness and appropriateness

"Addresses Question": "Response directly answers what was asked""Professional Tone": "Language is appropriate for business communication""Actionable Content": "Provides specific steps user can take"

Best Practices

Start Simple

  1. Begin with 3-5 core evaluations covering your most important requirements

  2. Test with sample inputs to ensure they work as expected

  3. Run initial optimization to see how they perform

  4. Add more specific criteria as you identify areas for improvement

Balance Coverage and Focus

  • Cover key areas: Accuracy, format, and quality

  • Avoid redundancy: Don't create overlapping evaluations

  • Prioritize impact: Focus on criteria that matter most to your users

Iterate Based on Results

  • Monitor evaluation performance in your Event Log

  • Identify consistently low-scoring criteria for revision

  • Add new evaluations when you discover edge cases

  • Remove or modify evaluations that aren't providing value

Test Thoroughly

  • Use diverse inputs when testing evaluation criteria

  • Check edge cases to ensure evaluations work reliably

  • Verify scoring matches your expectations

  • Get feedback from team members on evaluation clarity

Troubleshooting Evaluations

Low Scores Across All Evaluations

Possible causes:

  • Evaluations too strict

  • Prompts need optimization

  • Input quality issues

Solutions:

  • Review and adjust evaluation criteria

  • Run prompt optimization

  • Improve input examples

Inconsistent Evaluation Results

Possible causes:

  • Vague or subjective criteria

  • Evaluations testing multiple things at once

Solutions:

  • Make criteria more specific and measurable

  • Split complex evaluations into separate criteria

  • Add examples of what success looks like

High Scores But Poor Real-World Performance

Possible causes:

  • Evaluations don't match real use cases

  • Missing important criteria

Solutions:

  • Review actual end-user inputs and outputs

  • Add evaluations based on real-world requirements

  • Test with more diverse input examples

Next Steps

Now that you understand evaluations:

  • Set up Prompt Optimization: Use your evaluations to improve performance

  • Learn about Edge Case Detection: Find inputs that score poorly on your evaluations

  • Monitor End User Inputs: See how your evaluations perform with real data

  • Understand Model Optimization: Test which AI models perform best on your evaluations
