Understanding Accuracy
Empromptu uses a 0-10 scoring system to measure how well your AI applications perform tasks. Understanding these accuracy metrics is essential for optimizing your applications for production readiness.
What you'll learn ⏱️ 5 minutes
How to define accuracy
How Empromptu's 0-10 accuracy scoring works
What different score ranges mean for your application
The difference between Initial and Current accuracy
How scores are calculated and updated
What accuracy levels you need for production deployment
How to interpret score improvements
How to define accuracy
To define your task's accuracy, first set an evaluation. The evaluation defines the goal; accuracy then measures how successfully your inputs and prompts achieve that goal.
The 0-10 Accuracy Scale
Empromptu measures accuracy using a 0-10 point scale where 10 represents perfect performance. Every task, evaluation, and optimization attempt receives a score within this range.
Score Ranges and Meanings:
🔴 0-3: Low Score (Needs Improvement)
Significant issues with output quality
Frequent errors or irrelevant responses
Not suitable for production use
Requires immediate optimization attention
🟠 4-6: Medium Score (Getting Better)
Acceptable performance but inconsistent
Some correct outputs mixed with errors
May work for internal testing but risky for production
Good foundation for optimization improvements
🔵 7-8: Good Score (Production Ready)
Reliable performance for most inputs
Occasional edge case issues but generally solid
Suitable for production deployment with monitoring
Meets business requirements for most use cases
🟢 9-10: Excellent Score (Optimal Performance)
Consistently high-quality outputs
Handles edge cases well
Exceeds business requirements
Ideal for critical business applications
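Since the ranges above are simple thresholds, mapping a score to its label is a one-function check. A minimal sketch in Python; the function name and boundary handling for fractional scores (e.g. 3.5) are illustrative assumptions, not part of Empromptu's API:

```python
def score_range(score: float) -> str:
    """Map a 0-10 accuracy score to its range label.

    Fractional scores are assigned to the nearest band by simple
    thresholding (an assumption; the docs only define whole-number bands).
    """
    if not 0.0 <= score <= 10.0:
        raise ValueError("accuracy scores must fall within 0-10")
    if score < 4.0:   # 0-3: significant quality issues
        return "Low Score (Needs Improvement)"
    if score < 7.0:   # 4-6: acceptable but inconsistent
        return "Medium Score (Getting Better)"
    if score < 9.0:   # 7-8: reliable, production ready
        return "Good Score (Production Ready)"
    return "Excellent Score (Optimal Performance)"  # 9-10
```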
Types of Accuracy Measurements
Project-Level Accuracy
Displayed on your project dashboard:

Average Initial Accuracy: Mean of all task initial scores in the project
Average Current Accuracy: Mean of all task current scores after optimization
Improvement Tracking: Shows overall project optimization progress
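The dashboard averages above can be reproduced from task-level scores. A hedged sketch, assuming each task record exposes initial and current values (the field names here are illustrative, not Empromptu's actual schema):

```python
def project_accuracy(tasks: list[dict]) -> dict:
    """Compute dashboard-style project averages from per-task scores.

    Each task dict is assumed to hold "initial" and "current" 0-10 scores.
    """
    if not tasks:
        raise ValueError("project has no tasks to average")
    avg_initial = sum(t["initial"] for t in tasks) / len(tasks)
    avg_current = sum(t["current"] for t in tasks) / len(tasks)
    return {
        "average_initial_accuracy": round(avg_initial, 1),
        "average_current_accuracy": round(avg_current, 1),
        "improvement": round(avg_current - avg_initial, 1),
    }
```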
Task-Level Accuracy
Shown in the tasks table:

Initial Accuracy: First score when the task runs through optimization
Current Accuracy: Latest score after optimization attempts
Improvement: Change from initial to current (+/- value)
How Accuracy Scores Are Calculated
Evaluation-Based Scoring
Your accuracy score is calculated based on active evaluations:
Each evaluation gets scored individually (0-10)
Individual scores are averaged together
Overall score represents combined evaluation performance
Score reasoning explains which evaluations passed/failed
Example Calculation:
Task has 3 active evaluations:
- "Correct Sequence": 8.0
- "Accurate Details": 6.5
- "Complete Summary": 7.5
Overall Score: (8.0 + 6.5 + 7.5) ÷ 3 = 7.3
Score Reasoning Example:
"extracted_completeness - AI response captures the essence and key emotional elements. Summary captures key experiential aspects while maintaining vivid language."
Score: 7.000
This explains exactly why the score was assigned and what criteria were evaluated.
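The averaging in the example above can be written out directly. A small sketch using the evaluation names from the example, rounding to one decimal place as the displayed score suggests:

```python
# Individual evaluation scores from the example calculation.
evaluations = {
    "Correct Sequence": 8.0,
    "Accurate Details": 6.5,
    "Complete Summary": 7.5,
}

# The overall score is the plain mean of the individual scores.
overall = round(sum(evaluations.values()) / len(evaluations), 1)
print(overall)  # 7.3
```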
Initial vs Current Accuracy
Initial Accuracy
When it's set: The first time your task runs through optimization
What it represents: Baseline performance before any improvements
Typical range: Often 3.0-6.0 for new tasks
Purpose: Establishes starting point for measuring improvement
Current Accuracy
When it updates: After each optimization attempt
What it represents: Latest performance level achieved
Expected progression: Should increase over time with optimization
Target range: 7.0+ for production readiness
Improvement Tracking
Calculation: Current Accuracy - Initial Accuracy = Improvement
Examples:
Initial: 4.5, Current: 7.8, Improvement: +3.3 ✅ Excellent progress
Initial: 6.0, Current: 5.8, Improvement: -0.2 ⚠️ Needs attention
Initial: 3.2, Current: 8.1, Improvement: +4.9 🎉 Outstanding optimization
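The delta and status flags above follow a simple rule. A hedged sketch; the status labels and the 7.0 production threshold mirror this page's guidance, but the exact wording is illustrative:

```python
def improvement(initial: float, current: float) -> str:
    """Format the improvement delta with a simple status flag.

    Status labels are illustrative; the 7.0+ cutoff follows the
    production-readiness target described in this guide.
    """
    delta = round(current - initial, 1)
    if delta < 0:
        status = "needs attention"
    elif current >= 7.0:
        status = "production ready"
    else:
        status = "improving"
    return f"{delta:+.1f} ({status})"
```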
What Different Scores Mean for Business
Production Readiness Guidelines:
Score 9.0+: Deploy with Confidence
Excellent for customer-facing applications
Suitable for critical business processes
Minimal monitoring required
Can handle high-volume usage
Score 7.0-8.9: Production Ready
Good for most business applications
Recommended for customer-facing use
Monitor performance and optimize over time
Suitable for moderate to high-volume usage
Score 5.0-6.9: Internal Use Only
Acceptable for internal tools and testing
Not recommended for customer-facing applications
Requires active optimization and monitoring
Good for pilot programs and validation
Score Below 5.0: Development Only
Not suitable for production deployment
Focus on optimization before considering deployment
Use for testing and development purposes only
Indicates need for significant improvement
Score Improvement Strategies
For Low Scores (0-3):
Primary focus: Fix fundamental issues
Review and improve evaluation criteria
Add more representative test inputs
Use automatic optimization to establish baseline
Check if task requirements are too complex
For Medium Scores (4-6):
Primary focus: Systematic optimization
Build out Prompt Families with specialized prompts
Use Edge Case Detection to find problem areas
Test different AI models for better performance
Add more specific evaluation criteria
For Good Scores (7-8):
Primary focus: Fine-tuning and edge cases
Use manual optimization for specific improvements
Monitor end-user inputs for new edge cases
Optimize for consistency across input types
Focus on business-critical evaluation criteria
For Excellent Scores (9-10):
Primary focus: Maintain and monitor
Monitor for performance degradation over time
Add new evaluations as requirements evolve
Use as baseline for similar tasks
Focus optimization efforts on other tasks
Common Score Patterns
Typical Optimization Journey:
Initial Build → Score: N/A
First Optimization → Score: 4.5 (baseline established)
Automatic Optimization → Score: 6.8 (significant improvement)
Manual Refinement → Score: 7.9 (production ready)
Edge Case Fixes → Score: 8.4 (optimized)
Warning Patterns:
Declining scores over time: May indicate changing requirements or new edge cases
Plateauing scores: Optimization strategy may need adjustment
High variance: Inconsistent performance suggests need for better Prompt Families
Using Scores for Decision Making
Development Decisions:
Score below 5.0: Continue optimization before deployment
Score 5.0-6.9: Consider internal pilot testing
Score 7.0+: Proceed with production deployment planning
Optimization Priorities:
Focus on lowest-scoring tasks first for maximum impact
Address tasks with declining scores to prevent issues
Optimize high-volume tasks to improve overall metrics
Business Communication:
Use score improvements to demonstrate AI initiative success
Set score targets for business stakeholders (e.g., "achieve 7.5+ before launch")
Track score trends to show continuous improvement
Troubleshooting Accuracy Issues
Scores Not Updating:
Check: Task is active and optimization is running
Solution: Ensure evaluations are active and inputs are available
Inconsistent Score Ranges:
Check: Evaluation criteria clarity and representativeness
Solution: Review and refine evaluation definitions
Scores Lower Than Expected:
Check: Task complexity vs evaluation criteria alignment
Solution: Simplify task scope or adjust evaluation expectations
Cannot Achieve High Scores:
Check: Input quality and evaluation criteria realism
Solution: Add better test inputs and review evaluation criteria
Best Practices for Accuracy Management
Set Realistic Targets:
New tasks: Target 6.0+ for initial success
Production tasks: Aim for 7.5+ for reliability
Critical tasks: Strive for 8.5+ for excellence
Monitor Continuously:
Check scores weekly for production tasks
Review trends monthly for optimization planning
Investigate drops immediately to prevent issues
Document Learning:
Track what works for achieving high scores
Note optimization strategies that deliver results
Share successful approaches across tasks and projects
Next Steps
Now that you understand accuracy scoring:
Start Optimizing: Learn how to improve your accuracy scores systematically
Set Up Evaluations: Create criteria that drive meaningful accuracy measurements
Use Task Actions: Access the tools you need to improve performance
Learn Prompt Optimization: Master the core technology for accuracy improvement