
Understanding Accuracy

Written by Ana
Updated over 5 months ago

Empromptu uses a 0-10 scoring system to measure how well your AI applications perform. Understanding these accuracy metrics is essential for optimizing your applications for production readiness.

What you'll learn ⏱️ 5 minutes

  • How Empromptu's 0-10 accuracy scoring works

  • What different score ranges mean for your application

  • The difference between Initial and Current accuracy

  • How scores are calculated and updated

  • What accuracy levels you need for production deployment

  • How to interpret score improvements

The 0-10 Accuracy Scale

Empromptu measures accuracy using a 0-10 point scale where 10 represents perfect performance. Every task, evaluation, and optimization attempt receives a score within this range.

Score Ranges and Meanings:

🔴 0-3: Low Score (Needs Improvement)

  • Significant issues with output quality

  • Frequent errors or irrelevant responses

  • Not suitable for production use

  • Requires immediate optimization attention

🟠 4-6: Medium Score (Getting Better)

  • Acceptable performance but inconsistent

  • Some correct outputs mixed with errors

  • May work for internal testing but risky for production

  • Good foundation for optimization improvements

🔵 7-8: Good Score (Production Ready)

  • Reliable performance for most inputs

  • Occasional edge case issues but generally solid

  • Suitable for production deployment with monitoring

  • Meets business requirements for most use cases

🟢 9-10: Excellent Score (Optimal Performance)

  • Consistently high-quality outputs

  • Handles edge cases well

  • Exceeds business requirements

  • Ideal for critical business applications
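The four ranges above can be sketched as a small helper function. This is an illustrative sketch, not part of Empromptu's API; the boundary choices for fractional scores (e.g. treating 3.5 as "Low") are an assumption.

```python
def score_range(score: float) -> str:
    """Map a 0-10 accuracy score to its range label (illustrative sketch)."""
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    if score < 4:   # 0-3: significant quality issues
        return "Low Score (Needs Improvement)"
    if score < 7:   # 4-6: acceptable but inconsistent
        return "Medium Score (Getting Better)"
    if score < 9:   # 7-8: reliable, production ready
        return "Good Score (Production Ready)"
    return "Excellent Score (Optimal Performance)"  # 9-10
```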

Types of Accuracy Measurements

Project-Level Accuracy

Displayed on your project dashboard:

Average Initial Accuracy: 0.00%
Average Current Accuracy: 0.00%
Total Tasks: 2

Average Initial Accuracy: Mean of all task initial scores in the project
Average Current Accuracy: Mean of all task current scores after optimization
Improvement Tracking: Shows overall project optimization progress

Task-Level Accuracy

Shown in the tasks table:

Initial Accuracy: First score when the task runs through optimization
Current Accuracy: Latest score after optimization attempts
Improvement: Change from initial to current (+/- value)

Event-Level Accuracy

Individual optimization attempts in the Event Log:

  • Each API call receives a specific 0-10 score

  • Shows performance for individual inputs

  • Includes detailed score reasoning

  • Tracks optimization progress over time

How Accuracy Scores Are Calculated

Evaluation-Based Scoring

Your accuracy score is calculated based on active evaluations:

  1. Each evaluation gets scored individually (0-10)

  2. Individual scores are averaged together

  3. Overall score represents combined evaluation performance

  4. Score reasoning explains which evaluations passed/failed

Example Calculation:

Task has 3 active evaluations:
- "Correct Sequence": 8.0
- "Accurate Details": 6.5
- "Complete Summary": 7.5

Overall Score: (8.0 + 6.5 + 7.5) ÷ 3 = 7.3
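The averaging above can be sketched in Python. The evaluation names and one-decimal rounding come from the example; the function name `overall_score` is illustrative.

```python
def overall_score(evaluation_scores: dict[str, float]) -> float:
    """Average individual 0-10 evaluation scores into one task score."""
    if not evaluation_scores:
        raise ValueError("at least one active evaluation is required")
    return round(sum(evaluation_scores.values()) / len(evaluation_scores), 1)

scores = {
    "Correct Sequence": 8.0,
    "Accurate Details": 6.5,
    "Complete Summary": 7.5,
}
print(overall_score(scores))  # 7.3
```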

Score Reasoning Example:

"extracted_completeness - AI response captures the essence and key emotional elements. Summary captures key experiential aspects while maintaining vivid language."
Score: 7.000

This explains exactly why the score was assigned and what criteria were evaluated.

Initial vs Current Accuracy

Initial Accuracy

When it's set: The first time your task runs through optimization
What it represents: Baseline performance before any improvements
Typical range: Often 3.0-6.0 for new tasks
Purpose: Establishes starting point for measuring improvement

Current Accuracy

When it updates: After each optimization attempt
What it represents: Latest performance level achieved
Expected progression: Should increase over time with optimization
Target range: 7.0+ for production readiness

Improvement Tracking

Calculation: Current Accuracy - Initial Accuracy = Improvement

Examples:

  • Initial: 4.5, Current: 7.8, Improvement: +3.3 ✅ Excellent progress

  • Initial: 6.0, Current: 5.8, Improvement: -0.2 ⚠️ Needs attention

  • Initial: 3.2, Current: 8.1, Improvement: +4.9 🎉 Outstanding optimization
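The improvement arithmetic can be sketched as a one-line helper; the signed one-decimal formatting mirrors the +/- values shown in the tasks table, and the function name is illustrative.

```python
def improvement(initial: float, current: float) -> str:
    """Format the change from initial to current accuracy as a signed delta."""
    return f"{current - initial:+.1f}"

print(improvement(4.5, 7.8))  # +3.3
print(improvement(6.0, 5.8))  # -0.2
```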

What Different Scores Mean for Business

Production Readiness Guidelines:

Score 9.0+: Deploy with Confidence

  • Excellent for customer-facing applications

  • Suitable for critical business processes

  • Minimal monitoring required

  • Can handle high-volume usage

Score 7.0-8.9: Production Ready

  • Good for most business applications

  • Recommended for customer-facing use

  • Monitor performance and optimize over time

  • Suitable for moderate to high-volume usage

Score 5.0-6.9: Internal Use Only

  • Acceptable for internal tools and testing

  • Not recommended for customer-facing applications

  • Requires active optimization and monitoring

  • Good for pilot programs and validation

Score Below 5.0: Development Only

  • Not suitable for production deployment

  • Focus on optimization before considering deployment

  • Use for testing and development purposes only

  • Indicates need for significant improvement

Score Improvement Strategies

For Low Scores (0-3):

Primary focus: Fix fundamental issues

  • Review and improve evaluation criteria

  • Add more representative test inputs

  • Use automatic optimization to establish baseline

  • Check if task requirements are too complex

For Medium Scores (4-6):

Primary focus: Systematic optimization

  • Build out Prompt Families with specialized prompts

  • Use Edge Case Detection to find problem areas

  • Test different AI models for better performance

  • Add more specific evaluation criteria

For Good Scores (7-8):

Primary focus: Fine-tuning and edge cases

  • Use manual optimization for specific improvements

  • Monitor end-user inputs for new edge cases

  • Optimize for consistency across input types

  • Focus on business-critical evaluation criteria

For Excellent Scores (9-10):

Primary focus: Maintain and monitor

  • Monitor for performance degradation over time

  • Add new evaluations as requirements evolve

  • Use as baseline for similar tasks

  • Focus optimization efforts on other tasks

Common Score Patterns

Typical Optimization Journey:

Initial Build → Score: N/A
First Optimization → Score: 4.5 (baseline established)
Automatic Optimization → Score: 6.8 (significant improvement)
Manual Refinement → Score: 7.9 (production ready)
Edge Case Fixes → Score: 8.4 (optimized)

Warning Patterns:

Declining scores over time: May indicate changing requirements or new edge cases
Plateauing scores: Optimization strategy may need adjustment
High variance: Inconsistent performance suggests need for better Prompt Families

Using Scores for Decision Making

Development Decisions:

  • Score below 5.0: Continue optimization before deployment

  • Score 5.0-6.9: Consider internal pilot testing

  • Score 7.0+: Proceed with production deployment planning
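The development-decision thresholds above could be encoded as a small lookup, for example in a CI gate or dashboard script. This is a sketch; the function name and returned phrases are taken from the guidelines, not from any Empromptu API.

```python
def deployment_decision(score: float) -> str:
    """Suggest a next step based on a task's current accuracy score."""
    if score < 5.0:
        return "Continue optimization before deployment"
    if score < 7.0:
        return "Consider internal pilot testing"
    return "Proceed with production deployment planning"
```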

Optimization Priorities:

  • Focus on lowest-scoring tasks first for maximum impact

  • Address tasks with declining scores to prevent issues

  • Optimize high-volume tasks to improve overall metrics

Business Communication:

  • Use score improvements to demonstrate AI initiative success

  • Set score targets for business stakeholders (e.g., "achieve 7.5+ before launch")

  • Track score trends to show continuous improvement

Troubleshooting Accuracy Issues

Scores Not Updating:

Check: Task is active and optimization is running
Solution: Ensure evaluations are active and inputs are available

Inconsistent Score Ranges:

Check: Evaluation criteria clarity and representativeness
Solution: Review and refine evaluation definitions

Scores Lower Than Expected:

Check: Task complexity vs evaluation criteria alignment
Solution: Simplify task scope or adjust evaluation expectations

Cannot Achieve High Scores:

Check: Input quality and evaluation criteria realism
Solution: Add better test inputs and review evaluation criteria

Best Practices for Accuracy Management

Set Realistic Targets:

  • New tasks: Target 6.0+ for initial success

  • Production tasks: Aim for 7.5+ for reliability

  • Critical tasks: Strive for 8.5+ for excellence

Monitor Continuously:

  • Check scores weekly for production tasks

  • Review trends monthly for optimization planning

  • Investigate drops immediately to prevent issues

Document Learning:

  • Track what works for achieving high scores

  • Note optimization strategies that deliver results

  • Share successful approaches across tasks and projects

Next Steps

Now that you understand accuracy scoring:

  • Start Optimizing: Learn how to improve your accuracy scores systematically

  • Set Up Evaluations: Create criteria that drive meaningful accuracy measurements

  • Use Task Actions: Access the tools you need to improve performance

  • Learn Prompt Optimization: Master the core technology for accuracy improvement
