Understanding Accuracy

Empromptu uses a 0-10 scoring system to measure how well your AI applications perform tasks. Understanding these accuracy metrics is essential for optimizing your applications for production readiness.

What you'll learn ⏱️ 5 minutes

  • How to define accuracy

  • How Empromptu's 0-10 accuracy scoring works

  • What different score ranges mean for your application

  • The difference between Initial and Current accuracy

  • How scores are calculated and updated

  • What accuracy levels you need for production deployment

  • How to interpret score improvements

How to define accuracy

To define task accuracy, you first set an evaluation: a goal the task should achieve. Accuracy then measures how successfully your inputs and prompts achieve that goal.

The 0-10 Accuracy Scale

Empromptu measures accuracy using a 0-10 point scale where 10 represents perfect performance. Every task, evaluation, and optimization attempt receives a score within this range.

Score Ranges and Meanings:

🔴 0-3: Low Score (Needs Improvement)

  • Significant issues with output quality

  • Frequent errors or irrelevant responses

  • Not suitable for production use

  • Requires immediate optimization attention

🟠 4-6: Medium Score (Getting Better)

  • Acceptable performance but inconsistent

  • Some correct outputs mixed with errors

  • May work for internal testing but risky for production

  • Good foundation for optimization improvements

🔵 7-8: Good Score (Production Ready)

  • Reliable performance for most inputs

  • Occasional edge case issues but generally solid

  • Suitable for production deployment with monitoring

  • Meets business requirements for most use cases

🟢 9-10: Excellent Score (Optimal Performance)

  • Consistently high-quality outputs

  • Handles edge cases well

  • Exceeds business requirements

  • Ideal for critical business applications

Types of Accuracy Measurements

Project-Level Accuracy

Displayed on your project dashboard:

Average Initial Accuracy: Mean of all task initial scores in the project

Average Current Accuracy: Mean of all task current scores after optimization

Improvement Tracking: Shows overall project optimization progress

Task-Level Accuracy

Shown in the tasks table:

Initial Accuracy: First score when the task first runs through optimization

Current Accuracy: Latest score after optimization attempts

Improvement: Change from initial to current (+/- value)

How Accuracy Scores Are Calculated

Evaluation-Based Scoring

Your accuracy score is calculated based on active evaluations:

  1. Each evaluation gets scored individually (0-10)

  2. Individual scores are averaged together

  3. Overall score represents combined evaluation performance

  4. Score reasoning explains which evaluations passed/failed

Example Calculation:

Task has 3 active evaluations:

  • "Correct Sequence": 8.0

  • "Accurate Details": 6.5

  • "Complete Summary": 7.5

Overall Score: (8.0 + 6.5 + 7.5) ÷ 3 = 7.3
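The same calculation in code, a minimal sketch using the evaluation names and scores from the example above:

```python
# Individual scores from each active evaluation (0-10 each).
evaluation_scores = {
    "Correct Sequence": 8.0,
    "Accurate Details": 6.5,
    "Complete Summary": 7.5,
}

# The overall score is the plain average of the individual scores.
overall = sum(evaluation_scores.values()) / len(evaluation_scores)
print(round(overall, 1))  # → 7.3
```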

Score Reasoning Example:

"extracted_completeness - AI response captures the essence and key emotional elements. Summary captures key experiential aspects while maintaining vivid language."

Score: 7.000

This explains exactly why the score was assigned and what criteria were evaluated.

Initial vs Current Accuracy

Initial Accuracy

When it's set: The first time your task runs through optimization

What it represents: Baseline performance before any improvements

Typical range: Often 3.0-6.0 for new tasks

Purpose: Establishes starting point for measuring improvement

Current Accuracy

When it updates: After each optimization attempt

What it represents: Latest performance level achieved

Expected progression: Should increase over time with optimization

Target range: 7.0+ for production readiness

Improvement Tracking

Calculation: Current Accuracy - Initial Accuracy = Improvement

Examples:

  • Initial: 4.5, Current: 7.8, Improvement: +3.3 ✅ Excellent progress

  • Initial: 6.0, Current: 5.8, Improvement: -0.2 ⚠️ Needs attention

  • Initial: 3.2, Current: 8.1, Improvement: +4.9 🎉 Outstanding optimization
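The improvement figures above reduce to a single subtraction. A sketch, with `improvement` as a hypothetical helper (not an Empromptu API) and scores rounded to one decimal, matching how the doc reports them:

```python
def improvement(initial: float, current: float) -> float:
    """Improvement is simply current accuracy minus initial accuracy."""
    return round(current - initial, 1)

print(improvement(4.5, 7.8))  # → 3.3 (excellent progress)
print(improvement(6.0, 5.8))  # → -0.2 (needs attention)
print(improvement(3.2, 8.1))  # → 4.9 (outstanding optimization)
```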

What Different Scores Mean for Business

Production Readiness Guidelines:

Score 9.0+: Deploy with Confidence

  • Excellent for customer-facing applications

  • Suitable for critical business processes

  • Minimal monitoring required

  • Can handle high-volume usage

Score 7.0-8.9: Production Ready

  • Good for most business applications

  • Recommended for customer-facing use

  • Monitor performance and optimize over time

  • Suitable for moderate to high-volume usage

Score 5.0-6.9: Internal Use Only

  • Acceptable for internal tools and testing

  • Not recommended for customer-facing applications

  • Requires active optimization and monitoring

  • Good for pilot programs and validation

Score Below 5.0: Development Only

  • Not suitable for production deployment

  • Focus on optimization before considering deployment

  • Use for testing and development purposes only

  • Indicates need for significant improvement
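The deployment guidelines above can be expressed as a threshold check. Again a sketch only: `deployment_tier` is a hypothetical helper whose tier names follow the headings in this section.

```python
def deployment_tier(score: float) -> str:
    """Map a current accuracy score to the deployment guidance above."""
    if score >= 9.0:
        return "Deploy with Confidence"
    if score >= 7.0:
        return "Production Ready"
    if score >= 5.0:
        return "Internal Use Only"
    return "Development Only"
```

For example, a task scoring 7.3 would land in the "Production Ready" tier, while a 4.8 stays in "Development Only".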

Score Improvement Strategies

For Low Scores (0-3):

Primary focus: Fix fundamental issues

  • Review and improve evaluation criteria

  • Add more representative test inputs

  • Use automatic optimization to establish baseline

  • Check if task requirements are too complex

For Medium Scores (4-6):

Primary focus: Systematic optimization

  • Build out Prompt Families with specialized prompts

  • Use Edge Case Detection to find problem areas

  • Test different AI models for better performance

  • Add more specific evaluation criteria

For Good Scores (7-8):

Primary focus: Fine-tuning and edge cases

  • Use manual optimization for specific improvements

  • Monitor end-user inputs for new edge cases

  • Optimize for consistency across input types

  • Focus on business-critical evaluation criteria

For Excellent Scores (9-10):

Primary focus: Maintain and monitor

  • Monitor for performance degradation over time

  • Add new evaluations as requirements evolve

  • Use as baseline for similar tasks

  • Focus optimization efforts on other tasks

Common Score Patterns

Typical Optimization Journey:

Initial Build → Score: N/A

First Optimization → Score: 4.5 (baseline established)

Automatic Optimization → Score: 6.8 (significant improvement)

Manual Refinement → Score: 7.9 (production ready)

Edge Case Fixes → Score: 8.4 (optimized)

Warning Patterns:

Declining scores over time: May indicate changing requirements or new edge cases

Plateauing scores: Optimization strategy may need adjustment

High variance: Inconsistent performance suggests need for better Prompt Families

Using Scores for Decision Making

Development Decisions:

  • Score below 5.0: Continue optimization before deployment

  • Score 5.0-6.9: Consider internal pilot testing

  • Score 7.0+: Proceed with production deployment planning

Optimization Priorities:

  • Focus on lowest-scoring tasks first for maximum impact

  • Address tasks with declining scores to prevent issues

  • Optimize high-volume tasks to improve overall metrics

Business Communication:

  • Use score improvements to demonstrate AI initiative success

  • Set score targets for business stakeholders (e.g., "achieve 7.5+ before launch")

  • Track score trends to show continuous improvement

Troubleshooting Accuracy Issues

Scores Not Updating:

Check: Task is active and optimization is running

Solution: Ensure evaluations are active and inputs are available

Inconsistent Score Ranges:

Check: Evaluation criteria clarity and representativeness

Solution: Review and refine evaluation definitions

Scores Lower Than Expected:

Check: Task complexity vs evaluation criteria alignment

Solution: Simplify task scope or adjust evaluation expectations

Cannot Achieve High Scores:

Check: Input quality and evaluation criteria realism

Solution: Add better test inputs and review evaluation criteria

Best Practices for Accuracy Management

Set Realistic Targets:

  • New tasks: Target 6.0+ for initial success

  • Production tasks: Aim for 7.5+ for reliability

  • Critical tasks: Strive for 8.5+ for excellence

Monitor Continuously:

  • Check scores weekly for production tasks

  • Review trends monthly for optimization planning

  • Investigate drops immediately to prevent issues

Document Learning:

  • Track what works for achieving high scores

  • Note optimization strategies that deliver results

  • Share successful approaches across tasks and projects

Next Steps

Now that you understand accuracy scoring:

  • Start Optimizing: Learn how to improve your accuracy scores systematically

  • Set Up Evaluations: Create criteria that drive meaningful accuracy measurements

  • Use Task Actions: Access the tools you need to improve performance

  • Learn Prompt Optimization: Master the core technology for accuracy improvement
