
Evaluator


The Evaluator is an autonomous service in the Templar framework that continuously assesses model performance by running standardized benchmark tasks on the latest checkpoints. It operates independently from miners and validators, providing objective measurements of model capabilities throughout the training process.

Sources: scripts/evaluator.py:1-40

The Evaluator performs the following key functions:

  1. Monitors for new model checkpoints by window number
  2. Downloads and loads the latest checkpoint
  3. Executes standardized benchmark tasks (e.g., ARC, MMLU, Winogrande)
  4. Logs evaluation results to InfluxDB for monitoring and analysis
  5. Manages resources efficiently for continuous operation

Unlike miners that generate gradients or validators that assess gradient quality, the Evaluator focuses exclusively on measuring the end-to-end capabilities of the current model against established benchmarks.

flowchart TD
    subgraph "Evaluator Service"
        Monitor["Monitor Checkpoints"] --> Detect["Detect New Checkpoint"]
        Detect --> Load["Load Latest Model"]
        Load --> Evaluate["Run Benchmarks"]
        Evaluate --> Log["Log Results"]
        Log --> Cleanup["Clean Resources"]
        Cleanup --> Monitor
    end

    subgraph "External Components"
        R2["Checkpoint Storage"]
        InfluxDB["InfluxDB Metrics"]
        LMEval["LM Evaluation Harness"]
    end

    Load <--> R2
    Evaluate <--> LMEval
    Log --> InfluxDB

Sources: scripts/evaluator.py:125-147, scripts/evaluator.py:559-568
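The cycle pictured above can be sketched in Python as follows. This is a minimal sketch, not the actual implementation in scripts/evaluator.py: the `run()` and `_evaluate()` names mirror the class described later, while `get_latest_checkpoint_window()` and the polling details are illustrative assumptions.

```python
# Hypothetical sketch of the Evaluator's monitor -> evaluate -> clean up loop.
import asyncio


class EvaluatorSketch:
    def __init__(self, eval_interval: int = 600):
        self.eval_interval = eval_interval   # seconds between checks (--eval_interval)
        self.last_eval_window = 0            # last checkpoint window that was evaluated

    async def run(self) -> None:
        while True:
            window = await self.get_latest_checkpoint_window()
            if window is not None and window > self.last_eval_window:
                await self._evaluate(window)         # download, benchmark, log, clean up
                self.last_eval_window = window
            await asyncio.sleep(self.eval_interval)  # wait before polling again

    async def get_latest_checkpoint_window(self):
        ...  # assumed helper: query checkpoint storage via the Comms module

    async def _evaluate(self, window: int) -> None:
        ...  # load model, run lm-eval, log results to InfluxDB, free resources
```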

The Evaluator operates as a standalone service that interacts with the Templar checkpoint storage system and metrics infrastructure. It leverages the Language Model Evaluation Harness (lm-eval) to execute standardized benchmarks.

classDiagram
    class Evaluator {
        -config: bt.Config
        -netuid: int
        -model: LlamaForCausalLM
        -metrics_logger: MetricsLogger
        -last_eval_window: int
        -comms: tplr.comms.Comms
        +__init__()
        +update_state()
        +load_latest_model()
        +_run_lm_eval()
        +_process_results()
        +_evaluate()
        +run()
        +cleanup()
    }

    class Comms {
        +get_latest_checkpoint()
        +get_commitments()
        +update_peers_with_buckets()
        +get_start_window()
    }

    class MetricsLogger {
        +log()
    }

    class LlamaForCausalLM {
        +save_pretrained()
        +load_state_dict()
    }

    Evaluator --> Comms : uses
    Evaluator --> MetricsLogger : logs metrics via
    Evaluator --> LlamaForCausalLM : evaluates

Sources: scripts/evaluator.py:125-198

The Evaluator continuously monitors the blockchain for the latest checkpoint by window number. When a new checkpoint is detected, it downloads and loads the model weights.

sequenceDiagram
    participant Evaluator
    participant Comms as "Comms Module"
    participant R2 as "R2 Storage"
    participant Model as "LLaMA Model"
    
    Evaluator->>Comms: update_state()
    Evaluator->>Comms: get_start_window()
    Comms-->>Evaluator: current_window
    
    alt new checkpoint available
        Evaluator->>Comms: get_latest_checkpoint(version)
        Comms->>R2: fetch latest checkpoint
        R2-->>Comms: checkpoint data
        Comms-->>Evaluator: checkpoint_data, metadata
        
        Evaluator->>Model: load_state_dict(checkpoint_data["model_state_dict"])
        Evaluator->>Evaluator: Store momentum data
    else no new checkpoint
        Evaluator->>Evaluator: Wait for next interval
    end

Sources: scripts/evaluator.py:211-282, tests/test_evaluator.py:59-83
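The checkpoint-loading step can be pictured with the hedged sketch below. The `get_latest_checkpoint()` call and the `model_state_dict` key follow the sequence diagram; the return shape, the `version` argument, and the metadata field names are assumptions.

```python
# Illustrative sketch of checkpoint detection and loading; not the exact code
# in scripts/evaluator.py.
async def load_latest_model(evaluator) -> bool:
    """Load the newest checkpoint into evaluator.model; return True if one was loaded."""
    result = await evaluator.comms.get_latest_checkpoint(version=evaluator.version)
    if result is None:
        return False  # no checkpoint available yet; wait for the next interval

    checkpoint_data, metadata = result
    checkpoint_window = metadata.get("window", 0)  # assumed metadata field
    if checkpoint_window <= evaluator.last_eval_window:
        return False  # this window has already been evaluated

    # Apply the stored weights to the in-memory LLaMA model.
    evaluator.model.load_state_dict(checkpoint_data["model_state_dict"])
    evaluator.last_eval_window = checkpoint_window
    return True
```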

When a new checkpoint is loaded, the Evaluator executes a series of benchmark tasks using the LM Evaluation Harness (lm-eval). Different tasks can be configured and executed on a schedule.

flowchart TD
    subgraph "Evaluator._evaluate()"
        LoadModel["Load Latest Model"] --> SaveTemp["Save Model to Temp Location"]
        SaveTemp --> RunTasks["Run Benchmark Tasks"]
        RunTasks --> ProcessResults["Process Results"]
        ProcessResults --> CleanupTemp["Clean Up Temp Files"]
    end
    
    subgraph "Benchmark Tasks"
        RegularTasks["Regular Tasks\narc_challenge, arc_easy\npiqa, winogrande, etc."]
        MMLU["MMLU (Full)\nPeriodically on 4th run"]
    end
    
    RunTasks --> RegularTasks
    RunTasks --> MMLU
    
    RegularTasks --> |"_run_lm_eval()"| LMEval["lm-eval Command"]
    MMLU --> |"_run_lm_eval() with few-shot"| LMEval
    
    LMEval --> Results["Results JSON"]
    Results --> |"_process_results()"| Metrics["Parse Metrics"]
    Metrics --> InfluxDB["Log to InfluxDB"]

Sources: scripts/evaluator.py:446-557
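A rough Python sketch of this `_evaluate()` flow follows. The temporary-directory handling, the `eval_counter` attribute, and the helper signatures are assumptions based on the flowchart and the task table below, not the actual implementation.

```python
# Save the model to a temp directory, run benchmarks via lm-eval, process the
# results, then clean up. Helper names mirror the flowchart; arguments are assumed.
import shutil
import tempfile


def evaluate_checkpoint(evaluator, window: int) -> None:
    model_dir = tempfile.mkdtemp(prefix="evaluator_model_")
    try:
        # Export the in-memory model so lm-eval can load it with --model hf.
        evaluator.model.save_pretrained(model_dir)

        # Regular zero-shot tasks on every run (assumed signature/return value).
        results_path = evaluator._run_lm_eval(model_dir, tasks=evaluator.config.tasks)
        evaluator._process_results(results_path, window)

        # Full MMLU with few-shot prompting only on every 4th run (assumed counter).
        if evaluator.eval_counter % 4 == 0:
            mmlu_path = evaluator._run_lm_eval(model_dir, tasks="mmlu", num_fewshot=5)
            evaluator._process_results(mmlu_path, window)
    finally:
        shutil.rmtree(model_dir, ignore_errors=True)  # free disk space for the next cycle
```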

After each benchmark run, the Evaluator processes the results by:

  1. Parsing the JSON output from lm-eval
  2. Extracting relevant metrics (accuracy scores)
  3. Logging both individual task results and summary metrics to InfluxDB

The system prioritizes certain metrics in a specific order (e.g., acc_norm over acc) based on their relevance and reliability.

Sources: scripts/evaluator.py:338-441
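As a minimal sketch, assuming lm-eval's standard results JSON layout (a top-level "results" mapping of task name to metric values), the metric-priority selection could look like this; the exact key names used by the Evaluator are assumptions.

```python
import json
from pathlib import Path

# Preference order when several accuracy metrics are reported for a task
# (acc_norm is preferred over acc, as described above).
METRIC_PRIORITY = ("acc_norm,none", "acc_norm", "acc,none", "acc")


def parse_lm_eval_results(results_file: Path) -> dict[str, float]:
    """Return {task_name: score} using the highest-priority metric available."""
    payload = json.loads(results_file.read_text())
    scores: dict[str, float] = {}
    for task, metrics in payload.get("results", {}).items():
        for key in METRIC_PRIORITY:
            if key in metrics:
                scores[task] = float(metrics[key])
                break
    return scores
```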

The Evaluator supports multiple benchmark tasks that assess different capabilities of the language model:

| Task | Description | Metric | Execution Mode |
|------|-------------|--------|----------------|
| arc_challenge | AI2 Reasoning Challenge (hard) | acc/acc_norm | Zero-shot |
| arc_easy | AI2 Reasoning Challenge (easy) | acc/acc_norm | Zero-shot |
| winogrande | Winograd Schema Challenge | acc/acc_norm | Zero-shot |
| piqa | Physical Interaction QA | acc/acc_norm | Zero-shot |
| hellaswag | Commonsense NLI | acc/acc_norm | Zero-shot |
| openbookqa | Open Book Question Answering | acc/acc_norm | Zero-shot |
| mmlu | Massive Multitask Language Understanding | acc/acc_norm | Both zero-shot and 5-shot |

For MMLU specifically, the Evaluator can run it in two modes:

  • Regular zero-shot evaluation with other tasks
  • Periodic 5-shot evaluation (every 4th run) with different configuration

Sources: scripts/evaluator.py:89-94, scripts/evaluator.py:493-545

The Evaluator runs benchmarks by:

  1. Saving the model to a temporary location
  2. Executing the lm-eval command with appropriate parameters
  3. Collecting and parsing results from the output JSON
flowchart LR
    subgraph "Evaluator._run_lm_eval"
        BuildCmd["Build lm-eval Command"] --> RunCmd["Execute Command"]
        RunCmd --> TrackTime["Track Runtime"]
        TrackTime --> ReturnResults["Return Results"]
    end
    
    subgraph "Command Parameters"
        Model["--model hf"]
        ModelArgs["--model_args pretrained=path"]
        Tasks["--tasks task1,task2"]
        Device["--device cuda:x"]
        BatchSize["--batch_size 8"]
        Output["--output_path path"]
        FewShot["--num_fewshot (optional)"]
        Limit["--limit (optional)"]
    end
    
    BuildCmd --> Model
    BuildCmd --> ModelArgs
    BuildCmd --> Tasks
    BuildCmd --> Device
    BuildCmd --> BatchSize
    BuildCmd --> Output
    BuildCmd --> FewShot
    BuildCmd --> Limit

Sources: scripts/evaluator.py:284-336
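The command construction can be illustrated with a hedged subprocess wrapper. The flags match those in the flowchart (and are standard lm-eval CLI options); the function itself, its defaults, and the timing logic are illustrative rather than the actual `_run_lm_eval()` implementation.

```python
import subprocess
import time


def run_lm_eval(model_path: str, tasks: str, output_path: str,
                device: str = "cuda:7", batch_size: int = 8,
                num_fewshot: int | None = None, limit: float | None = None) -> float:
    """Build and execute an lm-eval command; return the wall-clock runtime."""
    cmd = [
        "lm-eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_path}",
        "--tasks", tasks,
        "--device", device,
        "--batch_size", str(batch_size),
        "--output_path", output_path,
    ]
    if num_fewshot is not None:
        cmd += ["--num_fewshot", str(num_fewshot)]   # e.g. periodic 5-shot MMLU runs
    if limit is not None:
        cmd += ["--limit", str(limit)]               # optional cap on samples per task

    start = time.time()
    subprocess.run(cmd, check=True)                  # raises if lm-eval exits non-zero
    return time.time() - start                       # runtime, later logged to InfluxDB
```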

The Evaluator provides several configuration options to customize its behavior:

| Parameter | Description | Default |
|-----------|-------------|---------|
| --netuid | Bittensor network UID | 3 |
| --actual_batch_size | Evaluation batch size | 8 |
| --device | Device to use for evaluation | cuda:7 |
| --tasks | Comma-separated list of tasks | arc_challenge,arc_easy,openbookqa,winogrande,piqa,hellaswag,mmlu |
| --checkpoint_path | Path to save/load checkpoints | checkpoints/ |
| --eval_interval | Seconds between evaluations | 600 (10 minutes) |
| --uid | Override the wallet's UID | None |
| --skip-gaps | Skip gaps in the evaluation process | False |

Sources: scripts/evaluator.py:61-122
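For reference, the table roughly corresponds to an argparse configuration like the sketch below; the real script also merges Bittensor's standard wallet and subtensor arguments, which are omitted here.

```python
import argparse

DEFAULT_TASKS = "arc_challenge,arc_easy,openbookqa,winogrande,piqa,hellaswag,mmlu"

parser = argparse.ArgumentParser(description="Templar Evaluator (illustrative sketch)")
parser.add_argument("--netuid", type=int, default=3, help="Bittensor network UID")
parser.add_argument("--actual_batch_size", type=int, default=8, help="Evaluation batch size")
parser.add_argument("--device", type=str, default="cuda:7", help="Device used for evaluation")
parser.add_argument("--tasks", type=str, default=DEFAULT_TASKS, help="Comma-separated benchmark tasks")
parser.add_argument("--checkpoint_path", type=str, default="checkpoints/", help="Where checkpoints are saved/loaded")
parser.add_argument("--eval_interval", type=int, default=600, help="Seconds between evaluation cycles")
parser.add_argument("--uid", type=int, default=None, help="Override the wallet's UID")
parser.add_argument("--skip-gaps", action="store_true", help="Skip gaps in the evaluation process")
config = parser.parse_args()
```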

The Evaluator is designed to run as a continuous service. It can be deployed using Docker via the provided compose file.

The repository includes a Docker Compose configuration for deploying the Evaluator:

# Key elements from the compose.yml file:
# - Uses the Templar container image
# - Maps Bittensor wallet directory
# - Sets up required environment variables
# - Configures GPU access
# - Sets up logging with journald
# - Includes watchtower for automatic updates

Sources: scripts/evaluator-setup/compose.yml:1-33

The Evaluator requires several environment variables:

  • R2_DATASET_ACCOUNT_ID: R2 dataset account identifier
  • R2_DATASET_BUCKET_NAME: R2 storage bucket name
  • R2_DATASET_READ_ACCESS_KEY_ID: R2 read access key
  • R2_DATASET_READ_SECRET_ACCESS_KEY: R2 secret access key
  • INFLUXDB_TOKEN: InfluxDB API token (optional)

Sources: scripts/evaluator.py:15-25
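A small startup check along these lines shows how the variables might be consumed; the variable names come from the list above, while the validation logic itself is illustrative.

```python
import os
import sys

REQUIRED_ENV_VARS = (
    "R2_DATASET_ACCOUNT_ID",
    "R2_DATASET_BUCKET_NAME",
    "R2_DATASET_READ_ACCESS_KEY_ID",
    "R2_DATASET_READ_SECRET_ACCESS_KEY",
)

# Fail fast if any required R2 credential is missing.
missing = [name for name in REQUIRED_ENV_VARS if not os.environ.get(name)]
if missing:
    sys.exit(f"Missing required environment variables: {', '.join(missing)}")

# INFLUXDB_TOKEN is optional; without it, metrics logging can be skipped.
influxdb_token = os.environ.get("INFLUXDB_TOKEN")
```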

The Evaluator logs detailed metrics to InfluxDB:

  1. Benchmark Metrics: Runtime performance and execution details
  2. Task Results: Individual scores for each benchmark task
  3. Summary Metrics: Aggregated statistics across all tasks
flowchart TD
    subgraph "Metrics Collection"
        Benchmark["Benchmark Metrics\n- Runtime\n- Exit codes"]
        TaskScores["Task Scores\n- acc/acc_norm by task"]
        Summary["Summary Metrics\n- Number of tasks\n- Window/block info"]
    end
    
    subgraph "MetricsLogger"
        Log["log() method"]
        Tags["Metric Tags\n- window\n- block\n- global_step\n- task"]
        Fields["Metric Fields\n- scores\n- runtime\n- count"]
    end
    
    subgraph "Storage"
        InfluxDB["InfluxDB\nTime-series database"]
    end
    
    Benchmark --> Log
    TaskScores --> Log
    Summary --> Log
    
    Log --> Tags
    Log --> Fields
    
    Log --> InfluxDB

Sources: scripts/evaluator.py:368-441
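The logging calls can be pictured with the hedged sketch below; the `measurement`/`tags`/`fields` keyword names are assumptions about `MetricsLogger.log()`, chosen to match the tag and field groups shown in the diagram.

```python
def log_task_results(metrics_logger, window: int, block: int,
                     global_step: int, scores: dict[str, float], runtime: float) -> None:
    # One point per benchmark task, tagged so InfluxDB can group by task and window.
    for task, score in scores.items():
        metrics_logger.log(
            measurement="benchmark_task",
            tags={"window": window, "block": block,
                  "global_step": global_step, "task": task},
            fields={"score": score},
        )

    # A summary point for the whole evaluation run.
    metrics_logger.log(
        measurement="benchmark_summary",
        tags={"window": window, "block": block, "global_step": global_step},
        fields={"num_tasks": len(scores), "runtime": runtime},
    )
```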

The Evaluator integrates with several other components of the Templar framework:

  1. Comms System: For checkpoint retrieval and blockchain interaction
  2. Metrics System: For logging evaluation results
  3. Storage System: For accessing model checkpoints

It operates independently of miners and validators but provides crucial feedback on the quality of the model being trained by the network.

Sources: scripts/evaluator.py:184-196