Evaluator
The Evaluator is an autonomous service in the Templar framework that continuously assesses model performance by running standardized benchmark tasks on the latest checkpoints. It operates independently from miners and validators, providing objective measurements of model capabilities throughout the training process.
Sources: scripts/evaluator.py:1-40
Overview
The Evaluator performs the following key functions:
- Monitors for new model checkpoints by window number
- Downloads and loads the latest checkpoint
- Executes standardized benchmark tasks (e.g., ARC, MMLU, Winogrande)
- Logs evaluation results to InfluxDB for monitoring and analysis
- Manages resources efficiently for continuous operation
Unlike miners that generate gradients or validators that assess gradient quality, the Evaluator focuses exclusively on measuring the end-to-end capabilities of the current model against established benchmarks.
flowchart TD subgraph "Evaluator Service" Monitor["Monitor Checkpoints"] --> Detect["Detect New Checkpoint"] Detect --> Load["Load Latest Model"] Load --> Evaluate["Run Benchmarks"] Evaluate --> Log["Log Results"] Log --> Cleanup["Clean Resources"] Cleanup --> Monitor end subgraph "External Components" R2["Checkpoint Storage"] InfluxDB["InfluxDB Metrics"] LMEval["LM Evaluation Harness"] end Load <--> R2 Evaluate <--> LMEval Log --> InfluxDB
Sources: scripts/evaluator.py:125-147 , scripts/evaluator.py:559-568
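In code, this cycle amounts to a simple polling loop. The sketch below is illustrative rather than a copy of scripts/evaluator.py: it assumes the evaluator object exposes the coroutines named in the Architecture section (update_state, load_latest_model, _evaluate), a comms.get_start_window() call, and a last_eval_window attribute, with the polling interval taken from --eval_interval.

```python
import asyncio


async def evaluation_loop(evaluator, eval_interval: int = 600) -> None:
    """Poll for new checkpoints and evaluate each one exactly once (sketch)."""
    while True:
        # Refresh chain/peer state before checking for new checkpoints.
        await evaluator.update_state()
        start_window = await evaluator.comms.get_start_window()

        if start_window is not None and start_window > evaluator.last_eval_window:
            # A newer checkpoint may be available: try to load and evaluate it.
            if await evaluator.load_latest_model():
                await evaluator._evaluate()
                evaluator.last_eval_window = start_window

        # Sleep until the next polling interval (--eval_interval seconds).
        await asyncio.sleep(eval_interval)
```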
Architecture
The Evaluator operates as a standalone service that interacts with the Templar checkpoint storage system and metrics infrastructure. It leverages the Language Model Evaluation Harness (lm-eval) to execute standardized benchmarks.
```mermaid
classDiagram
    class Evaluator {
        -config: bt.Config
        -netuid: int
        -model: LlamaForCausalLM
        -metrics_logger: MetricsLogger
        -last_eval_window: int
        -comms: tplr.comms.Comms
        +__init__()
        +update_state()
        +load_latest_model()
        +_run_lm_eval()
        +_process_results()
        +_evaluate()
        +run()
        +cleanup()
    }
    class Comms {
        +get_latest_checkpoint()
        +get_commitments()
        +update_peers_with_buckets()
        +get_start_window()
    }
    class MetricsLogger {
        +log()
    }
    class LlamaForCausalLM {
        +save_pretrained()
        +load_state_dict()
    }
    Evaluator --> Comms : uses
    Evaluator --> MetricsLogger : logs metrics via
    Evaluator --> LlamaForCausalLM : evaluates
```
Sources: scripts/evaluator.py:125-198
Evaluation Workflow
Checkpoint Detection and Loading
The Evaluator continuously monitors the blockchain for the latest checkpoint by window number. When a new checkpoint is detected, it downloads and loads the model weights.
```mermaid
sequenceDiagram
    participant Evaluator
    participant Comms as "Comms Module"
    participant R2 as "R2 Storage"
    participant Model as "LLaMA Model"
    Evaluator->>Comms: update_state()
    Evaluator->>Comms: get_start_window()
    Comms-->>Evaluator: current_window
    alt new checkpoint available
        Evaluator->>Comms: get_latest_checkpoint(version)
        Comms->>R2: fetch latest checkpoint
        R2-->>Comms: checkpoint data
        Comms-->>Evaluator: checkpoint_data, metadata
        Evaluator->>Model: load_state_dict(checkpoint_data["model_state_dict"])
        Evaluator->>Evaluator: Store momentum data
    else no new checkpoint
        Evaluator->>Evaluator: Wait for next interval
    end
```
Sources: scripts/evaluator.py:211-282 , tests/test_evaluator.py:59-83
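The loading step can be sketched as follows. This is not the exact source: the version argument, the "current_window" key, and the return convention are assumptions, and error handling plus momentum bookkeeping are omitted.

```python
async def load_latest_model(evaluator) -> bool:
    """Fetch the newest checkpoint from storage and load it into the model (sketch)."""
    result = await evaluator.comms.get_latest_checkpoint(version=evaluator.version)
    if result is None:
        return False  # nothing has been published yet

    checkpoint_data, _metadata = result
    checkpoint_window = checkpoint_data.get("current_window", -1)  # assumed key name
    if checkpoint_window <= evaluator.last_eval_window:
        return False  # this window has already been evaluated

    # The sequence diagram shows the weights stored under "model_state_dict".
    evaluator.model.load_state_dict(checkpoint_data["model_state_dict"])
    evaluator.model.to(evaluator.config.device)
    evaluator.model.eval()
    return True
```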
Benchmark Execution
When a new checkpoint is loaded, the Evaluator executes a series of benchmark tasks using the LM Evaluation Harness (lm-eval). Different tasks can be configured and executed on a schedule.
flowchart TD subgraph "Evaluator._evaluate()" LoadModel["Load Latest Model"] --> SaveTemp["Save Model to Temp Location"] SaveTemp --> RunTasks["Run Benchmark Tasks"] RunTasks --> ProcessResults["Process Results"] ProcessResults --> CleanupTemp["Clean Up Temp Files"] end subgraph "Benchmark Tasks" RegularTasks["Regular Tasks\narc_challenge, arc_easy\npiqa, winogrande, etc."] MMLU["MMLU (Full)\nPeriodically on 4th run"] end RunTasks --> RegularTasks RunTasks --> MMLU RegularTasks --> |"_run_lm_eval()"| LMEval["lm-eval Command"] MMLU --> |"_run_lm_eval() with few-shot"| LMEval LMEval --> Results["Results JSON"] Results --> |"_process_results()"| Metrics["Parse Metrics"] Metrics --> InfluxDB["Log to InfluxDB"]
Sources: scripts/evaluator.py:446-557
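A condensed sketch of that orchestration is shown below. The temporary-directory handling follows the flowchart; the _run_lm_eval and _process_results signatures, and the presence of a tokenizer attribute, are assumptions rather than the script's actual API.

```python
import shutil
import tempfile


async def evaluate_checkpoint(evaluator, tasks: list[str]) -> None:
    """Run one evaluation round against the currently loaded model (sketch)."""
    # 1. Export the in-memory model so lm-eval's `hf` backend can load it from disk.
    model_dir = tempfile.mkdtemp(prefix="templar-eval-")
    evaluator.model.save_pretrained(model_dir)
    evaluator.tokenizer.save_pretrained(model_dir)  # tokenizer attribute assumed

    try:
        # 2. Run the benchmark suite as a single lm-eval invocation.
        results_path = evaluator._run_lm_eval(
            model_path=model_dir,
            tasks=",".join(tasks),
        )
        # 3. Parse the results JSON and push metrics to InfluxDB.
        evaluator._process_results(results_path)
    finally:
        # 4. Always reclaim disk space, even if the run failed.
        shutil.rmtree(model_dir, ignore_errors=True)
```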
Results Processing
After each benchmark run, the Evaluator processes the results by:
- Parsing the JSON output from lm-eval
- Extracting relevant metrics (accuracy scores)
- Logging both individual task results and summary metrics to InfluxDB
The system prioritizes certain metrics in a specific order (e.g., acc_norm over acc) based on their relevance and reliability.
Sources: scripts/evaluator.py:338-441
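The following sketch shows one way to implement that parsing step. It is not the script's own _process_results: depending on the lm-eval version, metric keys may be plain (acc_norm) or suffixed (acc_norm,none), so both spellings are checked, and the acc_norm-over-acc preference described above is encoded as a priority tuple.

```python
import json
from pathlib import Path

# Preference order for the reported score; acc_norm wins when it is present.
METRIC_PRIORITY = ("acc_norm", "acc")


def parse_lm_eval_results(results_file: str) -> dict[str, float]:
    """Extract one score per task from an lm-eval results JSON file (sketch)."""
    data = json.loads(Path(results_file).read_text())
    scores: dict[str, float] = {}

    for task, metrics in data.get("results", {}).items():
        for name in METRIC_PRIORITY:
            for key in (name, f"{name},none"):
                if key in metrics:
                    scores[task] = float(metrics[key])
                    break
            if task in scores:
                break
    return scores
```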
Benchmark Tasks
The Evaluator supports multiple benchmark tasks that assess different capabilities of the language model:
| Task | Description | Metric | Execution Mode |
|---|---|---|---|
| arc_challenge | AI2 Reasoning Challenge (hard) | acc/acc_norm | Zero-shot |
| arc_easy | AI2 Reasoning Challenge (easy) | acc/acc_norm | Zero-shot |
| winogrande | Winograd Schema Challenge | acc/acc_norm | Zero-shot |
| piqa | Physical Interaction QA | acc/acc_norm | Zero-shot |
| hellaswag | Commonsense NLI | acc/acc_norm | Zero-shot |
| openbookqa | Open Book Question Answering | acc/acc_norm | Zero-shot |
| mmlu | Massive Multitask Language Understanding | acc/acc_norm | Both zero-shot and 5-shot |
For MMLU specifically, the Evaluator runs in two modes:
- Regular zero-shot evaluation alongside the other tasks
- Periodic 5-shot evaluation (every 4th run) with a different configuration
Sources: scripts/evaluator.py:89-94 , scripts/evaluator.py:493-545
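The every-4th-run schedule could be expressed with a simple counter check, as in the hypothetical helper below; the exact counter name and the few-shot configuration passed to lm-eval are assumptions.

```python
def select_mmlu_mode(eval_counter: int) -> dict:
    """Choose the MMLU configuration for the current run (hypothetical helper)."""
    if eval_counter % 4 == 0:
        # Every 4th run: full MMLU with 5-shot prompting.
        return {"tasks": "mmlu", "num_fewshot": 5}
    # All other runs: zero-shot MMLU alongside the regular task list.
    return {"tasks": "mmlu", "num_fewshot": 0}
```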
Benchmark Execution Details
The Evaluator runs benchmarks by:
- Saving the model to a temporary location
- Executing the lm-eval command with appropriate parameters
- Collecting and parsing results from the output JSON
```mermaid
flowchart LR
    subgraph "Evaluator._run_lm_eval"
        BuildCmd["Build lm-eval Command"] --> RunCmd["Execute Command"]
        RunCmd --> TrackTime["Track Runtime"]
        TrackTime --> ReturnResults["Return Results"]
    end
    subgraph "Command Parameters"
        Model["--model hf"]
        ModelArgs["--model_args pretrained=path"]
        Tasks["--tasks task1,task2"]
        Device["--device cuda:x"]
        BatchSize["--batch_size 8"]
        Output["--output_path path"]
        FewShot["--num_fewshot (optional)"]
        Limit["--limit (optional)"]
    end
    BuildCmd --> Model
    BuildCmd --> ModelArgs
    BuildCmd --> Tasks
    BuildCmd --> Device
    BuildCmd --> BatchSize
    BuildCmd --> Output
    BuildCmd --> FewShot
    BuildCmd --> Limit
```
Sources: scripts/evaluator.py:284-336
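Putting those parameters together, a minimal command builder might look like the following. It shells out to the LM Evaluation Harness console entry point with exactly the flags listed above; the function name and return convention are illustrative, not the script's actual _run_lm_eval signature.

```python
import subprocess
import time


def run_lm_eval(
    model_path: str,
    tasks: str,
    output_path: str,
    device: str = "cuda:7",
    batch_size: int = 8,
    num_fewshot: int | None = None,
    limit: float | None = None,
) -> tuple[int, float]:
    """Build and execute an lm-eval command; return (exit_code, runtime_seconds)."""
    cmd = [
        "lm_eval",  # console entry point of the LM Evaluation Harness
        "--model", "hf",
        "--model_args", f"pretrained={model_path}",
        "--tasks", tasks,
        "--device", device,
        "--batch_size", str(batch_size),
        "--output_path", output_path,
    ]
    if num_fewshot is not None:
        cmd += ["--num_fewshot", str(num_fewshot)]
    if limit is not None:
        cmd += ["--limit", str(limit)]

    start = time.time()
    exit_code = subprocess.run(cmd, check=False).returncode
    return exit_code, time.time() - start
```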
Configuration Options
The Evaluator provides several configuration options to customize its behavior:
| Parameter | Description | Default |
|---|---|---|
| --netuid | Bittensor network UID | 3 |
| --actual_batch_size | Evaluation batch size | 8 |
| --device | Device to use for evaluation | cuda:7 |
| --tasks | Comma-separated list of tasks | arc_challenge,arc_easy,openbookqa,winogrande,piqa,hellaswag,mmlu |
| --checkpoint_path | Path to save/load checkpoints | checkpoints/ |
| --eval_interval | Seconds between evaluations | 600 (10 minutes) |
| --uid | Override the wallet's UID | None |
| --skip-gaps | Skip gaps in the evaluation process | False |
Sources: scripts/evaluator.py:61-122
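As a reference, the table above maps onto an argument parser roughly like the one below. The real script builds a Bittensor bt.Config rather than a plain argparse.Namespace, so treat this as an illustration of the defaults only.

```python
import argparse


def evaluator_config() -> argparse.Namespace:
    """Argument parser mirroring the options in the table above (illustrative)."""
    parser = argparse.ArgumentParser(description="Templar evaluator")
    parser.add_argument("--netuid", type=int, default=3, help="Bittensor network UID")
    parser.add_argument("--actual_batch_size", type=int, default=8, help="Evaluation batch size")
    parser.add_argument("--device", default="cuda:7", help="Device to use for evaluation")
    parser.add_argument(
        "--tasks",
        default="arc_challenge,arc_easy,openbookqa,winogrande,piqa,hellaswag,mmlu",
        help="Comma-separated list of benchmark tasks",
    )
    parser.add_argument("--checkpoint_path", default="checkpoints/", help="Path to save/load checkpoints")
    parser.add_argument("--eval_interval", type=int, default=600, help="Seconds between evaluations")
    parser.add_argument("--uid", type=int, default=None, help="Override the wallet's UID")
    parser.add_argument("--skip-gaps", action="store_true", help="Skip gaps in the evaluation process")
    return parser.parse_args()
```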
Deployment
The Evaluator is designed to run as a continuous service. It can be deployed using Docker via the provided compose file.
Docker Deployment
The repository includes a Docker Compose configuration for deploying the Evaluator:
```yaml
# Key elements from the compose.yml file:
# - Uses the Templar container image
# - Maps Bittensor wallet directory
# - Sets up required environment variables
# - Configures GPU access
# - Sets up logging with journald
# - Includes watchtower for automatic updates
```
Sources: scripts/evaluator-setup/compose.yml:1-33
Environment Requirements
The Evaluator requires several environment variables:
- R2_DATASET_ACCOUNT_ID: R2 dataset account identifier
- R2_DATASET_BUCKET_NAME: R2 storage bucket name
- R2_DATASET_READ_ACCESS_KEY_ID: R2 read access key
- R2_DATASET_READ_SECRET_ACCESS_KEY: R2 secret access key
- INFLUXDB_TOKEN: InfluxDB API token (optional)
Sources: scripts/evaluator.py:15-25
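A small start-up check along these lines can fail fast when credentials are missing; the helper below is illustrative and not part of the script.

```python
import os

REQUIRED_ENV_VARS = (
    "R2_DATASET_ACCOUNT_ID",
    "R2_DATASET_BUCKET_NAME",
    "R2_DATASET_READ_ACCESS_KEY_ID",
    "R2_DATASET_READ_SECRET_ACCESS_KEY",
)


def check_environment() -> None:
    """Fail fast if any required R2 credential is missing (illustrative helper)."""
    missing = [name for name in REQUIRED_ENV_VARS if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
    if not os.getenv("INFLUXDB_TOKEN"):
        # Metrics logging is optional; evaluation can proceed without it.
        print("INFLUXDB_TOKEN not set; InfluxDB metrics will be disabled.")
```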
Metrics Logging
The Evaluator logs detailed metrics to InfluxDB:
- Benchmark Metrics: Runtime performance and execution details
- Task Results: Individual scores for each benchmark task
- Summary Metrics: Aggregated statistics across all tasks
flowchart TD subgraph "Metrics Collection" Benchmark["Benchmark Metrics\n- Runtime\n- Exit codes"] TaskScores["Task Scores\n- acc/acc_norm by task"] Summary["Summary Metrics\n- Number of tasks\n- Window/block info"] end subgraph "MetricsLogger" Log["log() method"] Tags["Metric Tags\n- window\n- block\n- global_step\n- task"] Fields["Metric Fields\n- scores\n- runtime\n- count"] end subgraph "Storage" InfluxDB["InfluxDB\nTime-series database"] end Benchmark --> Log TaskScores --> Log Summary --> Log Log --> Tags Log --> Fields Log --> InfluxDB
Sources: scripts/evaluator.py:368-441
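The call shape below illustrates how per-task scores and a summary point might be pushed through MetricsLogger.log(). The measurement, tag, and field names are assumptions; only the log() method itself comes from the class diagram above.

```python
def log_task_results(metrics_logger, window: int, block: int, global_step: int,
                     scores: dict[str, float], runtime: float) -> None:
    """Push benchmark results to InfluxDB via MetricsLogger (call shape assumed)."""
    # One point per benchmark task, tagged with the task name.
    for task, score in scores.items():
        metrics_logger.log(
            measurement="benchmark_task",
            tags={"window": window, "block": block, "global_step": global_step, "task": task},
            fields={"score": score},
        )

    # One summary point per evaluation round.
    metrics_logger.log(
        measurement="benchmark_summary",
        tags={"window": window, "block": block, "global_step": global_step},
        fields={"num_tasks": len(scores), "runtime_seconds": runtime},
    )
```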
Integration with Other Systems
The Evaluator integrates with several other components of the Templar framework:
- Comms System: For checkpoint retrieval and blockchain interaction
- Metrics System: For logging evaluation results
- Storage System: For accessing model checkpoints
It operates independently of miners and validators but provides crucial feedback on the quality of the model being trained by the network.
Sources: scripts/evaluator.py:184-196