Evaluator
The Evaluator is an autonomous service in the Templar framework that continuously assesses model performance by running standardized benchmark tasks on the latest checkpoints. It operates independently from miners and validators, providing objective measurements of model capabilities throughout the training process.
Sources: scripts/evaluator.py:1-40
Overview
The Evaluator performs the following key functions:
- Monitors for new model checkpoints by window number
- Downloads and loads the latest checkpoint
- Executes standardized benchmark tasks (e.g., ARC, MMLU, Winogrande)
- Logs evaluation results to InfluxDB for monitoring and analysis
- Manages resources efficiently for continuous operation
Unlike miners that generate gradients or validators that assess gradient quality, the Evaluator focuses exclusively on measuring the end-to-end capabilities of the current model against established benchmarks.
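A minimal sketch of this loop is shown below. The method names mirror the class methods documented in the Architecture section, but the exact signatures, the hypothetical get_latest_checkpoint_window() helper, and the async structure are assumptions rather than a copy of scripts/evaluator.py.
```python
import asyncio

async def run(evaluator, eval_interval: int = 600):
    """Illustrative main loop: poll for new checkpoints and evaluate each one once."""
    while True:
        await evaluator.update_state()                 # refresh chain / peer state
        window = await evaluator.get_latest_checkpoint_window()  # hypothetical helper
        if window is not None and window > evaluator.last_eval_window:
            await evaluator.load_latest_model()        # pull weights from checkpoint storage
            evaluator.evaluate()                       # run lm-eval benchmarks and log results
            evaluator.last_eval_window = window
            evaluator.cleanup()                        # free GPU memory and temp files
        await asyncio.sleep(eval_interval)             # wait before polling again
```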
flowchart TD
subgraph "Evaluator Service"
Monitor["Monitor Checkpoints"] --> Detect["Detect New Checkpoint"]
Detect --> Load["Load Latest Model"]
Load --> Evaluate["Run Benchmarks"]
Evaluate --> Log["Log Results"]
Log --> Cleanup["Clean Resources"]
Cleanup --> Monitor
end
subgraph "External Components"
R2["Checkpoint Storage"]
InfluxDB["InfluxDB Metrics"]
LMEval["LM Evaluation Harness"]
end
Load <--> R2
Evaluate <--> LMEval
Log --> InfluxDB
Sources: scripts/evaluator.py:125-147 , scripts/evaluator.py:559-568
Architecture
The Evaluator operates as a standalone service that interacts with the Templar checkpoint storage system and metrics infrastructure. It leverages the Language Model Evaluation Harness (lm-eval) to execute standardized benchmarks.
classDiagram
class Evaluator {
-config: bt.Config
-netuid: int
-model: LlamaForCausalLM
-metrics_logger: MetricsLogger
-last_eval_window: int
-comms: tplr.comms.Comms
+__init__()
+update_state()
+load_latest_model()
+_run_lm_eval()
+_process_results()
+_evaluate()
+run()
+cleanup()
}
class Comms {
+get_latest_checkpoint()
+get_commitments()
+update_peers_with_buckets()
+get_start_window()
}
class MetricsLogger {
+log()
}
class LlamaForCausalLM {
+save_pretrained()
+load_state_dict()
}
Evaluator --> Comms : uses
Evaluator --> MetricsLogger : logs metrics via
Evaluator --> LlamaForCausalLM : evaluates
Sources: scripts/evaluator.py:125-198
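The relationships above can be summarized in a skeleton like the following; constructor arguments, attribute names, and defaults are illustrative rather than the exact ones in scripts/evaluator.py.
```python
from transformers import LlamaConfig, LlamaForCausalLM

class Evaluator:
    """Skeleton showing how the evaluator wires together its collaborators."""

    def __init__(self, config, model_config: LlamaConfig, comms, metrics_logger):
        self.config = config                      # parsed CLI config (bt.Config)
        self.netuid = config.netuid               # Bittensor subnet UID
        self.device = config.device               # e.g. "cuda:7"
        self.model = LlamaForCausalLM(model_config).to(self.device)
        self.comms = comms                        # checkpoint retrieval and chain state
        self.metrics_logger = metrics_logger      # InfluxDB metrics sink
        self.last_eval_window = 0                 # highest window already evaluated
```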
Evaluation Workflow
Checkpoint Detection and Loading
The Evaluator continuously monitors the blockchain for the latest checkpoint by window number. When a new checkpoint is detected, it downloads and loads the model weights.
sequenceDiagram
participant Evaluator
participant Comms as "Comms Module"
participant R2 as "R2 Storage"
participant Model as "LLaMA Model"
Evaluator->>Comms: update_state()
Evaluator->>Comms: get_start_window()
Comms-->>Evaluator: current_window
alt new checkpoint available
Evaluator->>Comms: get_latest_checkpoint(version)
Comms->>R2: fetch latest checkpoint
R2-->>Comms: checkpoint data
Comms-->>Evaluator: checkpoint_data, metadata
Evaluator->>Model: load_state_dict(checkpoint_data["model_state_dict"])
Evaluator->>Evaluator: Store momentum data
else no new checkpoint
Evaluator->>Evaluator: Wait for next interval
end
Sources: scripts/evaluator.py:211-282 , tests/test_evaluator.py:59-83
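A condensed sketch of this flow is given below. It assumes get_latest_checkpoint() returns a (checkpoint_data, metadata) tuple as in the sequence diagram; the current_window field name and the version attribute are assumptions.
```python
async def load_latest_model(self) -> bool:
    """Return True if a newer checkpoint was found and loaded into self.model."""
    result = await self.comms.get_latest_checkpoint(version=self.version)
    if result is None:
        return False                                  # nothing available yet

    checkpoint_data, _metadata = result
    checkpoint_window = checkpoint_data.get("current_window", 0)   # assumed field name
    if checkpoint_window <= self.last_eval_window:
        return False                                  # this window was already evaluated

    # Move the stored weights onto the evaluation device and switch to eval mode.
    self.model.load_state_dict(
        {k: v.to(self.device) for k, v in checkpoint_data["model_state_dict"].items()}
    )
    self.model.eval()
    self.last_eval_window = checkpoint_window
    return True
```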
Benchmark Execution
When a new checkpoint is loaded, the Evaluator executes a series of benchmark tasks using the LM Evaluation Harness (lm-eval). Different tasks can be configured and executed on a schedule.
flowchart TD
subgraph "Evaluator._evaluate()"
LoadModel["Load Latest Model"] --> SaveTemp["Save Model to Temp Location"]
SaveTemp --> RunTasks["Run Benchmark Tasks"]
RunTasks --> ProcessResults["Process Results"]
ProcessResults --> CleanupTemp["Clean Up Temp Files"]
end
subgraph "Benchmark Tasks"
RegularTasks["Regular Tasks\narc_challenge, arc_easy\npiqa, winogrande, etc."]
MMLU["MMLU (Full)\nPeriodically on 4th run"]
end
RunTasks --> RegularTasks
RunTasks --> MMLU
RegularTasks --> |"_run_lm_eval()"| LMEval["lm-eval Command"]
MMLU --> |"_run_lm_eval() with few-shot"| LMEval
LMEval --> Results["Results JSON"]
Results --> |"_process_results()"| Metrics["Parse Metrics"]
Metrics --> InfluxDB["Log to InfluxDB"]
Sources: scripts/evaluator.py:446-557
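A simplified sketch of this lifecycle follows; the tokenizer attribute, the return shape of _run_lm_eval(), and the temp-directory handling are assumptions.
```python
import shutil
import tempfile

def evaluate(self) -> None:
    """Save the current model to a temp directory, run the benchmark suite, then clean up."""
    model_dir = tempfile.mkdtemp(prefix="evaluator_model_")
    try:
        # lm-eval's "hf" loader expects a HuggingFace-style model directory on disk.
        self.model.save_pretrained(model_dir)
        self.tokenizer.save_pretrained(model_dir)

        # Run the configured zero-shot tasks, then parse and log the resulting JSON.
        results_path, _runtime = self._run_lm_eval(model_dir, tasks=self.config.tasks)
        self._process_results(results_path)
    finally:
        shutil.rmtree(model_dir, ignore_errors=True)  # always remove the temp export
```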
Results Processing
After each benchmark run, the Evaluator processes the results by:
- Parsing the JSON output from lm-eval
- Extracting relevant metrics (accuracy scores)
- Logging both individual task results and summary metrics to InfluxDB
The system prioritizes certain metrics in a specific order (e.g., acc_norm over acc) based on their relevance and reliability.
Sources: scripts/evaluator.py:338-441
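A sketch of the parsing step, assuming lm-eval's standard results JSON layout (a top-level "results" mapping of task name to metric dict); the exact metric keys and priority list are illustrative.
```python
import json

# Preferred metrics in descending priority: normalized accuracy first, then raw accuracy.
# Newer lm-eval versions report keys like "acc_norm,none"; older ones use plain "acc_norm".
METRIC_PRIORITY = ("acc_norm,none", "acc_norm", "acc,none", "acc")

def process_results(results_path: str) -> dict[str, float]:
    """Extract one headline score per task from an lm-eval results file."""
    with open(results_path) as f:
        results = json.load(f)["results"]

    scores: dict[str, float] = {}
    for task, metrics in results.items():
        for key in METRIC_PRIORITY:
            if key in metrics:
                scores[task] = float(metrics[key])
                break
    return scores
```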
Benchmark Tasks
The Evaluator supports multiple benchmark tasks that assess different capabilities of the language model:
| Task | Description | Metric | Execution Mode |
|---|---|---|---|
| arc_challenge | AI2 Reasoning Challenge (hard) | acc/acc_norm | Zero-shot |
| arc_easy | AI2 Reasoning Challenge (easy) | acc/acc_norm | Zero-shot |
| winogrande | Winograd Schema Challenge | acc/acc_norm | Zero-shot |
| piqa | Physical Interaction QA | acc/acc_norm | Zero-shot |
| hellaswag | Commonsense NLI | acc/acc_norm | Zero-shot |
| openbookqa | Open Book Question Answering | acc/acc_norm | Zero-shot |
| mmlu | Massive Multitask Language Understanding | acc/acc_norm | Both zero-shot and 5-shot |
For MMLU specifically, the Evaluator can run it in two modes:
- Regular zero-shot evaluation with other tasks
- Periodic 5-shot evaluation (every 4th run) with a different configuration, as sketched below
Sources: scripts/evaluator.py:89-94 , scripts/evaluator.py:493-545
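A sketch of how this cadence might be expressed, assuming an eval_counter attribute that increments once per evaluation cycle and the _run_lm_eval() helper described in the next subsection (both names and the return shape are assumptions):
```python
def maybe_run_full_mmlu(self, model_dir: str) -> None:
    """MMLU runs zero-shot with the regular suite; a full 5-shot pass runs every 4th cycle."""
    if self.eval_counter % 4 == 0:
        # Standalone MMLU with 5-shot prompting, logged separately from the zero-shot suite.
        results_path, _runtime = self._run_lm_eval(model_dir, tasks="mmlu", num_fewshot=5)
        self._process_results(results_path)
    self.eval_counter += 1
```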
Benchmark Execution Details
The Evaluator runs benchmarks by:
- Saving the model to a temporary location
- Executing the lm-eval command with appropriate parameters
- Collecting and parsing results from the output JSON
flowchart LR
subgraph "Evaluator._run_lm_eval"
BuildCmd["Build lm-eval Command"] --> RunCmd["Execute Command"]
RunCmd --> TrackTime["Track Runtime"]
TrackTime --> ReturnResults["Return Results"]
end
subgraph "Command Parameters"
Model["--model hf"]
ModelArgs["--model_args pretrained=path"]
Tasks["--tasks task1,task2"]
Device["--device cuda:x"]
BatchSize["--batch_size 8"]
Output["--output_path path"]
FewShot["--num_fewshot (optional)"]
Limit["--limit (optional)"]
end
BuildCmd --> Model
BuildCmd --> ModelArgs
BuildCmd --> Tasks
BuildCmd --> Device
BuildCmd --> BatchSize
BuildCmd --> Output
BuildCmd --> FewShot
BuildCmd --> Limit
Sources: scripts/evaluator.py:284-336
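A minimal sketch of the command construction, using only documented lm-eval CLI flags; the lm_eval console entry point, subprocess handling, and return value are illustrative rather than a copy of _run_lm_eval().
```python
import subprocess
import time

def run_lm_eval(model_path: str, tasks: str, device: str = "cuda:7",
                batch_size: int = 8, output_path: str = "eval_results",
                num_fewshot: int | None = None, limit: float | None = None):
    """Invoke the lm-eval CLI against a locally saved model; return the output path and runtime."""
    cmd = [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_path}",
        "--tasks", tasks,
        "--device", device,
        "--batch_size", str(batch_size),
        "--output_path", output_path,
    ]
    if num_fewshot is not None:
        cmd += ["--num_fewshot", str(num_fewshot)]   # e.g. 5 for the periodic MMLU run
    if limit is not None:
        cmd += ["--limit", str(limit)]               # cap examples per task for quick runs

    start = time.time()
    subprocess.run(cmd, check=True)                  # raises CalledProcessError on failure
    return output_path, time.time() - start          # runtime is logged alongside the scores
```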
Configuration Options
The Evaluator provides several configuration options to customize its behavior:
| Parameter | Description | Default |
|---|---|---|
| --netuid | Bittensor network UID | 3 |
| --actual_batch_size | Evaluation batch size | 8 |
| --device | Device to use for evaluation | cuda:7 |
| --tasks | Comma-separated list of tasks | arc_challenge,arc_easy,openbookqa,winogrande,piqa,hellaswag,mmlu |
| --checkpoint_path | Path to save/load checkpoints | checkpoints/ |
| --eval_interval | Seconds between evaluations | 600 (10 minutes) |
| --uid | Override the wallet's UID | None |
| --skip-gaps | Skip gaps in the evaluation process | False |
Sources: scripts/evaluator.py:61-122
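These options correspond to CLI flags roughly like the following argparse sketch; exact flag spelling and defaults should be checked against scripts/evaluator.py.
```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Templar evaluator service")
    parser.add_argument("--netuid", type=int, default=3, help="Bittensor network UID")
    parser.add_argument("--actual_batch_size", type=int, default=8, help="Evaluation batch size")
    parser.add_argument("--device", default="cuda:7", help="Device to use for evaluation")
    parser.add_argument(
        "--tasks",
        default="arc_challenge,arc_easy,openbookqa,winogrande,piqa,hellaswag,mmlu",
        help="Comma-separated list of benchmark tasks",
    )
    parser.add_argument("--checkpoint_path", default="checkpoints/", help="Path to save/load checkpoints")
    parser.add_argument("--eval_interval", type=int, default=600, help="Seconds between evaluations")
    parser.add_argument("--uid", type=int, default=None, help="Override the wallet's UID")
    parser.add_argument("--skip-gaps", action="store_true", help="Skip gaps in the evaluation process")
    return parser
```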
Deployment
The Evaluator is designed to run as a continuous service. It can be deployed using Docker via the provided compose file.
Docker Deployment
The repository includes a Docker Compose configuration for deploying the Evaluator:
```yaml
# Key elements from the compose.yml file:
# - Uses the Templar container image
# - Maps Bittensor wallet directory
# - Sets up required environment variables
# - Configures GPU access
# - Sets up logging with journald
# - Includes watchtower for automatic updates
```
Sources: scripts/evaluator-setup/compose.yml:1-33
Environment Requirements
The Evaluator requires several environment variables:
- R2_DATASET_ACCOUNT_ID: R2 dataset account identifier
- R2_DATASET_BUCKET_NAME: R2 storage bucket name
- R2_DATASET_READ_ACCESS_KEY_ID: R2 read access key
- R2_DATASET_READ_SECRET_ACCESS_KEY: R2 secret access key
- INFLUXDB_TOKEN: InfluxDB API token (optional)
Sources: scripts/evaluator.py:15-25
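These variables are typically checked at startup; a minimal validation sketch (the fail-fast behaviour is an assumption, not necessarily what scripts/evaluator.py does):
```python
import os

REQUIRED_ENV = (
    "R2_DATASET_ACCOUNT_ID",
    "R2_DATASET_BUCKET_NAME",
    "R2_DATASET_READ_ACCESS_KEY_ID",
    "R2_DATASET_READ_SECRET_ACCESS_KEY",
)

def check_environment() -> None:
    """Fail fast if any required R2 credential is missing; the InfluxDB token is optional."""
    missing = [name for name in REQUIRED_ENV if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
    if not os.environ.get("INFLUXDB_TOKEN"):
        print("INFLUXDB_TOKEN not set; metrics logging may be disabled")
```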
Metrics Logging
The Evaluator logs detailed metrics to InfluxDB:
- Benchmark Metrics: Runtime performance and execution details
- Task Results: Individual scores for each benchmark task
- Summary Metrics: Aggregated statistics across all tasks
flowchart TD
subgraph "Metrics Collection"
Benchmark["Benchmark Metrics\n- Runtime\n- Exit codes"]
TaskScores["Task Scores\n- acc/acc_norm by task"]
Summary["Summary Metrics\n- Number of tasks\n- Window/block info"]
end
subgraph "MetricsLogger"
Log["log() method"]
Tags["Metric Tags\n- window\n- block\n- global_step\n- task"]
Fields["Metric Fields\n- scores\n- runtime\n- count"]
end
subgraph "Storage"
InfluxDB["InfluxDB\nTime-series database"]
end
Benchmark --> Log
TaskScores --> Log
Summary --> Log
Log --> Tags
Log --> Fields
Log --> InfluxDB
Sources: scripts/evaluator.py:368-441
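A hedged example of how a single task score might be pushed to InfluxDB, assuming a MetricsLogger.log(measurement, tags, fields) style interface like the one in the class diagram (the exact signature and field names are assumptions):
```python
def log_task_result(metrics_logger, task: str, score: float,
                    window: int, block: int, global_step: int, runtime_s: float) -> None:
    """Push one benchmark score to InfluxDB with the tags used for dashboard filtering."""
    metrics_logger.log(
        measurement="benchmark_task",
        tags={
            "task": task,              # e.g. "arc_challenge"
            "window": window,          # checkpoint window being evaluated
            "block": block,            # chain block at evaluation time
            "global_step": global_step,
        },
        fields={
            "score": score,            # acc_norm where available, otherwise acc
            "runtime_s": runtime_s,    # wall-clock time of the lm-eval run
        },
    )
```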
Integration with Other Systems
The Evaluator integrates with several other components of the Templar framework:
- Comms System: For checkpoint retrieval and blockchain interaction
- Metrics System: For logging evaluation results
- Storage System: For accessing model checkpoints
It operates independently of miners and validators but provides crucial feedback on the quality of the model being trained by the network.
Sources: scripts/evaluator.py:184-196