Validators

Relevant Source Files

README.md docs/miner.md docs/validator.md ecosystem.config.js hparams.json neurons/miner.py neurons/validator.py src/tplr/__init__.py src/tplr/comms.py tests/test_checkpoints.py tests/test_evaluator.py tests/test_prepare_gradient_dict.py

Validators are a critical component of the Templar decentralized training framework. They are responsible for evaluating miners’ gradient contributions, assessing their quality, and setting weights on the Bittensor blockchain to determine reward distribution. This page details the architecture, functionality, and operation of validators within the Templar system.

For information about the miners that validators evaluate, see Miners. For details on how validators set weights on the blockchain, see Weight Setting.

Validator Architecture

Validators maintain model state, evaluate gradients from miners, and update weights on the Bittensor blockchain. The architectural design enables efficient evaluation of multiple miners while maintaining system integrity.

graph TD
    subgraph "Validator Core Components"
        VM["Model (LlamaForCausalLM)"]
        VO["Optimizer (SGD)"]
        VS["Scheduler"]
        VT["Momentum Tracker"]
        VR["OpenSkill Rating System"]
    end

    subgraph "Evaluation System"
        GE["Gradient Evaluation"]
        SS["Sync Score Calculation"]
        BS["Binary Score Tracking"]
        FS["Final Score Calculation"]
        WN["Weight Normalization"]
    end

    subgraph "Communication"
        CM["Comms Module"]
        CG["Gradient Gathering"]
        CB["Blockchain Integration"]
        CP["Checkpoint Management"]
    end

    VM --> GE
    VO --> GE
    VS --> GE
    VT --> GE
    GE --> VR
    
    CG --> GE
    SS --> FS
    BS --> FS
    VR --> FS
    FS --> WN
    WN --> CB

    CM <--> CG
    CM <--> CB
    CM <--> CP

Sources: neurons/validator.py:85-267 , neurons/validator.py:355-480

Validator Initialization

Validators initialize with a model identical to miners, along with compression systems for efficient gradient processing, and integration with the Bittensor network.

sequenceDiagram
    participant V as "Validator"
    participant BT as "Bittensor Network"
    participant R2 as "R2 Storage"
    participant M as "Model & Components"
    
    V->>BT: Initialize wallet and subtensor
    V->>BT: Get metagraph and UID
    V->>M: Initialize LLaMA model
    V->>M: Setup DCT compression
    V->>M: Initialize optimizer & scheduler
    V->>M: Setup momentum tracking
    V->>V: Initialize OpenSkill ratings
    V->>R2: Setup communication module
    V->>R2: Create & commit to buckets
    V->>V: Initialize scoring system
    V->>BT: Verify registration

The Validator class is initialized with several key components:

Bittensor Network Integration: Wallet, subtensor, and metagraph objects for blockchain interaction
Model: LlamaForCausalLM instance identical to what miners use
Compression: DCT transformation and compression for efficient gradient handling
Optimizer and Momentum: SGD optimizer and momentum tracking for gradient evaluation
Rating System: OpenSkill-based rating system for evaluating miner contributions
Storage Integration: R2 bucket communication for gradient exchange

Sources: neurons/validator.py:126-267

Validation Workflow

The validator’s main operation revolves around evaluating miner gradients by checking how they affect model performance.

flowchart TD
    subgraph "Validator Operation Cycle"
        direction TB
        A["Sync Window Start"] --> B["Update Peers"]
        B --> C["Gather Gradients"]
        C --> D["Evaluate Miner Sync"]
        D --> E["Evaluate Gradients"]
        E --> F["Update Ratings"]
        F --> G["Calculate Scores"]
        G --> H["Set Weights on Chain"]
        H --> I["Next Window"]
        I --> A
    end

    subgraph "Gradient Evaluation"
        direction TB
        E1["Select Miner"] --> E2["Get Dataset Pages"]
        E2 --> E3["Measure Loss Before"]
        E3 --> E4["Apply Miner Gradient"]
        E4 --> E5["Measure Loss After"]
        E5 --> E6["Calculate Improvement"]
        E6 --> E7["Store Evaluation Result"]
    end

    subgraph "Weight Setting"
        direction TB
        G1["Binary Score Update"] --> G2["Sync Score Update"]
        G2 --> G3["Apply OpenSkill Ratings"]
        G3 --> G4["Apply Power Normalization"]
        G4 --> G5["Finalize Weights"]
    end

    E --> E1
    G --> G1

Sources: neurons/validator.py:516-635 . neurons/validator.py:695-787 , neurons/validator.py:374-445

Gradient Evaluation Process

Validators assess miners by measuring the improvement in model performance after applying their gradients:

Gather Miner Gradients: Validators collect compressed gradients from miners for the current window
Decompress Gradients: Transform the compressed gradients back to usable form
Evaluate Improvement: Apply the gradients to the model and measure improvement in loss
Calculate Scores: Determine quality scores based on the measured improvement
Update Ratings: Update miner ratings using the OpenSkill system

graph TD
    subgraph "Gradient Evaluation Flow"
        Begin["Start Evaluation"] --> GatherGrad["Gather Miner Gradients"]
        GatherGrad --> Decompress["Decompress Gradients"]
        Decompress --> EvalBeforeLoss["Evaluate Loss Before"]
        EvalBeforeLoss --> ApplyGrad["Apply Gradients to Model"]
        ApplyGrad --> EvalAfterLoss["Evaluate Loss After"]
        EvalAfterLoss --> CalcImprovement["Calculate Improvement\nLossBefore - LossAfter"]
        CalcImprovement --> UpdateScore["Update Miner Score"]
        UpdateScore --> OpenSkill["Update OpenSkill Rating"]
        OpenSkill --> SyncEval["Evaluate Model Sync"]
        SyncEval --> FinalScore["Calculate Final Score"]
    end

Key components of the evaluation process:

Loss Calculation: evaluate_model_on_batches() calculates loss on the same dataset used by the miner
Improvement Metric: Improvement is measured as the difference between loss before and after applying gradients
Batch Sampling: Validators sample a subset of batches to efficiently evaluate performance
OpenSkill Rating: The PlackettLuce model updates ratings based on relative performance

Sources: neurons/validator.py:489-514 , neurons/validator.py:374-445

Scoring Mechanisms

Validators use multiple scoring components to evaluate miners:

OpenSkill Rating System

The OpenSkill rating system provides a probabilistic skill rating that accounts for uncertainty and relative performance between peers:

# Each miner has an OpenSkill rating maintained by validators
openskill_mu = float(self.openskill_ratings[uid].mu)         # Mean skill
openskill_sigma = float(self.openskill_ratings[uid].sigma)   # Uncertainty
openskill_ordinal = float(self.openskill_ratings[uid].ordinal()) # Combined score

Validators update these ratings based on gradient evaluation results, using the PlackettLuce model where higher gradient scores indicate better performance.

Score Components

Multiple scoring components are combined for the final weight calculation:

Gradient Scores: Direct measurement of loss improvement
Binary Indicator Scores: Tracks whether contributions are consistently positive
Sync Scores: Measures how well miners stay synchronized with the global model
Final Scores: Combination of all metrics that determines weights

graph TD
    subgraph "Score Calculation System"
        GS["Gradient Score\nLoss Improvement Measurement"] --> BMA["Binary Moving Average\nPositive Contribution Tracking"]
        SYS["Sync Score\nModel Synchronization Quality"] --> FS["Final Score Calculation"]
        BMA --> FS
        OSR["OpenSkill Rating\nRelative Performance"] --> FS
        FS --> WN["Weight Normalization\nPower Normalization"]
        WN --> SW["Set Weights on Blockchain"]
    end

The final score calculation combines:

# Final score formula
self.final_scores[uid] = (
    openskill_ordinal *
    max(0, self.binary_moving_averages[uid].item()) *
    sync_score
)

Sources: neurons/validator.py:374-445 , neurons/validator.py:356-380

Handling Inactivity and Penalties

Validators manage peer inactivity through a sophisticated penalty system:

flowchart TD
    subgraph "Inactivity Management"
        CheckActivity["Check Miner Activity"] --> IsActive{"Is Miner Active?"}
        IsActive -->|Yes| RemoveFromInactive["Remove from Inactive List"]
        IsActive -->|No| TrackInactive["Track Inactivity Period"]
        
        TrackInactive --> LongInactive{"Inactive > Reset\nThreshold?"}
        LongInactive -->|Yes| ResetPeer["Reset Peer\nZero All Scores"]
        LongInactive -->|No| ApplyPenalty["Apply Inactivity Penalty\n25% Score Reduction"]
        
        subgraph "Additional Penalties"
            MG["Missing Gradient\n75% Score Reduction"]
            SS["Sync Score Violation\n75% Score Reduction"]
        end
    end

Key inactivity handling mechanisms:

Tracking System: Validators track when miners become inactive
Graduated Penalties: Scores are reduced by 25% per window of inactivity
Complete Reset: After extended inactivity (25 windows), scores are completely reset
Additional Penalties:
- Missing gradients during gather: 75% score reduction
- Poor model synchronization: 75% score reduction

Sources: neurons/validator.py:302-315 , neurons/validator.py:706-733

Checkpoint Management

Validators are responsible for managing checkpoints that maintain the global model state:

graph TD
    subgraph "Checkpoint Management"
        SL["Save Logic"] --> FC["Frequency Check\nEvery checkpoint_frequency Windows"]
        FC -->|"Yes"| SC["Save Checkpoint"]
        FC -->|"No"| Skip["Skip Saving"]
        
        LL["Load Logic"] --> FE["Find Existing Checkpoint"]
        FE --> LoadCP["Load Checkpoint"]
        LoadCP --> SyncCheck["Check If Behind Current Window"]
        SyncCheck -->|"Yes"| Catchup["Catch Up with Aggregation Server"]
        SyncCheck -->|"No"| Continue["Continue Normal Operation"]
    end

The checkpoint system ensures:

State Persistence: Model parameters, optimizer state, and momentum are preserved
Consistent Startup: Validators can recover from the last saved state
Synchronization: Validators that fall behind can catch up to the current window
Global Consistency: All validators operate on a consistent model state

Sources: neurons/validator.py:576-623

Peer Management and Evaluation

Validators strategically manage which miners to evaluate and interact with:

flowchart TD
    subgraph "Peer Management System"
        IP["Initial Peer Selection"] --> RPS["Regular Peer Selection"]
        RPS --> PP["Post Peer List"]
        PP --> EP["Evaluate Peers"]
        
        subgraph "Selection Strategy"
            TS["Topk Selection\nHighest-weighted peers"]
            RS["Random Selection\nExploration"]
            PS["Prioritized Sampling\nFair evaluation"]
        end
    end

Validators employ strategies to:

Balance Exploration and Exploitation: Sample both high-performing and untested miners
Ensure Fair Evaluation: Distribute evaluation opportunities evenly
Maintain Network Health: Regularly replace peers to prevent network stagnation
Post Peer Lists: Share selected peers with the network via R2 storage

Sources: neurons/validator.py:642-704

Communication System

The validator’s communication system handles interaction with the blockchain, storage systems, and other network components:

flowchart TD
    subgraph "Validator Communication"
        CM["Comms Module"] --> R2["R2 Storage Integration"]
        CM --> BT["Bittensor Blockchain"]
        CM --> AS["Aggregation Server"]
        
        R2 --> GradientsB["Gradients Bucket"]
        R2 --> CheckpointsB["Checkpoints Bucket"]
        R2 --> PeersB["Peer Lists Storage"]
        
        GradientsB --> GG["Gradient Gathering"]
        CheckpointsB --> CPM["Checkpoint Management"]
        PeersB --> PLM["Peer List Management"]
        
        BT --> WS["Weight Setting"]
        BT --> MM["Metagraph Monitoring"]
    end

Key communication functions include:

Gradient Exchange: Gathering miner gradients from R2 storage
Checkpoint Management: Loading and saving model checkpoints
Peer List Posting: Sharing selected peers for evaluation
Blockchain Integration: Setting weights and monitoring network state
Aggregation Server Integration: Synchronizing with global model state

Sources: neurons/validator.py:831-860 , src/tplr/comms.py:64-682

Environment Requirements

Validators have specific hardware and software requirements:

Component	Requirement	Notes
GPU	NVIDIA H100 (recommended)	Minimum 80GB VRAM
Storage	200GB+ SSD	For model and evaluation data
RAM	32GB+	For efficient processing
Network	High bandwidth	For state synchronization
Software	PyTorch, Bittensor	With CUDA support

Additionally, validators require Cloudflare R2 bucket configuration for gradient exchange and checkpoint management.

Sources: docs/validator.md:306-313

Configuration and Setup

Validators are configured via command-line arguments and environment variables:

# Key configuration options
parser.add_argument("--netuid", type=int, default=268, help="Bittensor network UID.")
parser.add_argument("--device", type=str, default="cuda", help="Device for training")
parser.add_argument("--store-gathers", action="store_true", help="Store gathered gradients")
parser.add_argument("--test", action="store_true", help="Test mode - use all peers")
parser.add_argument("--local", action="store_true", help="Local run with toy model")

Environment variables control R2 storage credentials, network configuration, and monitoring settings.

The Validator can also be deployed using Docker Compose for easier management.

Sources: neurons/validator.py:86-124 , docs/validator.md:116-150

The validator integrates with several other Templar systems:

Miners: The nodes that validators evaluate (Miners)
Aggregation Server: Provides synchronized model state (Aggregation Server)
Bittensor Network: Blockchain for weight setting (Chain Integration)
R2 Storage: Communication medium for gradient exchange (R2 Storage)
Monitoring: Performance tracking via WandB and InfluxDB (Monitoring and Telemetry)

Conclusion

Validators are a cornerstone of the Templar framework, providing the critical evaluation mechanism that drives the incentive system. By accurately assessing miner contributions, validators ensure that high-quality gradients are rewarded, maintaining the integrity and performance of the collectively trained model.

The validator’s sophisticated scoring and rating systems, combined with efficient communication and checkpoint management, create a robust framework for decentralized model training that aligns individual incentives with collective performance goals.