Validators

Validators are a critical component of the Templar decentralized training framework. They are responsible for evaluating miners’ gradient contributions, assessing their quality, and setting weights on the Bittensor blockchain to determine reward distribution. This page details the architecture, functionality, and operation of validators within the Templar system.

For information about the miners that validators evaluate, see Miners. For details on how validators set weights on the blockchain, see Weight Setting.

Validators maintain model state, evaluate gradients from miners, and update weights on the Bittensor blockchain. The architectural design enables efficient evaluation of multiple miners while maintaining system integrity.

graph TD
    subgraph "Validator Core Components"
        VM["Model (LlamaForCausalLM)"]
        VO["Optimizer (SGD)"]
        VS["Scheduler"]
        VT["Momentum Tracker"]
        VR["OpenSkill Rating System"]
    end

    subgraph "Evaluation System"
        GE["Gradient Evaluation"]
        SS["Sync Score Calculation"]
        BS["Binary Score Tracking"]
        FS["Final Score Calculation"]
        WN["Weight Normalization"]
    end

    subgraph "Communication"
        CM["Comms Module"]
        CG["Gradient Gathering"]
        CB["Blockchain Integration"]
        CP["Checkpoint Management"]
    end

    VM --> GE
    VO --> GE
    VS --> GE
    VT --> GE
    GE --> VR
    
    CG --> GE
    SS --> FS
    BS --> FS
    VR --> FS
    FS --> WN
    WN --> CB

    CM <--> CG
    CM <--> CB
    CM <--> CP

Sources: neurons/validator.py:85-267, neurons/validator.py:355-480

Validators initialize with a model identical to the one miners use, compression systems for efficient gradient processing, and integration with the Bittensor network.

sequenceDiagram
    participant V as "Validator"
    participant BT as "Bittensor Network"
    participant R2 as "R2 Storage"
    participant M as "Model & Components"
    
    V->>BT: Initialize wallet and subtensor
    V->>BT: Get metagraph and UID
    V->>M: Initialize LLaMA model
    V->>M: Setup DCT compression
    V->>M: Initialize optimizer & scheduler
    V->>M: Setup momentum tracking
    V->>V: Initialize OpenSkill ratings
    V->>R2: Setup communication module
    V->>R2: Create & commit to buckets
    V->>V: Initialize scoring system
    V->>BT: Verify registration

The Validator class is initialized with several key components:

  1. Bittensor Network Integration: Wallet, subtensor, and metagraph objects for blockchain interaction
  2. Model: LlamaForCausalLM instance identical to what miners use
  3. Compression: DCT transformation and compression for efficient gradient handling
  4. Optimizer and Momentum: SGD optimizer and momentum tracking for gradient evaluation
  5. Rating System: OpenSkill-based rating system for evaluating miner contributions
  6. Storage Integration: R2 bucket communication for gradient exchange

Sources: neurons/validator.py:126-267

The validator’s main operation revolves around evaluating miner gradients by checking how they affect model performance.

flowchart TD
    subgraph "Validator Operation Cycle"
        direction TB
        A["Sync Window Start"] --> B["Update Peers"]
        B --> C["Gather Gradients"]
        C --> D["Evaluate Miner Sync"]
        D --> E["Evaluate Gradients"]
        E --> F["Update Ratings"]
        F --> G["Calculate Scores"]
        G --> H["Set Weights on Chain"]
        H --> I["Next Window"]
        I --> A
    end

    subgraph "Gradient Evaluation"
        direction TB
        E1["Select Miner"] --> E2["Get Dataset Pages"]
        E2 --> E3["Measure Loss Before"]
        E3 --> E4["Apply Miner Gradient"]
        E4 --> E5["Measure Loss After"]
        E5 --> E6["Calculate Improvement"]
        E6 --> E7["Store Evaluation Result"]
    end

    subgraph "Weight Setting"
        direction TB
        G1["Binary Score Update"] --> G2["Sync Score Update"]
        G2 --> G3["Apply OpenSkill Ratings"]
        G3 --> G4["Apply Power Normalization"]
        G4 --> G5["Finalize Weights"]
    end

    E --> E1
    G --> G1

Sources: neurons/validator.py:516-635, neurons/validator.py:695-787, neurons/validator.py:374-445

Validators assess miners by measuring the improvement in model performance after applying their gradients:

  1. Gather Miner Gradients: Validators collect compressed gradients from miners for the current window
  2. Decompress Gradients: Transform the compressed gradients back to usable form
  3. Evaluate Improvement: Apply the gradients to the model and measure improvement in loss
  4. Calculate Scores: Determine quality scores based on the measured improvement
  5. Update Ratings: Update miner ratings using the OpenSkill system

graph TD
    subgraph "Gradient Evaluation Flow"
        Begin["Start Evaluation"] --> GatherGrad["Gather Miner Gradients"]
        GatherGrad --> Decompress["Decompress Gradients"]
        Decompress --> EvalBeforeLoss["Evaluate Loss Before"]
        EvalBeforeLoss --> ApplyGrad["Apply Gradients to Model"]
        ApplyGrad --> EvalAfterLoss["Evaluate Loss After"]
        EvalAfterLoss --> CalcImprovement["Calculate Improvement\nLossBefore - LossAfter"]
        CalcImprovement --> UpdateScore["Update Miner Score"]
        UpdateScore --> OpenSkill["Update OpenSkill Rating"]
        OpenSkill --> SyncEval["Evaluate Model Sync"]
        SyncEval --> FinalScore["Calculate Final Score"]
    end

Key components of the evaluation process:

  • Loss Calculation: evaluate_model_on_batches() calculates loss on the same dataset used by the miner
  • Improvement Metric: Improvement is measured as the difference between loss before and after applying gradients
  • Batch Sampling: Validators sample a subset of batches to efficiently evaluate performance
  • OpenSkill Rating: The PlackettLuce model updates ratings based on relative performance

Sources: neurons/validator.py:489-514, neurons/validator.py:374-445
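
The loss-before/loss-after comparison above can be sketched as follows. This is a minimal, framework-free illustration: parameters and gradients are plain name-to-float dicts standing in for tensors, and `gradient_score` and `loss_fn` are hypothetical names, not Templar's actual API:

```python
def gradient_score(params, grad, loss_fn, lr=0.01):
    """Loss-before minus loss-after for a miner's gradient.

    `params` and `grad` map parameter names to floats (stand-ins for
    tensors); `loss_fn(params)` evaluates the model on the validator's
    sampled batches. All names here are illustrative.
    """
    loss_before = loss_fn(params)
    # Apply the miner's gradient as a plain SGD step.
    stepped = {k: v - lr * grad.get(k, 0.0) for k, v in params.items()}
    loss_after = loss_fn(stepped)
    return loss_before - loss_after  # positive => the gradient helped


# Toy check: with loss(w) = w^2 the true gradient at w=1 is 2,
# so a gradient pointing downhill scores positive.
loss = lambda p: p["w"] ** 2
assert gradient_score({"w": 1.0}, {"w": 2.0}, loss) > 0    # helpful gradient
assert gradient_score({"w": 1.0}, {"w": -2.0}, loss) < 0   # harmful gradient
```

A positive score feeds into the miner's binary indicator and OpenSkill rating; a negative score indicates the gradient made the model worse on the same data the miner trained on.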

Validators use multiple scoring components to evaluate miners:

The OpenSkill rating system provides a probabilistic skill rating that accounts for uncertainty and relative performance between peers:

# Each miner has an OpenSkill rating maintained by validators
openskill_mu = float(self.openskill_ratings[uid].mu) # Mean skill
openskill_sigma = float(self.openskill_ratings[uid].sigma) # Uncertainty
openskill_ordinal = float(self.openskill_ratings[uid].ordinal()) # Combined score

Validators update these ratings based on gradient evaluation results, using the PlackettLuce model where higher gradient scores indicate better performance.
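
The `ordinal()` value used above follows OpenSkill's usual convention of a conservative skill estimate: the mean minus a multiple of the uncertainty. A minimal sketch, assuming the library's default multiplier of z = 3:

```python
def ordinal(mu, sigma, z=3.0):
    """Conservative skill estimate: mean minus z standard deviations.

    Mirrors the convention behind OpenSkill's ordinal(): a rating with
    high uncertainty (large sigma) ranks cautiously low until more
    evaluations accumulate, even if its mean is high.
    """
    return mu - z * sigma
```

For example, a freshly initialized rating (mu = 25, sigma = 25/3 in OpenSkill's defaults) has an ordinal of 0, and between two miners with equal means the one the validator has evaluated more often (smaller sigma) ranks higher.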

Multiple scoring components are combined for the final weight calculation:

  1. Gradient Scores: Direct measurement of loss improvement
  2. Binary Indicator Scores: Tracks whether contributions are consistently positive
  3. Sync Scores: Measures how well miners stay synchronized with the global model
  4. Final Scores: Combination of all metrics that determines weights

graph TD
    subgraph "Score Calculation System"
        GS["Gradient Score\nLoss Improvement Measurement"] --> BMA["Binary Moving Average\nPositive Contribution Tracking"]
        SYS["Sync Score\nModel Synchronization Quality"] --> FS["Final Score Calculation"]
        BMA --> FS
        OSR["OpenSkill Rating\nRelative Performance"] --> FS
        FS --> WN["Weight Normalization\nPower Normalization"]
        WN --> SW["Set Weights on Blockchain"]
    end

The final score calculation combines:

# Final score formula
self.final_scores[uid] = (
    openskill_ordinal
    * max(0, self.binary_moving_averages[uid].item())
    * sync_score
)

Sources: neurons/validator.py:374-445, neurons/validator.py:356-380
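
To make the pipeline concrete, here is a small sketch of the final-score formula above followed by power normalization. The exponent value and the function names are assumptions for illustration, not Templar's exact implementation:

```python
def final_score(ordinal_value, binary_moving_avg, sync_score):
    """Combine the three components as in the formula above.

    A negative binary moving average is clipped to zero, so a miner
    with consistently negative contributions receives no weight.
    """
    return ordinal_value * max(0.0, binary_moving_avg) * sync_score


def power_normalize(scores, power=2.0):
    """Raise scores to `power`, then normalize them to sum to 1.

    The exponent sharpens the distribution so stronger miners receive
    disproportionately more weight; power=2.0 is an assumed value.
    """
    powered = [max(0.0, s) ** power for s in scores]
    total = sum(powered)
    if total == 0.0:
        return [0.0] * len(scores)
    return [p / total for p in powered]
```

The normalized values are what the validator submits to the chain as weights, so they must be non-negative and sum to at most 1.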

Validators manage peer inactivity through a sophisticated penalty system:

flowchart TD
    subgraph "Inactivity Management"
        CheckActivity["Check Miner Activity"] --> IsActive{"Is Miner Active?"}
        IsActive -->|Yes| RemoveFromInactive["Remove from Inactive List"]
        IsActive -->|No| TrackInactive["Track Inactivity Period"]
        
        TrackInactive --> LongInactive{"Inactive > Reset\nThreshold?"}
        LongInactive -->|Yes| ResetPeer["Reset Peer\nZero All Scores"]
        LongInactive -->|No| ApplyPenalty["Apply Inactivity Penalty\n25% Score Reduction"]
        
        subgraph "Additional Penalties"
            MG["Missing Gradient\n75% Score Reduction"]
            SS["Sync Score Violation\n75% Score Reduction"]
        end
    end

Key inactivity handling mechanisms:

  1. Tracking System: Validators track when miners become inactive
  2. Graduated Penalties: Scores are reduced by 25% per window of inactivity
  3. Complete Reset: After extended inactivity (25 windows), scores are completely reset
  4. Additional Penalties:
    • Missing gradients during gather: 75% score reduction
    • Poor model synchronization: 75% score reduction

Sources: neurons/validator.py:302-315, neurons/validator.py:706-733
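
A minimal sketch of the graduated penalty logic described above; the constant names and the exact compounding of the 25% per-window reduction are assumptions:

```python
INACTIVITY_PENALTY = 0.25      # per-window score reduction while inactive
MISSING_GRADIENT_PENALTY = 0.75  # applied when a gather returns no gradient
SYNC_VIOLATION_PENALTY = 0.75    # applied on poor model synchronization
RESET_THRESHOLD = 25             # windows of inactivity before a full reset


def apply_inactivity(score, windows_inactive):
    """Decay a miner's score per inactive window; reset past the threshold."""
    if windows_inactive > RESET_THRESHOLD:
        return 0.0  # complete reset: all scores zeroed
    return score * (1.0 - INACTIVITY_PENALTY) ** windows_inactive
```

Under this sketch one missed window leaves 75% of the score, two leave about 56%, and anything past 25 windows zeroes the miner entirely, forcing it to rebuild its reputation from scratch.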

Validators are responsible for managing checkpoints that maintain the global model state:

graph TD
    subgraph "Checkpoint Management"
        SL["Save Logic"] --> FC["Frequency Check\nEvery checkpoint_frequency Windows"]
        FC -->|"Yes"| SC["Save Checkpoint"]
        FC -->|"No"| Skip["Skip Saving"]
        
        LL["Load Logic"] --> FE["Find Existing Checkpoint"]
        FE --> LoadCP["Load Checkpoint"]
        LoadCP --> SyncCheck["Check If Behind Current Window"]
        SyncCheck -->|"Yes"| Catchup["Catch Up with Aggregation Server"]
        SyncCheck -->|"No"| Continue["Continue Normal Operation"]
    end

The checkpoint system ensures:

  1. State Persistence: Model parameters, optimizer state, and momentum are preserved
  2. Consistent Startup: Validators can recover from the last saved state
  3. Synchronization: Validators that fall behind can catch up to the current window
  4. Global Consistency: All validators operate on a consistent model state

Sources: neurons/validator.py:576-623
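
The save-frequency check and checkpoint contents described above can be sketched as follows; the field names are illustrative, and the real layout lives in neurons/validator.py:

```python
def should_checkpoint(window, checkpoint_frequency):
    """Save only on windows that land on the configured frequency."""
    return window % checkpoint_frequency == 0


def make_checkpoint(model_state, optimizer_state, momentum, window):
    """Bundle everything a validator needs to resume from this window.

    On restart, a validator loads the latest such bundle; if its
    `window` trails the current chain window, the validator catches up
    via the aggregation server before resuming evaluation.
    """
    return {
        "model": model_state,
        "optimizer": optimizer_state,
        "momentum": momentum,
        "window": window,
    }
```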

Validators strategically manage which miners to evaluate and interact with:

flowchart TD
    subgraph "Peer Management System"
        IP["Initial Peer Selection"] --> RPS["Regular Peer Selection"]
        RPS --> PP["Post Peer List"]
        PP --> EP["Evaluate Peers"]
        
        subgraph "Selection Strategy"
            TS["Topk Selection\nHighest-weighted peers"]
            RS["Random Selection\nExploration"]
            PS["Prioritized Sampling\nFair evaluation"]
        end
    end

Validators employ strategies to:

  1. Balance Exploration and Exploitation: Sample both high-performing and untested miners
  2. Ensure Fair Evaluation: Distribute evaluation opportunities evenly
  3. Maintain Network Health: Regularly replace peers to prevent network stagnation
  4. Post Peer Lists: Share selected peers with the network via R2 storage

Sources: neurons/validator.py:642-704
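
The exploration/exploitation balance can be sketched as a top-k-plus-random sampler; the split sizes and the function name are assumed for illustration:

```python
import random


def select_peers(weights, k_top=5, k_random=2, seed=None):
    """Pick the k_top highest-weighted peers plus k_random others.

    `weights` maps uid -> current weight. Top-k picks exploit known
    performers; the random picks give untested miners a chance to be
    evaluated, preventing the peer set from stagnating.
    """
    rng = random.Random(seed)
    ranked = sorted(weights, key=weights.get, reverse=True)
    top = ranked[:k_top]
    rest = ranked[k_top:]
    explore = rng.sample(rest, min(k_random, len(rest)))
    return top + explore
```

The resulting list is what the validator posts to R2 storage as the peer list for the window.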

The validator’s communication system handles interaction with the blockchain, storage systems, and other network components:

flowchart TD
    subgraph "Validator Communication"
        CM["Comms Module"] --> R2["R2 Storage Integration"]
        CM --> BT["Bittensor Blockchain"]
        CM --> AS["Aggregation Server"]
        
        R2 --> GradientsB["Gradients Bucket"]
        R2 --> CheckpointsB["Checkpoints Bucket"]
        R2 --> PeersB["Peer Lists Storage"]
        
        GradientsB --> GG["Gradient Gathering"]
        CheckpointsB --> CPM["Checkpoint Management"]
        PeersB --> PLM["Peer List Management"]
        
        BT --> WS["Weight Setting"]
        BT --> MM["Metagraph Monitoring"]
    end

Key communication functions include:

  1. Gradient Exchange: Gathering miner gradients from R2 storage
  2. Checkpoint Management: Loading and saving model checkpoints
  3. Peer List Posting: Sharing selected peers for evaluation
  4. Blockchain Integration: Setting weights and monitoring network state
  5. Aggregation Server Integration: Synchronizing with global model state

Sources: neurons/validator.py:831-860, src/tplr/comms.py:64-682

Validators have specific hardware and software requirements:

| Component | Requirement               | Notes                         |
| --------- | ------------------------- | ----------------------------- |
| GPU       | NVIDIA H100 (recommended) | Minimum 80GB VRAM             |
| Storage   | 200GB+ SSD                | For model and evaluation data |
| RAM       | 32GB+                     | For efficient processing      |
| Network   | High bandwidth            | For state synchronization     |
| Software  | PyTorch, Bittensor        | With CUDA support             |

Additionally, validators require Cloudflare R2 bucket configuration for gradient exchange and checkpoint management.

Sources: docs/validator.md:306-313

Validators are configured via command-line arguments and environment variables:

# Key configuration options
parser.add_argument("--netuid", type=int, default=268, help="Bittensor network UID.")
parser.add_argument("--device", type=str, default="cuda", help="Device for training")
parser.add_argument("--store-gathers", action="store_true", help="Store gathered gradients")
parser.add_argument("--test", action="store_true", help="Test mode - use all peers")
parser.add_argument("--local", action="store_true", help="Local run with toy model")

Environment variables control R2 storage credentials, network configuration, and monitoring settings.

The Validator can also be deployed using Docker Compose for easier management.

Sources: neurons/validator.py:86-124, docs/validator.md:116-150

The validator integrates with several other Templar systems:

  1. Miners: The nodes that validators evaluate (Miners)
  2. Aggregation Server: Provides synchronized model state (Aggregation Server)
  3. Bittensor Network: Blockchain for weight setting (Chain Integration)
  4. R2 Storage: Communication medium for gradient exchange (R2 Storage)
  5. Monitoring: Performance tracking via WandB and InfluxDB (Monitoring and Telemetry)

Validators are a cornerstone of the Templar framework, providing the critical evaluation mechanism that drives the incentive system. By accurately assessing miner contributions, validators ensure that high-quality gradients are rewarded, maintaining the integrity and performance of the collectively trained model.

The validator’s sophisticated scoring and rating systems, combined with efficient communication and checkpoint management, create a robust framework for decentralized model training that aligns individual incentives with collective performance goals.