System Architecture
This document provides a comprehensive overview of the Templar framework’s architecture, outlining its core components and their interactions. Templar is a decentralized training framework built on the Bittensor network that enables collaborative training of large language models across distributed compute resources.
For information about the incentive mechanisms that drive the distributed training process, see Incentive Design.
Core Components
```mermaid
graph TD
    subgraph Network["Network Layer"]
        BN["Bittensor Network"]
        MG["Metagraph"]
    end
    subgraph Node["Node Layer"]
        M["Miners"]
        V["Validators"]
        A["AggregationServer"]
    end
    subgraph Storage["Storage Layer"]
        R2["Cloudflare R2 Storage"]
        subgraph Buckets["Buckets"]
            GB["Gradients Bucket"]
            DB["Dataset Bucket"]
            AB["Aggregator Bucket"]
            CP["Checkpoint Storage"]
        end
    end
    subgraph Communication["Communication Layer"]
        CM["Comms Module"]
        CH["Chain Manager"]
    end

    M <--> BN
    V <--> BN
    BN <--> MG
    M <--> CM
    V <--> CM
    A <--> CM
    CM <--> CH
    CH <--> BN
    GB <--> CM
    DB <--> CM
    AB <--> CM
    CP <--> CM

    style Network fill:#f9f9f9
    style Node fill:#f9f9f9
    style Storage fill:#f9f9f9
    style Communication fill:#f9f9f9
```
Sources:
- neurons/miner.py:124-184
- neurons/validator.py:173-222
- neurons/aggregator.py:120-132
- src/tplr/comms.py:64-122
The Templar system consists of four main architectural layers:
- Network Layer: Provides the decentralized infrastructure via Bittensor blockchain
- Node Layer: Comprises the different types of nodes that perform specific roles
- Storage Layer: Offers distributed storage for gradients, datasets, and model checkpoints
- Communication Layer: Facilitates data exchange between nodes and storage
Node Types and Responsibilities
```mermaid
graph TD
    subgraph "Miner"
        M_Train["Model Training"]
        M_Grad["Gradient Computation"]
        M_Comp["Gradient Compression"]
        M_Post["Gradient Posting"]
    end
    subgraph "Validator"
        V_Gath["Gradient Gathering"]
        V_Eval["Gradient Evaluation"]
        V_Weight["Weight Setting"]
        V_Sync["Model Synchronization"]
    end
    subgraph "Aggregator"
        A_Gath["Gradient Collection"]
        A_Agg["Gradient Aggregation"]
        A_Post["Aggregation Posting"]
    end

    M_Train --> M_Grad --> M_Comp --> M_Post
    V_Gath --> V_Eval --> V_Weight --> V_Sync
    A_Gath --> A_Agg --> A_Post

    M_Post -.-> V_Gath
    M_Post -.-> A_Gath
    A_Post -.-> V_Sync
```
Miner Neurons
Miners are responsible for training the language model and generating gradients. Their key functions include:
- Loading training data from the dataset bucket
- Performing forward and backward passes through the model
- Compressing gradients using DCT-based compression
- Uploading compressed gradients to the gradient bucket
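The real training loop lives in neurons/miner.py; the snippet below is only a minimal sketch of the sequence listed above. The model, batch, `compress` helper, and `comms` client are stand-ins rather than Templar's actual objects.

```python
import torch

def miner_step(model, batch, optimizer, comms, uid, window, compress):
    """Sketch of one miner iteration: forward/backward pass, gradient
    compression, and upload. `compress` and `comms` are stand-ins for
    Templar's actual helpers."""
    # Forward and backward pass on the current batch
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()

    # Collect and compress per-parameter gradients (e.g. DCT + top-k)
    compressed = {
        name: compress(param.grad)
        for name, param in model.named_parameters()
        if param.grad is not None
    }

    # Upload the compressed gradients for this window, then step locally
    comms.put(state_dict=compressed, uid=uid, window=window, key="gradient")
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```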
Validator Neurons
Validators evaluate the quality of miners’ gradients and set weights on the blockchain. Their key functions include:
- Gathering gradients from miners
- Evaluating gradient quality by measuring loss improvement
- Calculating and setting weights on the blockchain
- Tracking miner performance with OpenSkill ratings
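The snippet below is a rough, illustrative sketch of the evaluation idea: score a gradient by the loss improvement it produces when applied to a copy of the model. The function names, the fixed learning rate, and the Hugging Face-style `outputs.loss` interface are assumptions, not the validator's actual API.

```python
import copy
import torch

def score_gradient(model, eval_batch, gradient, lr=1e-4):
    """Score a miner's gradient by the loss improvement it produces when
    applied to a copy of the validator's model (illustrative only)."""
    def eval_loss(m):
        with torch.no_grad():
            return m(**eval_batch).loss.item()

    loss_before = eval_loss(model)

    # Apply the (decompressed) gradient to a throwaway copy of the model
    candidate = copy.deepcopy(model)
    with torch.no_grad():
        for name, param in candidate.named_parameters():
            if name in gradient:
                param -= lr * gradient[name]

    loss_after = eval_loss(candidate)
    # A positive score means the gradient reduced the evaluation loss
    return loss_before - loss_after
```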
Aggregation Server
The aggregation server collects and combines gradients from miners, providing a consolidated source for model updates. Its key functions include:
- Gathering gradients from miners
- Aggregating gradients based on validator weights
- Storing aggregated gradients for validators and miners to use
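A minimal sketch of the weighted combination step, assuming gradients and validator weights are plain dictionaries keyed by miner UID; the real aggregator in neurons/aggregator.py works on compressed payloads and handles many more edge cases.

```python
import torch

def aggregate(gradients, weights):
    """Combine per-miner gradients into one update, weighted by each
    miner's validator-assigned weight (illustrative sketch).

    gradients: dict mapping uid -> {param_name: tensor}
    weights:   dict mapping uid -> float
    """
    total = sum(weights.get(uid, 0.0) for uid in gradients) or 1.0
    combined = {}
    for uid, grad in gradients.items():
        w = weights.get(uid, 0.0) / total
        for name, tensor in grad.items():
            if name not in combined:
                combined[name] = torch.zeros_like(tensor)
            combined[name] += w * tensor
    return combined
```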
Data Flow and Gradient Processing
```mermaid
flowchart TD
    subgraph "Miner Processing"
        MT["Model Training"] --> GC["Gradient Computation"]
        GC --> WD["Weight Decay"]
        WD --> MM["Momentum Update"]
        MM --> DCTS["DCT Transform"]
        DCTS --> COMP["Top-K Compression"]
        COMP --> UP["Upload to R2"]
    end
    subgraph "Validator Processing"
        DL["Download Gradients"] --> DECOMP["Decompress"]
        DECOMP --> EVAL["Evaluate Improvement"]
        EVAL --> SCORE["Calculate Score"]
        SCORE --> WT["Set Weights"]
    end
    subgraph "Aggregator Processing"
        AG_DL["Download Gradients"] --> AG_WT["Apply Weights"]
        AG_WT --> AG_COMB["Combine Gradients"]
        AG_COMB --> AG_UP["Upload Aggregation"]
    end

    UP --> DL
    UP --> AG_DL
    AG_UP --> EVAL
    WT --> AG_WT
```
Gradient Compression Mechanism
A key innovation in Templar is the DCT-based gradient compression system that enables efficient sharing of gradients across nodes.
- Transform DCT: Converts gradients to frequency domain using Discrete Cosine Transform
- Compression: Selects top-K coefficients for transmission
- Decompression: Reconstructs gradients from compressed representation
```mermaid
flowchart LR
    subgraph "DCT Compression Pipeline"
        direction LR
        GRAD["Original Gradient"] --> ENCODE["DCT Transform\nTransformDCT.encode()"]
        ENCODE --> COMPRESS["Top-K Selection\nCompressDCT.compress()"]
        COMPRESS --> TRANSMIT["Transmit\nidxs + vals"]
    end
    subgraph "DCT Decompression Pipeline"
        direction LR
        RECEIVE["Receive\nidxs + vals"] --> DECOMPRESS["Reconstruction\nCompressDCT.decompress()"]
        DECOMPRESS --> DECODE["Inverse DCT\nTransformDCT.decode()"]
        DECODE --> RECONST["Reconstructed Gradient"]
    end

    TRANSMIT --> RECEIVE
```
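The actual `TransformDCT` and `CompressDCT` classes operate per parameter tensor with chunking; the NumPy sketch below is a simplified, one-dimensional illustration of the same idea: transform, keep the top-K coefficients, transmit only their indices and values, then invert.

```python
import numpy as np
from scipy.fft import dct, idct

def compress_topk(grad: np.ndarray, k: int):
    """Sketch of DCT + top-k compression: transform the flattened gradient,
    keep the k largest-magnitude coefficients, and transmit indices + values."""
    coeffs = dct(grad.ravel(), norm="ortho")
    idxs = np.argsort(np.abs(coeffs))[-k:]   # indices of the top-k coefficients
    vals = coeffs[idxs]                       # their values
    return idxs, vals, grad.shape

def decompress_topk(idxs, vals, shape):
    """Rebuild a dense coefficient vector from (idxs, vals) and invert the DCT."""
    coeffs = np.zeros(int(np.prod(shape)))
    coeffs[idxs] = vals
    return idct(coeffs, norm="ortho").reshape(shape)

# Example: keep 32 of 1024 coefficients
g = np.random.randn(32, 32)
idxs, vals, shape = compress_topk(g, k=32)
g_hat = decompress_topk(idxs, vals, shape)
```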
Communication System
The communication system is the backbone of Templar, enabling data exchange between nodes and storage. It provides a unified interface for all data operations.
```mermaid
classDiagram
    class Comms {
        +wallet: Wallet
        +bucket: Bucket
        +session: Session
        +client_semaphore: Semaphore
        +put(state_dict, uid, window, key)
        +get(uid, window, key)
        +gather(uids, window, key)
        +load_checkpoint(model, optimizer, scheduler)
        +save_checkpoint(model, optimizer, scheduler)
        +post_peer_list(peers, window)
        +get_peer_list()
    }
    class ChainManager {
        +commitments: Dict
        +peers: PeerArray
        +eval_peers: Dict
        +fetch_commitments()
        +get_commitment(uid)
        +try_commit(wallet, bucket)
        +update_peers_with_buckets()
    }
    class Bucket {
        +name: str
        +account_id: str
        +access_key_id: str
        +secret_access_key: str
    }
    Comms --|> ChainManager : inherits
    Comms --> Bucket : uses
```
The `Comms` class provides the following key functions:
- Data Exchange: Methods for storing and retrieving data from R2 storage
- Gradient Gathering: Collects and processes gradients from multiple miners
- Checkpoint Management: Saves and loads model checkpoints
- Peer Management: Handles peer discovery and selection
The `ChainManager` is responsible for:
- Chain Commitments: Manages commitments to the Bittensor blockchain
- Bucket Authentication: Stores and retrieves bucket credentials
- Peer Management: Tracks active peers and updates peer lists
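Based only on the method signatures in the class diagram above, typical use of `Comms` within a window might look like the following sketch. Argument values are illustrative, and treating the methods as coroutines is an assumption of this sketch.

```python
async def run_window(comms, gradient, model, optimizer, scheduler,
                     uid, window, peer_uids):
    """Illustrative use of the Comms methods from the class diagram above."""
    # Upload this node's compressed gradient for the current window
    await comms.put(state_dict=gradient, uid=uid, window=window, key="gradient")

    # Gather gradients posted by selected peers for the same window
    gathered = await comms.gather(uids=peer_uids, window=window, key="gradient")

    # Persist training state so the node can recover or let others catch up
    await comms.save_checkpoint(model, optimizer, scheduler)
    return gathered
```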
Storage System
```mermaid
graph TD
    subgraph "R2 Storage"
        GB["Gradients Bucket"]
        DB["Dataset Bucket"]
        AB["Aggregator Bucket"]
        CP["Checkpoint Storage"]
    end
    subgraph "Gradient Files"
        GRAD_FILE["gradient-{window}-{uid}-v{version}.pt"]
    end
    subgraph "Dataset Files"
        DATA_FILE["page-{number}.parquet"]
    end
    subgraph "Aggregation Files"
        AGG_FILE["aggregation-{window}.pt"]
    end
    subgraph "Checkpoint Files"
        CKP_FILE["checkpoint-{window}-v{version}.pt"]
    end

    GB --> GRAD_FILE
    DB --> DATA_FILE
    AB --> AGG_FILE
    CP --> CKP_FILE
```
Storage in Templar is handled through Cloudflare R2, which provides a scalable and distributed storage solution. The system uses four main buckets:
- Gradients Bucket: Stores gradients uploaded by miners
- Dataset Bucket: Contains training data in Parquet format
- Aggregator Bucket: Stores aggregated gradients from the aggregation server
- Checkpoint Storage: Maintains model checkpoints for recovery and synchronization
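The storage diagram above implies a simple object-key naming scheme. The helpers below merely reproduce those patterns; the example version string is a placeholder.

```python
def gradient_key(window: int, uid: int, version: str) -> str:
    """Object key for a miner's gradient, following the naming pattern
    shown in the storage diagram above."""
    return f"gradient-{window}-{uid}-v{version}.pt"

def aggregation_key(window: int) -> str:
    return f"aggregation-{window}.pt"

def checkpoint_key(window: int, version: str) -> str:
    return f"checkpoint-{window}-v{version}.pt"

# e.g. gradient_key(1042, 17, "1.0.0") -> "gradient-1042-17-v1.0.0.pt"
```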
Integration with Bittensor Network
```mermaid
flowchart TD
    subgraph "Templar System"
        M["Miners"]
        V["Validators"]
        A["Aggregation Server"]
    end
    subgraph "Bittensor Network"
        SUB["Subtensor"]
        MG["Metagraph"]
        WT["Weight Setting"]
        STS["Stakes"]
    end

    M -->|"Register Hotkeys"| SUB
    V -->|"Register Hotkeys"| SUB
    V -->|"Set Weights"| WT
    SUB -->|"Update"| MG
    STS -->|"Influence"| MG
    MG -->|"Determine Rewards"| M
    MG -->|"Determine Influence"| V
```
The Templar system integrates with the Bittensor network through several mechanisms:
- Hotkey Registration: Miners and validators register their hotkeys on the subnet
- Weight Setting: Validators set weights on the blockchain based on gradient quality
- Metagraph Synchronization: Nodes periodically synchronize with the metagraph
- Chain Commitments: Nodes commit bucket information to the chain for discovery
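A minimal sketch of these chain interactions using the `bittensor` Python SDK. The netuid, wallet names, and weight values are placeholders, and the exact calls Templar wraps inside its Chain Manager may differ.

```python
import bittensor as bt

# Placeholder netuid for illustration only
NETUID = 3

def sync_and_set_weights(uids, weights, netuid=NETUID, network="finney"):
    """Resync the metagraph and commit validator weights on-chain (sketch)."""
    wallet = bt.wallet(name="validator", hotkey="default")
    subtensor = bt.subtensor(network=network)

    # Periodically resync the metagraph to see registered miners/validators
    metagraph = subtensor.metagraph(netuid)

    # After evaluating gradients, commit one weight per miner UID on-chain
    subtensor.set_weights(wallet=wallet, netuid=netuid, uids=uids, weights=weights)
    return metagraph

# e.g. sync_and_set_weights(uids=[1, 2, 3], weights=[0.5, 0.3, 0.2])
```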
Window-Based Training Cycle
```mermaid
sequenceDiagram
    participant M as Miner
    participant R2 as R2 Storage
    participant V as Validator
    participant A as Aggregator
    participant BT as Bittensor

    Note over M,BT: Window N
    M->>M: Train model on dataset
    M->>M: Prepare gradients
    M->>R2: Upload compressed gradients
    V->>R2: Gather gradients
    V->>V: Evaluate gradients
    V->>BT: Set weights on chain
    A->>R2: Collect gradients
    A->>A: Aggregate gradients
    A->>R2: Store aggregated gradients

    Note over M,BT: Window N+1
    M->>R2: Load aggregated gradients
    M->>M: Update model parameters
    V->>R2: Load aggregated gradients
    V->>V: Update model parameters
```
The Templar system operates on a window-based cycle, where each window corresponds to a specific number of blockchain blocks. The training process follows these steps:
- Miners:
  - Train models on data for the current window
  - Generate and compress gradients
  - Upload gradients to R2 storage
- Validators:
  - Gather gradients from miners
  - Evaluate gradient quality
  - Set weights on the blockchain
- Aggregator:
  - Collect gradients from miners
  - Aggregate gradients based on weights
  - Store aggregated result for next window
- Next Window:
  - All participants load aggregated gradients
  - Update model parameters
  - Begin next training cycle
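Assuming the window index is simply the block height integer-divided by `blocks_per_window` (see the hyperparameter table under Configuration Management), the mapping can be sketched as follows; the exact formula in the codebase may differ.

```python
def current_window(block_height: int, blocks_per_window: int) -> int:
    """Map a chain block height to a training window index
    (assumed formula: integer division)."""
    return block_height // blocks_per_window

def blocks_until_next_window(block_height: int, blocks_per_window: int) -> int:
    """How many blocks remain before the next window begins."""
    return blocks_per_window - (block_height % blocks_per_window)

# Example: with 100 blocks per window, block 4,237 falls in window 42
assert current_window(4237, 100) == 42
assert blocks_until_next_window(4237, 100) == 63
```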
Peer Selection and Management
```mermaid
flowchart TD
    subgraph "Validator"
        VP["Peer Selection"]
        VR["OpenSkill Ratings"]
        VW["Weight Setting"]
    end
    subgraph "Peer Management"
        PL["Peer List Generation"]
        PR["Peer Repository"]
        PC["Peer Commitment"]
    end
    subgraph "Nodes"
        M["Miners"]
        V["Other Validators"]
    end

    VR -->|"Influence"| VP
    VP -->|"Generate"| PL
    PL -->|"Store"| PR
    M -->|"Register"| PC
    V -->|"Register"| PC
    PC -->|"Inform"| PR
    PR -->|"Provide"| VP
```
Peer selection and management is a critical aspect of the Templar system:
- Peer Selection:
  - Validators select peers based on OpenSkill ratings
  - Ratings are calculated from gradient evaluation results
  - Peer lists are periodically updated to maintain system health
- Peer Commitment:
  - Nodes commit their bucket information to the blockchain
  - Commitments include account ID, access keys, and bucket names
  - Other nodes discover peers by reading these commitments
- Peer Update Mechanism:
  - New peer lists are generated with a future effective window
  - Nodes fetch peer lists and update them when appropriate
  - Inactive peers are pruned and replaced with active ones
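A small sketch of rating-based selection, assuming each miner's OpenSkill rating is summarized by (mu, sigma) and ranked by the conservative ordinal mu - 3*sigma; the validator's real selection logic also accounts for peer activity and replacement frequency.

```python
def select_peers(ratings, topk: int, minimum_peers: int):
    """Rank miners by a conservative OpenSkill estimate and keep the top-k.

    ratings: dict mapping uid -> (mu, sigma)
    """
    ordinal = {uid: mu - 3.0 * sigma for uid, (mu, sigma) in ratings.items()}
    ranked = sorted(ordinal, key=ordinal.get, reverse=True)
    k = max(topk, minimum_peers)
    return ranked[:k]

# Example: three miners with different rating estimates
peers = select_peers(
    {1: (27.0, 2.0), 2: (25.0, 8.0), 3: (30.0, 1.0)},
    topk=2,
    minimum_peers=1,
)
# -> [3, 1]: uid 2 is penalized for its high rating uncertainty
```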
Checkpoint Management
```mermaid
flowchart TB
    subgraph "Checkpoint Creation"
        ST["Model State"]
        OPT["Optimizer State"]
        SCH["Scheduler State"]
        MOM["Momentum"]
        META["Metadata"]
    end
    subgraph "Checkpoint Management"
        S["Save Checkpoint"]
        L["Load Checkpoint"]
        V["Version Control"]
    end
    subgraph "Model Synchronization"
        CU["Catchup Process"]
        AG["Aggregation Loading"]
        SY["Sync Verification"]
    end

    ST --> S
    OPT --> S
    SCH --> S
    MOM --> S
    META --> S
    S --> V
    V --> L
    L --> CU
    AG --> CU
    CU --> SY
```
The checkpoint management system in Templar ensures model consistency and enables recovery:
- Checkpoint Creation:
  - Saves model state, optimizer state, scheduler state, and momentum
  - Includes metadata such as the current window and version information
  - Checkpoints are created periodically based on configuration
- Checkpoint Loading:
  - Loads model state from checkpoints during initialization
  - Supports version-based checkpoint selection
  - Handles compatibility between different versions
- Model Synchronization:
  - Provides a catchup mechanism for nodes that are behind
  - Loads aggregated gradients to sync with current state
  - Verifies sync status between nodes
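A sketch of the save/load pattern described above using plain `torch.save`/`torch.load`; the field names and the exact contents of Templar's checkpoint files are assumptions here.

```python
import torch

def save_checkpoint(path, model, optimizer, scheduler, momentum, window, version):
    """Bundle model, optimizer, scheduler, and momentum state together with
    window/version metadata (illustrative field names)."""
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "momentum": momentum,
            "window": window,
            "version": version,
        },
        path,
    )

def load_checkpoint(path, model, optimizer, scheduler):
    """Restore training state from a checkpoint and return its metadata."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    return ckpt["momentum"], ckpt["window"], ckpt["version"]
```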
Configuration Management
The Templar system uses a centralized hyperparameter configuration system that controls various aspects of the distributed training process.
| Parameter Category | Example Parameters | Description |
|---|---|---|
| Model Configuration | `hidden_size`, `num_hidden_layers`, `sequence_length` | Controls the structure and size of the language model |
| Training Parameters | `learning_rate`, `batch_size`, `momentum_decay` | Defines the training process behavior |
| System Configuration | `blocks_per_window`, `checkpoint_frequency`, `topk_compression` | Controls system-wide behavior and timing |
| Peer Management | `topk_peers`, `peer_replacement_frequency`, `minimum_peers` | Configures peer selection behavior |
| Evaluation Metrics | `openskill_beta`, `openskill_tau`, `binary_score_ma_alpha` | Parameters for scoring and rating miners |
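For illustration only, a hyperparameter file using the names from the table above might be loaded as below; the values are placeholders, not Templar's defaults, and the actual file layout may differ.

```python
import json
from types import SimpleNamespace

# Hypothetical hyperparameter file using parameter names from the table above.
HPARAMS_JSON = """
{
  "sequence_length": 2048,
  "learning_rate": 4e-4,
  "blocks_per_window": 100,
  "topk_compression": 32,
  "topk_peers": 20,
  "minimum_peers": 5,
  "openskill_beta": 7.0,
  "openskill_tau": 0.1
}
"""

def load_hparams(raw: str = HPARAMS_JSON) -> SimpleNamespace:
    """Parse shared hyperparameters into attribute-style access,
    e.g. hparams.blocks_per_window."""
    return SimpleNamespace(**json.loads(raw))

hparams = load_hparams()
assert hparams.topk_peers == 20
```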
Error Handling and Recovery
The Templar system implements several mechanisms for error handling and recovery:
- Gradient Gathering Retries: When gathering gradients fails, nodes retry with exponential backoff
- Catchup Mechanism: Nodes that are behind can catch up using aggregated gradients
- Peer Replacement: Inactive or failing peers are automatically replaced
- Checkpoint Recovery: Nodes can restore from checkpoints after failures
- Connection Error Handling: R2 storage connections are retried and recreated on failure
These mechanisms ensure the system’s resilience in the face of network issues, node failures, and other adverse conditions.
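The retry-with-backoff behavior described above generally follows the pattern below; this is an illustrative helper, not Templar's actual implementation.

```python
import random
import time

def retry_with_backoff(op, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry a failing operation with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff: 1s, 2s, 4s, ... plus random jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)

# Example: retry a flaky download (download_gradient is a hypothetical helper)
# result = retry_with_backoff(lambda: download_gradient(uid=5, window=42))
```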
Summary
The Templar system architecture provides a robust framework for decentralized training of large language models. Key architectural features include:
- Decentralized Node Structure: Separate roles for miners, validators, and aggregators
- Efficient Gradient Sharing: DCT-based compression for bandwidth optimization
- Distributed Storage: Cloudflare R2 for reliable data persistence
- Blockchain Integration: Bittensor network for coordination and incentives
- Window-Based Processing: Synchronized training cycles across all nodes
- Checkpoint Management: Model preservation and recovery mechanisms
This architecture enables collaborative training across diverse compute resources while maintaining model consistency and providing appropriate incentives for participation.