System Architecture

This document provides a comprehensive overview of the Templar framework’s architecture, outlining its core components and their interactions. Templar is a decentralized training framework built on the Bittensor network that enables collaborative training of large language models across distributed compute resources.

For information about the incentive mechanisms that drive the distributed training process, see Incentive Design.

graph TD
    subgraph "Network Layer"
        BN["Bittensor Network"]
        MG["Metagraph"]
    end

    subgraph "Node Layer"
        M["Miners"]
        V["Validators"]
        A["AggregationServer"]
    end

    subgraph "Storage Layer"
        R2["Cloudflare R2 Storage"]
        subgraph "Buckets"
            GB["Gradients Bucket"]
            DB["Dataset Bucket"]
            AB["Aggregator Bucket"]
            CP["Checkpoint Storage"]
        end
    end

    subgraph "Communication Layer"
        CM["Comms Module"]
        CH["Chain Manager"]
    end

    M <--> BN
    V <--> BN
    BN <--> MG

    M <--> CM
    V <--> CM
    A <--> CM
    
    CM <--> CH
    CH <--> BN

    GB <--> CM
    DB <--> CM
    AB <--> CM
    CP <--> CM

    style Network fill:#f9f9f9
    style Node fill:#f9f9f9
    style Storage fill:#f9f9f9
    style Communication fill:#f9f9f9

The Templar system consists of four main architectural layers:

  1. Network Layer: Provides the decentralized infrastructure via Bittensor blockchain
  2. Node Layer: Comprises the different types of nodes that perform specific roles
  3. Storage Layer: Offers distributed storage for gradients, datasets, and model checkpoints
  4. Communication Layer: Facilitates data exchange between nodes and storage
graph TD
    subgraph "Miner"
        M_Train["Model Training"]
        M_Grad["Gradient Computation"]
        M_Comp["Gradient Compression"]
        M_Post["Gradient Posting"]
    end

    subgraph "Validator"
        V_Gath["Gradient Gathering"]
        V_Eval["Gradient Evaluation"]
        V_Weight["Weight Setting"]
        V_Sync["Model Synchronization"]
    end

    subgraph "Aggregator"
        A_Gath["Gradient Collection"]
        A_Agg["Gradient Aggregation"]
        A_Post["Aggregation Posting"]
    end

    M_Train --> M_Grad --> M_Comp --> M_Post
    V_Gath --> V_Eval --> V_Weight --> V_Sync
    A_Gath --> A_Agg --> A_Post

    M_Post -.-> V_Gath
    M_Post -.-> A_Gath
    A_Post -.-> V_Sync

Miners are responsible for training the language model and generating gradients. Their key functions include:

  1. Loading training data from the dataset bucket
  2. Performing forward and backward passes through the model
  3. Compressing gradients using DCT-based compression
  4. Uploading compressed gradients to the gradient bucket
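
The steps above map onto a fairly standard PyTorch training step followed by an upload through the Comms interface described later in this document. The sketch below is illustrative only: it assumes a Hugging Face-style model whose forward pass returns a loss, posts the raw gradient state dict where the real miner would post a DCT-compressed payload, and treats the key value "gradient" as a placeholder.

```python
def miner_step(model, batch, comms, uid, window):
    """Illustrative miner step (not the actual Templar code): compute a
    gradient on one batch and post it for the current window."""
    model.zero_grad()
    loss = model(**batch).loss        # forward pass on data from the dataset bucket
    loss.backward()                   # backward pass yields per-parameter gradients

    # Collect the raw gradients; the real miner compresses these first
    # (see the DCT-based compression pipeline below).
    grads = {
        name: p.grad.detach().clone()
        for name, p in model.named_parameters()
        if p.grad is not None
    }

    # Post the payload to the gradients bucket through the Comms interface
    # (method signature taken from the class diagram later in this document).
    comms.put(state_dict=grads, uid=uid, window=window, key="gradient")
    return loss.item()
```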

Validators evaluate the quality of miners’ gradients and set weights on the blockchain. Their key functions include:

  1. Gathering gradients from miners
  2. Evaluating gradient quality by measuring loss improvement
  3. Calculating and setting weights on the blockchain
  4. Tracking miner performance with OpenSkill ratings
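
The loss-improvement check in step 2 can be pictured with a minimal sketch: apply a peer's gradient to a throwaway copy of the model using a fixed step size and measure how much the loss on a held-out batch drops. The function below is illustrative and again assumes a model whose forward pass returns a loss; it is not Templar's actual evaluator.

```python
import copy

import torch

def score_gradient(model, batch, peer_gradient, lr=1e-4):
    """Illustrative scoring: a positive score means the peer's gradient
    reduced the loss on the evaluation batch."""
    loss_before = model(**batch).loss.item()

    # Apply the peer's gradient to a disposable copy of the model.
    trial = copy.deepcopy(model)
    with torch.no_grad():
        for name, param in trial.named_parameters():
            if name in peer_gradient:
                param -= lr * peer_gradient[name]

    loss_after = trial(**batch).loss.item()
    return loss_before - loss_after  # loss improvement attributed to this peer
```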

The aggregation server collects and combines gradients from miners, providing a consolidated source for model updates. Its key functions include:

  1. Gathering gradients from miners
  2. Aggregating gradients based on validator weights
  3. Storing aggregated gradients for validators and miners to use
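
The core of the aggregation step is a weighted combination of per-miner gradients, as sketched below. The data layout and normalization are illustrative assumptions, not the aggregation server's actual implementation.

```python
import torch

def aggregate(gradients, weights):
    """Illustrative weighted aggregation.

    gradients: dict mapping miner uid -> {param_name: tensor}
    weights:   dict mapping miner uid -> validator-assigned weight
    """
    total = sum(weights.get(uid, 0.0) for uid in gradients) or 1.0
    combined = {}
    for uid, grad in gradients.items():
        w = weights.get(uid, 0.0) / total  # normalize so the weights sum to 1
        for name, tensor in grad.items():
            if name not in combined:
                combined[name] = torch.zeros_like(tensor)
            combined[name] += w * tensor
    return combined
```
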
flowchart TD
    subgraph "Miner Processing"
        MT["Model Training"] --> GC["Gradient Computation"]
        GC --> WD["Weight Decay"]
        WD --> MM["Momentum Update"]
        MM --> DCTS["DCT Transform"]
        DCTS --> COMP["Top-K Compression"]
        COMP --> UP["Upload to R2"]
    end

    subgraph "Validator Processing"
        DL["Download Gradients"] --> DECOMP["Decompress"]
        DECOMP --> EVAL["Evaluate Improvement"]
        EVAL --> SCORE["Calculate Score"]
        SCORE --> WT["Set Weights"]
    end

    subgraph "Aggregator Processing"
        AG_DL["Download Gradients"] --> AG_WT["Apply Weights"]
        AG_WT --> AG_COMB["Combine Gradients"]
        AG_COMB --> AG_UP["Upload Aggregation"]
    end

    UP --> DL
    UP --> AG_DL
    AG_UP --> EVAL
    WT --> AG_WT

A key innovation in Templar is the DCT-based gradient compression system that enables efficient sharing of gradients across nodes.

  1. Transform DCT: Converts gradients to the frequency domain using the Discrete Cosine Transform
  2. Compression: Selects the top-K coefficients for transmission
  3. Decompression: Reconstructs gradients from the compressed representation
flowchart LR
    subgraph "DCT Compression Pipeline"
        direction LR
        GRAD["Original Gradient"] --> ENCODE["DCT Transform\nTransformDCT.encode()"]
        ENCODE --> COMPRESS["Top-K Selection\nCompressDCT.compress()"]
        COMPRESS --> TRANSMIT["Transmit\nidxs + vals"]
    end

    subgraph "DCT Decompression Pipeline"
        direction LR
        RECEIVE["Receive\nidxs + vals"] --> DECOMPRESS["Reconstruction\nCompressDCT.decompress()"]
        DECOMPRESS --> DECODE["Inverse DCT\nTransformDCT.decode()"]
        DECODE --> RECONST["Reconstructed Gradient"]
    end

    TRANSMIT --> RECEIVE
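
The diagram names the TransformDCT and CompressDCT classes, whose implementations live in the Templar codebase. The sketch below only illustrates the underlying idea using SciPy's DCT on a flattened array plus top-K selection by magnitude; it is not the framework's actual transform.

```python
import numpy as np
from scipy.fft import dct, idct

def compress(grad, k):
    """Stand-in for TransformDCT.encode() + CompressDCT.compress(): move the
    gradient into the DCT (frequency) domain and keep the top-k coefficients."""
    coeffs = dct(grad.ravel(), norm="ortho")   # DCT transform
    idxs = np.argsort(np.abs(coeffs))[-k:]     # top-k selection by magnitude
    vals = coeffs[idxs]
    return idxs, vals, grad.shape, coeffs.size

def decompress(idxs, vals, shape, n):
    """Stand-in for CompressDCT.decompress() + TransformDCT.decode(): scatter
    the kept coefficients back and apply the inverse DCT."""
    coeffs = np.zeros(n)
    coeffs[idxs] = vals
    return idct(coeffs, norm="ortho").reshape(shape)

# Example: keep roughly 1% of the coefficients of a random "gradient".
g = np.random.randn(64, 64)
idxs, vals, shape, n = compress(g, k=41)
g_hat = decompress(idxs, vals, shape, n)
```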

The communication system is the backbone of Templar, enabling data exchange between nodes and storage. It provides a unified interface for all data operations.

classDiagram
    class Comms {
        +wallet: Wallet
        +bucket: Bucket
        +session: Session
        +client_semaphore: Semaphore
        +put(state_dict, uid, window, key)
        +get(uid, window, key)
        +gather(uids, window, key)
        +load_checkpoint(model, optimizer, scheduler)
        +save_checkpoint(model, optimizer, scheduler)
        +post_peer_list(peers, window)
        +get_peer_list()
    }

    class ChainManager {
        +commitments: Dict
        +peers: PeerArray
        +eval_peers: Dict
        +fetch_commitments()
        +get_commitment(uid)
        +try_commit(wallet, bucket)
        +update_peers_with_buckets()
    }

    class Bucket {
        +name: str
        +account_id: str
        +access_key_id: str
        +secret_access_key: str
    }

    Comms --|> ChainManager : inherits
    Comms --> Bucket : uses

The Comms class provides the following key functions:

  1. Data Exchange: Methods for storing and retrieving data from R2 storage
  2. Gradient Gathering: Collects and processes gradients from multiple miners
  3. Checkpoint Management: Saves and loads model checkpoints
  4. Peer Management: Handles peer discovery and selection

The ChainManager is responsible for:

  1. Chain Commitments: Manages commitments to the Bittensor blockchain
  2. Bucket Authentication: Stores and retrieves bucket credentials
  3. Peer Management: Tracks active peers and updates peer lists
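
Taken together, the method signatures in the class diagram suggest the usage pattern sketched below. This is illustrative only: in the real codebase these calls are likely asynchronous and include retry and error handling omitted here, and the key value "gradient" is a placeholder.

```python
def training_round(comms, model, optimizer, scheduler,
                   my_uid, peer_uids, window, compressed_gradient):
    """Walk through the Comms interface from the class diagram (sketch only)."""
    # Publish this node's compressed gradient for the current window.
    comms.put(state_dict=compressed_gradient, uid=my_uid, window=window, key="gradient")

    # Collect gradients posted by the selected peers for the same window.
    gathered = comms.gather(uids=peer_uids, window=window, key="gradient")

    # Persist training state so the node can recover after a failure.
    comms.save_checkpoint(model, optimizer, scheduler)

    # Publish and read the peer list used for subsequent gathering.
    comms.post_peer_list(peers=peer_uids, window=window)
    active_peers = comms.get_peer_list()
    return gathered, active_peers
```
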
graph TD
    subgraph "R2 Storage"
        GB["Gradients Bucket"]
        DB["Dataset Bucket"]
        AB["Aggregator Bucket"]
        CP["Checkpoint Storage"]
    end

    subgraph "Gradient Files"
        GRAD_FILE["gradient-{window}-{uid}-v{version}.pt"]
    end

    subgraph "Dataset Files"
        DATA_FILE["page-{number}.parquet"]
    end

    subgraph "Aggregation Files"
        AGG_FILE["aggregation-{window}.pt"]
    end

    subgraph "Checkpoint Files"
        CKP_FILE["checkpoint-{window}-v{version}.pt"]
    end

    GB --> GRAD_FILE
    DB --> DATA_FILE
    AB --> AGG_FILE
    CP --> CKP_FILE

Storage in Templar is handled through Cloudflare R2, which provides a scalable and distributed storage solution. The system uses four main buckets:

  1. Gradients Bucket: Stores gradients uploaded by miners
  2. Dataset Bucket: Contains training data in Parquet format
  3. Aggregator Bucket: Stores aggregated gradients from the aggregation server
  4. Checkpoint Storage: Maintains model checkpoints for recovery and synchronization
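
The file-name patterns shown in the diagram can be expressed as simple key builders; the version string in the example below is made up.

```python
def gradient_key(window: int, uid: int, version: str) -> str:
    """Object name in the gradients bucket (pattern from the diagram above)."""
    return f"gradient-{window}-{uid}-v{version}.pt"

def aggregation_key(window: int) -> str:
    """Object name in the aggregator bucket."""
    return f"aggregation-{window}.pt"

def checkpoint_key(window: int, version: str) -> str:
    """Object name for checkpoints."""
    return f"checkpoint-{window}-v{version}.pt"

print(gradient_key(window=1234, uid=42, version="0.2.0"))
# -> gradient-1234-42-v0.2.0.pt
```
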
flowchart TD
    subgraph "Templar System"
        M["Miners"]
        V["Validators"]
        A["Aggregation Server"]
    end

    subgraph "Bittensor Network"
        SUB["Subtensor"]
        MG["Metagraph"]
        WT["Weight Setting"]
        STS["Stakes"]
    end

    M -->|"Register Hotkeys"| SUB
    V -->|"Register Hotkeys"| SUB
    V -->|"Set Weights"| WT
    SUB -->|"Update"| MG
    STS -->|"Influence"| MG
    MG -->|"Determine Rewards"| M
    MG -->|"Determine Influence"| V

The Templar system integrates with the Bittensor network through several mechanisms:

  1. Hotkey Registration: Miners and validators register their hotkeys on the subnet
  2. Weight Setting: Validators set weights on the blockchain based on gradient quality
  3. Metagraph Synchronization: Nodes periodically synchronize with the metagraph
  4. Chain Commitments: Nodes commit bucket information to the chain for discovery
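
A minimal sketch of the validator-side chain interaction using the standard bittensor Python API follows. The subnet ID, wallet names, and the uniform weights are placeholders; Templar's actual weights come from the gradient-quality scores described earlier.

```python
import bittensor as bt

NETUID = 3  # placeholder subnet id

wallet = bt.wallet(name="validator", hotkey="default")
subtensor = bt.subtensor(network="finney")

# Synchronize with the metagraph to see the registered miners and validators.
metagraph = subtensor.metagraph(NETUID)
uids = metagraph.uids.tolist()

# Set weights on chain (uniform placeholder values instead of real scores).
scores = [1.0 / len(uids)] * len(uids)
subtensor.set_weights(
    wallet=wallet,
    netuid=NETUID,
    uids=uids,
    weights=scores,
)
```
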
sequenceDiagram
    participant M as Miner
    participant R2 as R2 Storage
    participant V as Validator
    participant A as Aggregator
    participant BT as Bittensor

    Note over M,BT: Window N
    
    M->>M: Train model on dataset
    M->>M: Prepare gradients
    M->>R2: Upload compressed gradients
    
    V->>R2: Gather gradients
    V->>V: Evaluate gradients
    V->>BT: Set weights on chain
    
    A->>R2: Collect gradients
    A->>A: Aggregate gradients
    A->>R2: Store aggregated gradients
    
    Note over M,BT: Window N+1
    
    M->>R2: Load aggregated gradients
    M->>M: Update model parameters
    V->>R2: Load aggregated gradients
    V->>V: Update model parameters

The Templar system operates on a window-based cycle, where each window corresponds to a specific number of blockchain blocks. The training process follows these steps:

  1. Miners:

    • Train models on data for the current window
    • Generate and compress gradients
    • Upload gradients to R2 storage
  2. Validators:

    • Gather gradients from miners
    • Evaluate gradient quality
    • Set weights on the blockchain
  3. Aggregator:

    • Collect gradients from miners
    • Aggregate gradients based on weights
    • Store aggregated result for next window
  4. Next Window:

    • All participants load aggregated gradients
    • Update model parameters
    • Begin next training cycle
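
Because each window spans blocks_per_window blocks (a hyperparameter), mapping a chain block to its window is just integer division, as in the small sketch below; the value 100 is hypothetical.

```python
def current_window(block_number: int, blocks_per_window: int) -> int:
    """Map a chain block number to a training window (illustrative)."""
    return block_number // blocks_per_window

# With a hypothetical blocks_per_window of 100, block 4237 falls in window 42.
assert current_window(4237, 100) == 42
```
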
flowchart TD
    subgraph "Validator"
        VP["Peer Selection"]
        VR["OpenSkill Ratings"]
        VW["Weight Setting"]
    end

    subgraph "Peer Management"
        PL["Peer List Generation"]
        PR["Peer Repository"]
        PC["Peer Commitment"]
    end

    subgraph "Nodes"
        M["Miners"]
        V["Other Validators"]
    end

    VR -->|"Influence"| VP
    VP -->|"Generate"| PL
    PL -->|"Store"| PR
    M -->|"Register"| PC
    V -->|"Register"| PC
    PC -->|"Inform"| PR
    PR -->|"Provide"| VP

Peer selection and management are critical aspects of the Templar system:

  1. Peer Selection:

    • Validators select peers based on OpenSkill ratings
    • Ratings are calculated from gradient evaluation results
    • Peer lists are periodically updated to maintain system health
  2. Peer Commitment:

    • Nodes commit their bucket information to the blockchain
    • Commitments include account ID, access keys, and bucket names
    • Other nodes discover peers by reading these commitments
  3. Peer Update Mechanism:

    • New peer lists are generated with a future effective window
    • Nodes fetch peer lists and update them when appropriate
    • Inactive peers are pruned and replaced with active ones
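
One way to picture rating-driven selection: rank miners by a conservative OpenSkill estimate (commonly mu - 3*sigma) and keep the top performers. The sketch below is illustrative, not Templar's exact rule; topk and minimum stand in for the topk_peers and minimum_peers hyperparameters.

```python
def select_peers(ratings, topk, minimum=3):
    """Rank miners by a conservative skill estimate and keep the top-k,
    never fewer than `minimum` (illustrative).

    ratings: dict mapping uid -> (mu, sigma)
    """
    ordinal = {uid: mu - 3.0 * sigma for uid, (mu, sigma) in ratings.items()}
    ranked = sorted(ordinal, key=ordinal.get, reverse=True)
    return ranked[: max(topk, minimum)]

# Hypothetical ratings: uid -> (mu, sigma)
ratings = {1: (27.0, 2.1), 2: (25.0, 8.0), 3: (29.5, 1.0), 4: (20.0, 3.0)}
print(select_peers(ratings, topk=2, minimum=1))  # -> [3, 1]
```
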
flowchart TB
    subgraph "Checkpoint Creation"
        ST["Model State"]
        OPT["Optimizer State"]
        SCH["Scheduler State"]
        MOM["Momentum"]
        META["Metadata"]
    end

    subgraph "Checkpoint Management"
        S["Save Checkpoint"]
        L["Load Checkpoint"]
        V["Version Control"]
    end

    subgraph "Model Synchronization"
        CU["Catchup Process"]
        AG["Aggregation Loading"]
        SY["Sync Verification"]
    end

    ST --> S
    OPT --> S
    SCH --> S
    MOM --> S
    META --> S
    
    S --> V
    V --> L
    
    L --> CU
    AG --> CU
    CU --> SY

The checkpoint management system in Templar ensures model consistency and enables recovery:

  1. Checkpoint Creation:

    • Saves model state, optimizer state, scheduler state, and momentum
    • Includes metadata like current window and version information
    • Checkpoints are created periodically based on configuration
  2. Checkpoint Loading:

    • Loads model state from checkpoints during initialization
    • Supports version-based checkpoint selection
    • Handles compatibility between different versions
  3. Model Synchronization:

    • Provides a catchup mechanism for nodes that are behind
    • Loads aggregated gradients to sync with current state
    • Verifies sync status between nodes
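
A checkpoint that bundles the states listed above might be saved and restored as sketched below; the dictionary layout and metadata fields are illustrative, not Templar's exact checkpoint format.

```python
import torch

def save_checkpoint(path, model, optimizer, scheduler, momentum, window, version):
    """Persist training state plus metadata (illustrative layout)."""
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "momentum": momentum,
            "window": window,
            "version": version,
        },
        path,
    )

def load_checkpoint(path, model, optimizer, scheduler):
    """Restore training state and return the saved metadata."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    return ckpt["momentum"], ckpt["window"], ckpt["version"]
```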

Templar uses a centralized hyperparameter configuration that controls various aspects of the distributed training process.

| Parameter Category | Example Parameters | Description |
| --- | --- | --- |
| Model Configuration | hidden_size, num_hidden_layers, sequence_length | Controls the structure and size of the language model |
| Training Parameters | learning_rate, batch_size, momentum_decay | Defines the training process behavior |
| System Configuration | blocks_per_window, checkpoint_frequency, topk_compression | Controls system-wide behavior and timing |
| Peer Management | topk_peers, peer_replacement_frequency, minimum_peers | Configures peer selection behavior |
| Evaluation Metrics | openskill_beta, openskill_tau, binary_score_ma_alpha | Parameters for scoring and rating miners |
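
If the configuration lives in a JSON file, nodes could load it into a simple namespace as below. The file name and the presence of these exact keys are assumptions based on the table above, not the framework's confirmed layout.

```python
import json
from types import SimpleNamespace

# Load the centralized hyperparameter file (name assumed for illustration).
with open("hparams.json") as f:
    hparams = SimpleNamespace(**json.load(f))

print(hparams.blocks_per_window, hparams.topk_compression, hparams.topk_peers)
```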

The Templar system implements several mechanisms for error handling and recovery:

  1. Gradient Gathering Retries: When gathering gradients fails, nodes retry with exponential backoff
  2. Catchup Mechanism: Nodes that are behind can catch up using aggregated gradients
  3. Peer Replacement: Inactive or failing peers are automatically replaced
  4. Checkpoint Recovery: Nodes can restore from checkpoints after failures
  5. Connection Error Handling: R2 storage connections are retried and recreated on failure

These mechanisms ensure the system’s resilience in the face of network issues, node failures, and other adverse conditions.
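
A generic sketch of the retry-with-exponential-backoff pattern used for gradient gathering and R2 requests is shown below; the real retry logic in the Templar codebase differs in detail.

```python
import random
import time

def with_retries(fetch, max_attempts=5, base_delay=1.0):
    """Call `fetch`, retrying with exponential backoff and jitter on failure."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)  # back off before retrying / recreating the connection
```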

The Templar system architecture provides a robust framework for decentralized training of large language models. Key architectural features include:

  1. Decentralized Node Structure: Separate roles for miners, validators, and aggregators
  2. Efficient Gradient Sharing: DCT-based compression for bandwidth optimization
  3. Distributed Storage: Cloudflare R2 for reliable data persistence
  4. Blockchain Integration: Bittensor network for coordination and incentives
  5. Window-Based Processing: Synchronized training cycles across all nodes
  6. Checkpoint Management: Model preservation and recovery mechanisms

This architecture enables collaborative training across diverse compute resources while maintaining model consistency and providing appropriate incentives for participation.