System Architecture
This document provides a comprehensive overview of the Templar framework’s architecture, outlining its core components and their interactions. Templar is a decentralized training framework built on the Bittensor network that enables collaborative training of large language models across distributed compute resources.
For information about the incentive mechanisms that drive the distributed training process, see Incentive Design.
Core Components
```mermaid
graph TD
    subgraph Network["Network Layer"]
        BN["Bittensor Network"]
        MG["Metagraph"]
    end
    subgraph Node["Node Layer"]
        M["Miners"]
        V["Validators"]
        A["AggregationServer"]
    end
    subgraph Storage["Storage Layer"]
        R2["Cloudflare R2 Storage"]
        subgraph Buckets["Buckets"]
            GB["Gradients Bucket"]
            DB["Dataset Bucket"]
            AB["Aggregator Bucket"]
            CP["Checkpoint Storage"]
        end
    end
    subgraph Communication["Communication Layer"]
        CM["Comms Module"]
        CH["Chain Manager"]
    end

    M <--> BN
    V <--> BN
    BN <--> MG
    M <--> CM
    V <--> CM
    A <--> CM
    CM <--> CH
    CH <--> BN
    GB <--> CM
    DB <--> CM
    AB <--> CM
    CP <--> CM

    style Network fill:#f9f9f9
    style Node fill:#f9f9f9
    style Storage fill:#f9f9f9
    style Communication fill:#f9f9f9
```
Sources:
- neurons/miner.py:124-184
- neurons/validator.py:173-222
- neurons/aggregator.py:120-132
- src/tplr/comms.py:64-122
The Templar system consists of four main architectural layers:
- Network Layer: Provides the decentralized infrastructure via Bittensor blockchain
- Node Layer: Comprises the different types of nodes that perform specific roles
- Storage Layer: Offers distributed storage for gradients, datasets, and model checkpoints
- Communication Layer: Facilitates data exchange between nodes and storage
Node Types and Responsibilities
```mermaid
graph TD
    subgraph "Miner"
        M_Train["Model Training"]
        M_Grad["Gradient Computation"]
        M_Comp["Gradient Compression"]
        M_Post["Gradient Posting"]
    end
    subgraph "Validator"
        V_Gath["Gradient Gathering"]
        V_Eval["Gradient Evaluation"]
        V_Weight["Weight Setting"]
        V_Sync["Model Synchronization"]
    end
    subgraph "Aggregator"
        A_Gath["Gradient Collection"]
        A_Agg["Gradient Aggregation"]
        A_Post["Aggregation Posting"]
    end

    M_Train --> M_Grad --> M_Comp --> M_Post
    V_Gath --> V_Eval --> V_Weight --> V_Sync
    A_Gath --> A_Agg --> A_Post

    M_Post -.-> V_Gath
    M_Post -.-> A_Gath
    A_Post -.-> V_Sync
```
Miner Neurons
Miners are responsible for training the language model and generating gradients. Their key functions include:
- Loading training data from the dataset bucket
- Performing forward and backward passes through the model
- Compressing gradients using DCT-based compression
- Uploading compressed gradients to the gradient bucket
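The real training loop lives in neurons/miner.py; the snippet below is only a minimal sketch of the sequence listed above. The model, batch, `compress` helper, and `comms` client are stand-ins rather than Templar's actual objects.

```python
import torch

def miner_step(model, batch, optimizer, comms, uid, window, compress):
    """Sketch of one miner iteration: forward/backward pass, gradient
    compression, and upload. `compress` and `comms` are stand-ins for
    Templar's actual helpers."""
    # Forward and backward pass on the current batch
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()

    # Collect and compress per-parameter gradients (e.g. DCT + top-k)
    compressed = {
        name: compress(param.grad)
        for name, param in model.named_parameters()
        if param.grad is not None
    }

    # Upload the compressed gradients for this window, then step locally
    comms.put(state_dict=compressed, uid=uid, window=window, key="gradient")
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```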
Validator Neurons
Validators evaluate the quality of miners’ gradients and set weights on the blockchain. Their key functions include:
- Gathering gradients from miners
- Evaluating gradient quality by measuring loss improvement
- Calculating and setting weights on the blockchain
- Tracking miner performance with OpenSkill ratings
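The snippet below is a rough, illustrative sketch of the evaluation idea: score a gradient by the loss improvement it produces when applied to a copy of the model. The function names, the fixed learning rate, and the Hugging Face-style `outputs.loss` interface are assumptions, not the validator's actual API.

```python
import copy
import torch

def score_gradient(model, eval_batch, gradient, lr=1e-4):
    """Score a miner's gradient by the loss improvement it produces when
    applied to a copy of the validator's model (illustrative only)."""
    def eval_loss(m):
        with torch.no_grad():
            return m(**eval_batch).loss.item()

    loss_before = eval_loss(model)

    # Apply the (decompressed) gradient to a throwaway copy of the model
    candidate = copy.deepcopy(model)
    with torch.no_grad():
        for name, param in candidate.named_parameters():
            if name in gradient:
                param -= lr * gradient[name]

    loss_after = eval_loss(candidate)
    # A positive score means the gradient reduced the evaluation loss
    return loss_before - loss_after
```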
Aggregation Server
The aggregation server collects and combines gradients from miners, providing a consolidated source for model updates. Its key functions include:
- Gathering gradients from miners
- Aggregating gradients based on validator weights
- Storing aggregated gradients for validators and miners to use
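A minimal sketch of the weighted combination step, assuming gradients and validator weights are plain dictionaries keyed by miner UID; the real aggregator in neurons/aggregator.py works on compressed payloads and handles many more edge cases.

```python
import torch

def aggregate(gradients, weights):
    """Combine per-miner gradients into one update, weighted by each
    miner's validator-assigned weight (illustrative sketch).

    gradients: dict mapping uid -> {param_name: tensor}
    weights:   dict mapping uid -> float
    """
    total = sum(weights.get(uid, 0.0) for uid in gradients) or 1.0
    combined = {}
    for uid, grad in gradients.items():
        w = weights.get(uid, 0.0) / total
        for name, tensor in grad.items():
            if name not in combined:
                combined[name] = torch.zeros_like(tensor)
            combined[name] += w * tensor
    return combined
```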
Data Flow and Gradient Processing
```mermaid
flowchart TD
    subgraph "Miner Processing"
        MT["Model Training"] --> GC["Gradient Computation"]
        GC --> WD["Weight Decay"]
        WD --> MM["Momentum Update"]
        MM --> DCTS["DCT Transform"]
        DCTS --> COMP["Top-K Compression"]
        COMP --> UP["Upload to R2"]
    end
    subgraph "Validator Processing"
        DL["Download Gradients"] --> DECOMP["Decompress"]
        DECOMP --> EVAL["Evaluate Improvement"]
        EVAL --> SCORE["Calculate Score"]
        SCORE --> WT["Set Weights"]
    end
    subgraph "Aggregator Processing"
        AG_DL["Download Gradients"] --> AG_WT["Apply Weights"]
        AG_WT --> AG_COMB["Combine Gradients"]
        AG_COMB --> AG_UP["Upload Aggregation"]
    end

    UP --> DL
    UP --> AG_DL
    AG_UP --> EVAL
    WT --> AG_WT
```
Gradient Compression Mechanism
A key innovation in Templar is the DCT-based gradient compression system that enables efficient sharing of gradients across nodes.
- Transform DCT: Converts gradients to frequency domain using Discrete Cosine Transform
- Compression: Selects top-K coefficients for transmission
- Decompression: Reconstructs gradients from compressed representation
```mermaid
flowchart LR
    subgraph "DCT Compression Pipeline"
        direction LR
        GRAD["Original Gradient"] --> ENCODE["DCT Transform\nTransformDCT.encode()"]
        ENCODE --> COMPRESS["Top-K Selection\nCompressDCT.compress()"]
        COMPRESS --> TRANSMIT["Transmit\nidxs + vals"]
    end
    subgraph "DCT Decompression Pipeline"
        direction LR
        RECEIVE["Receive\nidxs + vals"] --> DECOMPRESS["Reconstruction\nCompressDCT.decompress()"]
        DECOMPRESS --> DECODE["Inverse DCT\nTransformDCT.decode()"]
        DECODE --> RECONST["Reconstructed Gradient"]
    end

    TRANSMIT --> RECEIVE
```
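The actual `TransformDCT` and `CompressDCT` classes operate per parameter tensor with chunking; the NumPy sketch below is a simplified, one-dimensional illustration of the same idea: transform, keep the top-K coefficients, transmit only their indices and values, then invert.

```python
import numpy as np
from scipy.fft import dct, idct

def compress_topk(grad: np.ndarray, k: int):
    """Sketch of DCT + top-k compression: transform the flattened gradient,
    keep the k largest-magnitude coefficients, and transmit indices + values."""
    coeffs = dct(grad.ravel(), norm="ortho")
    idxs = np.argsort(np.abs(coeffs))[-k:]   # indices of the top-k coefficients
    vals = coeffs[idxs]                       # their values
    return idxs, vals, grad.shape

def decompress_topk(idxs, vals, shape):
    """Rebuild a dense coefficient vector from (idxs, vals) and invert the DCT."""
    coeffs = np.zeros(int(np.prod(shape)))
    coeffs[idxs] = vals
    return idct(coeffs, norm="ortho").reshape(shape)

# Example: keep 32 of 1024 coefficients
g = np.random.randn(32, 32)
idxs, vals, shape = compress_topk(g, k=32)
g_hat = decompress_topk(idxs, vals, shape)
```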
Communication System
The communication system is the backbone of Templar, enabling data exchange between nodes and storage. It provides a unified interface for all data operations.
```mermaid
classDiagram
    class Comms {
        +wallet: Wallet
        +bucket: Bucket
        +session: Session
        +client_semaphore: Semaphore
        +put(state_dict, uid, window, key)
        +get(uid, window, key)
        +gather(uids, window, key)
        +load_checkpoint(model, optimizer, scheduler)
        +save_checkpoint(model, optimizer, scheduler)
        +post_peer_list(peers, window)
        +get_peer_list()
    }
    class ChainManager {
        +commitments: Dict
        +peers: PeerArray
        +eval_peers: Dict
        +fetch_commitments()
        +get_commitment(uid)
        +try_commit(wallet, bucket)
        +update_peers_with_buckets()
    }
    class Bucket {
        +name: str
        +account_id: str
        +access_key_id: str
        +secret_access_key: str
    }
    Comms --|> ChainManager : inherits
    Comms --> Bucket : uses
```
The `Comms` class provides the following key functions:
- Data Exchange: Methods for storing and retrieving data from R2 storage
- Gradient Gathering: Collects and processes gradients from multiple miners
- Checkpoint Management: Saves and loads model checkpoints
- Peer Management: Handles peer discovery and selection
The `ChainManager` is responsible for:
- Chain Commitments: Manages commitments to the Bittensor blockchain
- Bucket Authentication: Stores and retrieves bucket credentials
- Peer Management: Tracks active peers and updates peer lists
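Based only on the method signatures in the class diagram above, typical use of `Comms` within a window might look like the following sketch. Argument values are illustrative, and treating the methods as coroutines is an assumption of this sketch.

```python
async def run_window(comms, gradient, model, optimizer, scheduler,
                     uid, window, peer_uids):
    """Illustrative use of the Comms methods from the class diagram above."""
    # Upload this node's compressed gradient for the current window
    await comms.put(state_dict=gradient, uid=uid, window=window, key="gradient")

    # Gather gradients posted by selected peers for the same window
    gathered = await comms.gather(uids=peer_uids, window=window, key="gradient")

    # Persist training state so the node can recover or let others catch up
    await comms.save_checkpoint(model, optimizer, scheduler)
    return gathered
```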
Storage System
```mermaid
graph TD
    subgraph "R2 Storage"
        GB["Gradients Bucket"]
        DB["Dataset Bucket"]
        AB["Aggregator Bucket"]
        CP["Checkpoint Storage"]
    end
    subgraph "Gradient Files"
        GRAD_FILE["gradient-{window}-{uid}-v{version}.pt"]
    end
    subgraph "Dataset Files"
        DATA_FILE["page-{number}.parquet"]
    end
    subgraph "Aggregation Files"
        AGG_FILE["aggregation-{window}.pt"]
    end
    subgraph "Checkpoint Files"
        CKP_FILE["checkpoint-{window}-v{version}.pt"]
    end

    GB --> GRAD_FILE
    DB --> DATA_FILE
    AB --> AGG_FILE
    CP --> CKP_FILE
```
Storage in Templar is handled through Cloudflare R2, which provides a scalable and distributed storage solution. The system uses four main buckets:
- Gradients Bucket: Stores gradients uploaded by miners
- Dataset Bucket: Contains training data in Parquet format
- Aggregator Bucket: Stores aggregated gradients from the aggregation server
- Checkpoint Storage: Maintains model checkpoints for recovery and synchronization
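The storage diagram above implies a simple object-key naming scheme. The helpers below merely reproduce those patterns; the example version string is a placeholder.

```python
def gradient_key(window: int, uid: int, version: str) -> str:
    """Object key for a miner's gradient, following the naming pattern
    shown in the storage diagram above."""
    return f"gradient-{window}-{uid}-v{version}.pt"

def aggregation_key(window: int) -> str:
    return f"aggregation-{window}.pt"

def checkpoint_key(window: int, version: str) -> str:
    return f"checkpoint-{window}-v{version}.pt"

# e.g. gradient_key(1042, 17, "1.0.0") -> "gradient-1042-17-v1.0.0.pt"
```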
Integration with Bittensor Network
```mermaid
flowchart TD
    subgraph "Templar System"
        M["Miners"]
        V["Validators"]
        A["Aggregation Server"]
    end
    subgraph "Bittensor Network"
        SUB["Subtensor"]
        MG["Metagraph"]
        WT["Weight Setting"]
        STS["Stakes"]
    end

    M -->|"Register Hotkeys"| SUB
    V -->|"Register Hotkeys"| SUB
    V -->|"Set Weights"| WT
    SUB -->|"Update"| MG
    STS -->|"Influence"| MG
    MG -->|"Determine Rewards"| M
    MG -->|"Determine Influence"| V
```
The Templar system integrates with the Bittensor network through several mechanisms:
- Hotkey Registration: Miners and validators register their hotkeys on the subnet
- Weight Setting: Validators set weights on the blockchain based on gradient quality
- Metagraph Synchronization: Nodes periodically synchronize with the metagraph
- Chain Commitments: Nodes commit bucket information to the chain for discovery
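A minimal sketch of these chain interactions using the `bittensor` Python SDK. The netuid, wallet names, and weight values are placeholders, and the exact calls Templar wraps inside its Chain Manager may differ.

```python
import bittensor as bt

# Placeholder netuid for illustration only
NETUID = 3

def sync_and_set_weights(uids, weights, netuid=NETUID, network="finney"):
    """Resync the metagraph and commit validator weights on-chain (sketch)."""
    wallet = bt.wallet(name="validator", hotkey="default")
    subtensor = bt.subtensor(network=network)

    # Periodically resync the metagraph to see registered miners/validators
    metagraph = subtensor.metagraph(netuid)

    # After evaluating gradients, commit one weight per miner UID on-chain
    subtensor.set_weights(wallet=wallet, netuid=netuid, uids=uids, weights=weights)
    return metagraph

# e.g. sync_and_set_weights(uids=[1, 2, 3], weights=[0.5, 0.3, 0.2])
```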
Window-Based Training Cycle
```mermaid
sequenceDiagram
    participant M as Miner
    participant R2 as R2 Storage
    participant V as Validator
    participant A as Aggregator
    participant BT as Bittensor

    Note over M,BT: Window N
    M->>M: Train model on dataset
    M->>M: Prepare gradients
    M->>R2: Upload compressed gradients
    V->>R2: Gather gradients
    V->>V: Evaluate gradients
    V->>BT: Set weights on chain
    A->>R2: Collect gradients
    A->>A: Aggregate gradients
    A->>R2: Store aggregated gradients

    Note over M,BT: Window N+1
    M->>R2: Load aggregated gradients
    M->>M: Update model parameters
    V->>R2: Load aggregated gradients
    V->>V: Update model parameters
```
The Templar system operates on a window-based cycle, where each window corresponds to a specific number of blockchain blocks. The training process follows these steps:
- Miners:
  - Train models on data for the current window
  - Generate and compress gradients
  - Upload gradients to R2 storage
- Validators:
  - Gather gradients from miners
  - Evaluate gradient quality
  - Set weights on the blockchain
- Aggregator:
  - Collect gradients from miners
  - Aggregate gradients based on weights
  - Store aggregated result for next window
- Next Window:
  - All participants load aggregated gradients
  - Update model parameters
  - Begin next training cycle
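Assuming the window index is simply the block height integer-divided by `blocks_per_window` (see the hyperparameter table under Configuration Management), the mapping can be sketched as follows; the exact formula in the codebase may differ.

```python
def current_window(block_height: int, blocks_per_window: int) -> int:
    """Map a chain block height to a training window index
    (assumed formula: integer division)."""
    return block_height // blocks_per_window

def blocks_until_next_window(block_height: int, blocks_per_window: int) -> int:
    """How many blocks remain before the next window begins."""
    return blocks_per_window - (block_height % blocks_per_window)

# Example: with 100 blocks per window, block 4,237 falls in window 42
assert current_window(4237, 100) == 42
assert blocks_until_next_window(4237, 100) == 63
```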
Peer Selection and Management
```mermaid
flowchart TD
    subgraph "Validator"
        VP["Peer Selection"]
        VR["OpenSkill Ratings"]
        VW["Weight Setting"]
    end
    subgraph "Peer Management"
        PL["Peer List Generation"]
        PR["Peer Repository"]
        PC["Peer Commitment"]
    end
    subgraph "Nodes"
        M["Miners"]
        V["Other Validators"]
    end

    VR -->|"Influence"| VP
    VP -->|"Generate"| PL
    PL -->|"Store"| PR
    M -->|"Register"| PC
    V -->|"Register"| PC
    PC -->|"Inform"| PR
    PR -->|"Provide"| VP
```
Peer selection and management is a critical aspect of the Templar system:
- Peer Selection:
  - Validators select peers based on OpenSkill ratings
  - Ratings are calculated from gradient evaluation results
  - Peer lists are periodically updated to maintain system health
- Peer Commitment:
  - Nodes commit their bucket information to the blockchain
  - Commitments include account ID, access keys, and bucket names
  - Other nodes discover peers by reading these commitments
- Peer Update Mechanism:
  - New peer lists are generated with a future effective window
  - Nodes fetch peer lists and update them when appropriate
  - Inactive peers are pruned and replaced with active ones
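A small sketch of rating-based selection, assuming each miner's OpenSkill rating is summarized by (mu, sigma) and ranked by the conservative ordinal mu - 3*sigma; the validator's real selection logic also accounts for peer activity and replacement frequency.

```python
def select_peers(ratings, topk: int, minimum_peers: int):
    """Rank miners by a conservative OpenSkill estimate and keep the top-k.

    ratings: dict mapping uid -> (mu, sigma)
    """
    ordinal = {uid: mu - 3.0 * sigma for uid, (mu, sigma) in ratings.items()}
    ranked = sorted(ordinal, key=ordinal.get, reverse=True)
    k = max(topk, minimum_peers)
    return ranked[:k]

# Example: three miners with different rating estimates
peers = select_peers(
    {1: (27.0, 2.0), 2: (25.0, 8.0), 3: (30.0, 1.0)},
    topk=2,
    minimum_peers=1,
)
# -> [3, 1]: uid 2 is penalized for its high rating uncertainty
```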
Checkpoint Management
```mermaid
flowchart TB
    subgraph "Checkpoint Creation"
        ST["Model State"]
        OPT["Optimizer State"]
        SCH["Scheduler State"]
        MOM["Momentum"]
        META["Metadata"]
    end
    subgraph "Checkpoint Management"
        S["Save Checkpoint"]
        L["Load Checkpoint"]
        V["Version Control"]
    end
    subgraph "Model Synchronization"
        CU["Catchup Process"]
        AG["Aggregation Loading"]
        SY["Sync Verification"]
    end

    ST --> S
    OPT --> S
    SCH --> S
    MOM --> S
    META --> S
    S --> V
    V --> L
    L --> CU
    AG --> CU
    CU --> SY
```
The checkpoint management system in Templar ensures model consistency and enables recovery:
- Checkpoint Creation:
  - Saves model state, optimizer state, scheduler state, and momentum
  - Includes metadata such as the current window and version information
  - Checkpoints are created periodically based on configuration
- Checkpoint Loading:
  - Loads model state from checkpoints during initialization
  - Supports version-based checkpoint selection
  - Handles compatibility between different versions
- Model Synchronization:
  - Provides a catchup mechanism for nodes that are behind
  - Loads aggregated gradients to sync with current state
  - Verifies sync status between nodes
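A sketch of the save/load pattern described above using plain `torch.save`/`torch.load`; the field names and the exact contents of Templar's checkpoint files are assumptions here.

```python
import torch

def save_checkpoint(path, model, optimizer, scheduler, momentum, window, version):
    """Bundle model, optimizer, scheduler, and momentum state together with
    window/version metadata (illustrative field names)."""
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "momentum": momentum,
            "window": window,
            "version": version,
        },
        path,
    )

def load_checkpoint(path, model, optimizer, scheduler):
    """Restore training state from a checkpoint and return its metadata."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    return ckpt["momentum"], ckpt["window"], ckpt["version"]
```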
Configuration Management
The Templar system uses a centralized hyperparameter configuration system that controls various aspects of the distributed training process.
| Parameter Category | Example Parameters | Description |
|---|---|---|
| Model Configuration | `hidden_size`, `num_hidden_layers`, `sequence_length` | Controls the structure and size of the language model |
| Training Parameters | `learning_rate`, `batch_size`, `momentum_decay` | Defines the training process behavior |
| System Configuration | `blocks_per_window`, `checkpoint_frequency`, `topk_compression` | Controls system-wide behavior and timing |
| Peer Management | `topk_peers`, `peer_replacement_frequency`, `minimum_peers` | Configures peer selection behavior |
| Evaluation Metrics | `openskill_beta`, `openskill_tau`, `binary_score_ma_alpha` | Parameters for scoring and rating miners |
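For illustration only, a hyperparameter file using the names from the table above might be loaded as below; the values are placeholders, not Templar's defaults, and the actual file layout may differ.

```python
import json
from types import SimpleNamespace

# Hypothetical hyperparameter file using parameter names from the table above.
HPARAMS_JSON = """
{
  "sequence_length": 2048,
  "learning_rate": 4e-4,
  "blocks_per_window": 100,
  "topk_compression": 32,
  "topk_peers": 20,
  "minimum_peers": 5,
  "openskill_beta": 7.0,
  "openskill_tau": 0.1
}
"""

def load_hparams(raw: str = HPARAMS_JSON) -> SimpleNamespace:
    """Parse shared hyperparameters into attribute-style access,
    e.g. hparams.blocks_per_window."""
    return SimpleNamespace(**json.loads(raw))

hparams = load_hparams()
assert hparams.topk_peers == 20
```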
Error Handling and Recovery
The Templar system implements several mechanisms for error handling and recovery:
- Gradient Gathering Retries: When gathering gradients fails, nodes retry with exponential backoff
- Catchup Mechanism: Nodes that are behind can catch up using aggregated gradients
- Peer Replacement: Inactive or failing peers are automatically replaced
- Checkpoint Recovery: Nodes can restore from checkpoints after failures
- Connection Error Handling: R2 storage connections are retried and recreated on failure
These mechanisms ensure the system’s resilience in the face of network issues, node failures, and other adverse conditions.
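The retry-with-backoff behavior described above generally follows the pattern below; this is an illustrative helper, not Templar's actual implementation.

```python
import random
import time

def retry_with_backoff(op, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry a failing operation with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff: 1s, 2s, 4s, ... plus random jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)

# Example: retry a flaky download (download_gradient is a hypothetical helper)
# result = retry_with_backoff(lambda: download_gradient(uid=5, window=42))
```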
Summary
The Templar system architecture provides a robust framework for decentralized training of large language models. Key architectural features include:
- Decentralized Node Structure: Separate roles for miners, validators, and aggregators
- Efficient Gradient Sharing: DCT-based compression for bandwidth optimization
- Distributed Storage: Cloudflare R2 for reliable data persistence
- Blockchain Integration: Bittensor network for coordination and incentives
- Window-Based Processing: Synchronized training cycles across all nodes
- Checkpoint Management: Model preservation and recovery mechanisms
This architecture enables collaborative training across diverse compute resources while maintaining model consistency and providing appropriate incentives for participation.