# Overview
This page introduces Templar, a decentralized training framework for large language models that uses the Bittensor network to coordinate distributed training across heterogeneous, internet-connected compute resources.
Sources: pyproject.toml:5-9 , README.md:38-47
## What is Templar?

Templar is a system for incentivized distributed training of large language models. It connects diverse computational nodes through a carefully designed incentive mechanism, enabling collaborative training while ensuring honest participation and quality contributions. The framework implements a peer-to-peer architecture in which participants contribute computational resources to train a shared model and are rewarded in proportion to the quality of their contributions.
Sources: README.md:38-47 , README.md:50-57
## System Architecture

### High-Level Architecture

```mermaid
graph TD
    subgraph "Bittensor Network"
        BT["Bittensor Blockchain"]
    end
    subgraph "Participant Nodes"
        MN["Miner<br/>(neurons/miner.py)"]
        VL["Validator<br/>(neurons/validator.py)"]
        AG["Aggregator<br/>(neurons/aggregator.py)"]
    end
    subgraph "Storage Layer"
        R2["Cloudflare R2 Storage"]
        subgraph "Buckets"
            GB["Gradients Bucket"]
            DB["Dataset Bucket"]
            AB["Aggregator Bucket"]
        end
    end
    subgraph "Monitoring"
        WB["Weights & Biases"]
        IF["InfluxDB"]
        LK["Loki Logging"]
    end
    MN <-->|"Set/get weights<br/>tplr.comms.ChainManager"| BT
    VL <-->|"Set/get weights<br/>tplr.comms.ChainManager"| BT
    MN -->|"Upload gradients<br/>comms.put()"| GB
    DB -->|"Get datasets<br/>R2DatasetLoader"| MN
    GB -->|"Gather peer gradients<br/>comms.gather()"| MN
    VL -->|"Upload evaluations<br/>comms.put()"| GB
    GB -->|"Evaluate miner gradients<br/>comms.gather()"| VL
    GB -->|"Gather & process gradients<br/>comms.gather()"| AG
    AG -->|"Store aggregated state<br/>comms.save_checkpoint()"| AB
    MN -.->|"Log metrics<br/>tplr.metrics.MetricsLogger"| WB
    VL -.->|"Log metrics<br/>tplr.metrics.MetricsLogger"| WB
    MN -.->|"Log metrics<br/>tplr.metrics.MetricsLogger"| IF
    VL -.->|"Log metrics<br/>tplr.metrics.MetricsLogger"| IF
    MN -.->|"Log events<br/>tplr.logger"| LK
    VL -.->|"Log events<br/>tplr.logger"| LK
```
Sources: neurons/miner.py:65-755 , neurons/validator.py:85-775 , src/tplr/comms.py:64-221 , src/tplr/comms.py:383-414
### Gradient Exchange Workflow

```mermaid
sequenceDiagram
    participant M as "Miner<br/>(neurons/miner.py)"
    participant R2 as "R2 Storage<br/>(tplr.comms.Comms)"
    participant V as "Validator<br/>(neurons/validator.py)"
    participant BT as "Bittensor<br/>(Chain)"
    Note over M,BT: Window N begins (current_window = N)
    M->>M: window_step = 0
    M->>R2: Load dataset pages for window (R2DatasetLoader.next_pages)
    M->>M: Train model on dataset batches
    M->>M: Accumulate gradients (model.backward())
    M->>M: Compress gradients (tplr.prepare_gradient_dict)
    M->>R2: Upload compressed gradients (comms.put)
    V->>R2: Gather miner gradients (comms.gather)
    V->>V: Evaluate gradient quality (evaluate_model_on_batches)
    V->>V: Calculate scores (loss_before - loss_after)
    V->>V: Update OpenSkill ratings (update_openskill_ratings)
    V->>BT: Set weights on blockchain
    Note over M,BT: Window N+1 begins (current_window = N+1)
    M->>R2: Gather peer gradients (comms.gather)
    M->>M: Decompress and apply gradients
    M->>M: Update model (optimizer.step)
    M->>M: global_step += 1
```
Sources: neurons/miner.py:225-255 , neurons/validator.py:490-514 , neurons/validator.py:516-775
## Core Components

### Miners

Miners train the model on assigned data subsets and share their gradients with peers. They:
- Load a subset of the dataset based on the current window and their UID
- Perform forward and backward passes to compute gradients
- Compress gradients using DCT transform and top-k selection
- Upload compressed gradients to R2 storage
- Gather and apply peer gradients to update their model
- Progress to the next window
The `Miner` class in `neurons/miner.py` implements this functionality, with its main loop in the asynchronous `run()` method.
Sources: neurons/miner.py:65-229 , README.md:64-117
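The per-window loop above can be sketched with a toy in-memory version. A plain dict stands in for the R2 gradients bucket, and every helper here is illustrative: the real miner works on torch tensors, compresses them with DCT/top-k, and moves them through `comms.put()`/`comms.gather()`.

```python
def train_step(weights, target):
    # Fake "loss" = sum((w - t)^2); return its gradient w.r.t. weights.
    return [2 * (w - t) for w, t in zip(weights, target)]

def top_k_compress(grad, k):
    # Keep only the k largest-magnitude entries as (index, value) pairs.
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    return [(i, grad[i]) for i in order[:k]]

def decompress(pairs, size):
    out = [0.0] * size
    for i, v in pairs:
        out[i] = v
    return out

def run_window(weights, target, storage, window, uid, lr=0.1, k=2):
    grad = train_step(weights, target)                   # forward/backward pass
    # Upload this miner's compressed gradient (stands in for comms.put()).
    storage[f"gradient-{window}-{uid}.pt"] = top_k_compress(grad, k)
    # Gather every peer gradient uploaded for this window (comms.gather()).
    peers = [v for key, v in storage.items()
             if key.startswith(f"gradient-{window}-")]
    dense = [decompress(p, len(weights)) for p in peers]
    avg = [sum(col) / len(dense) for col in zip(*dense)]
    return [w - lr * g for w, g in zip(weights, avg)]    # optimizer.step()

storage = {}
weights = [1.0, -2.0, 0.5, 3.0]
weights = run_window(weights, [0.0] * 4, storage, window=0, uid=7)
```

With `k=2`, only the two largest gradient entries are applied, so the other coordinates of the model stay unchanged for that window.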
### Validators

Validators evaluate miners’ gradient contributions and set weights on the blockchain. They:
- Gather gradients submitted by miners
- Evaluate each miner’s contribution by measuring loss improvement
- Calculate scores based on the performance improvement
- Update OpenSkill ratings for miners
- Set weights on the blockchain to influence reward distribution
The `Validator` class in `neurons/validator.py` implements this functionality, with evaluation logic in `evaluate_model_on_batches()` and weight setting in `update_weights()`.
Sources: neurons/validator.py:85-144 , neurons/validator.py:356-437 , README.md:140-184
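The scoring rule above (score = loss_before − loss_after) can be illustrated with a toy model; the function names below are hypothetical stand-ins for the real evaluation logic, which runs batches through the actual model.

```python
def loss(weights, target):
    # Toy evaluation loss: squared distance from the target parameters.
    return sum((w - t) ** 2 for w, t in zip(weights, target))

def score_miner(weights, target, miner_grad, lr=0.1):
    # Apply the miner's gradient to a copy of the model and measure
    # how much the loss drops; a positive score means a helpful gradient.
    loss_before = loss(weights, target)
    updated = [w - lr * g for w, g in zip(weights, miner_grad)]
    loss_after = loss(updated, target)
    return loss_before - loss_after

model = [1.0, -2.0]
target = [0.0, 0.0]
honest = [2.0, -4.0]   # points in the true descent direction
junk = [-2.0, 4.0]     # points the opposite way
assert score_miner(model, target, honest) > 0 > score_miner(model, target, junk)
```

Because the score is a loss delta rather than a raw loss, a miner cannot profit from submitting large but uninformative gradients: a gradient that worsens the model yields a negative score.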
### Communication System

The communication system, implemented in the `Comms` class, handles data exchange between nodes:
- Gradient Exchange: Efficient transfer of compressed gradients
- Dataset Access: Loading training data from R2 storage
- Checkpoint Management: Saving and loading model states
- Blockchain Integration: Setting and getting weights on the Bittensor network
Key methods include `put()` for uploading data, `gather()` for collecting peer gradients, and `s3_get_object()`/`s3_put_object()` for R2 storage operations.
Sources: src/tplr/comms.py:64-221 , src/tplr/comms.py:322-371
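A minimal in-memory stand-in for this interface, assuming the put/gather semantics described above; the real `Comms` class talks to Cloudflare R2 over async S3 operations, and the signatures here are illustrative, not the actual API.

```python
import asyncio

class FakeComms:
    """In-memory sketch of the Comms put/gather pattern."""

    def __init__(self):
        self._bucket = {}          # key -> bytes; stands in for R2

    async def put(self, key, data):
        # Real code would call an s3_put_object-style upload here.
        self._bucket[key] = data

    async def gather(self, window, uids):
        # Collect each peer's upload for this window, skipping absentees.
        out = {}
        for uid in uids:
            key = f"gradient-{window}-{uid}.pt"
            if key in self._bucket:
                out[uid] = self._bucket[key]
        return out

async def demo():
    comms = FakeComms()
    await comms.put("gradient-5-1.pt", b"miner-1 grads")
    await comms.put("gradient-5-2.pt", b"miner-2 grads")
    # Miner 3 never uploaded, so gather simply omits it.
    return await comms.gather(window=5, uids=[1, 2, 3])

gathered = asyncio.run(demo())
```

Tolerating missing uploads in `gather()` matters in a permissionless network: a slow or offline miner must not stall the window for everyone else.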
### Gradient Compression

To reduce communication overhead, Templar uses:
- DCT Transform: Converting gradients to frequency domain
- Top-K Selection: Keeping only the most significant coefficients
- Momentum Tracking: Maintaining gradient momentum between updates
This compression is handled by the `TransformDCT` and `CompressDCT` classes (imported by both miners and validators).
Sources: neurons/miner.py:130-147 , neurons/validator.py:159-175
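A pure-Python sketch of the transform-then-select idea: convert the gradient to the frequency domain with a DCT-II, keep only the top-k coefficients, and invert on receipt. This is only one common DCT convention; the real `TransformDCT`/`CompressDCT` classes operate on torch tensors and differ in detail.

```python
import math

def dct(x):
    # Unnormalized DCT-II: X_k = sum_i x_i * cos(pi*k*(2i+1)/(2n))
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i in range(n)) for k in range(n)]

def idct(c):
    # Matching inverse (DCT-III with 1/n and 2/n scaling).
    n = len(c)
    return [(c[0] / n) + (2 / n) * sum(
        c[k] * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
        for k in range(1, n)) for i in range(n)]

def compress(grad, k):
    # Keep the k largest-magnitude frequency coefficients, as a sparse dict.
    coeffs = dct(grad)
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]),
                   reverse=True)
    return {i: coeffs[i] for i in order[:k]}

def decompress(sparse, n):
    # Zero-fill the dropped coefficients, then invert the transform.
    coeffs = [sparse.get(i, 0.0) for i in range(n)]
    return idct(coeffs)

grad = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
approx = decompress(compress(grad, 3), len(grad))
```

Smooth gradients concentrate their energy in a few low-frequency coefficients, which is why a small k can preserve most of the signal while cutting transfer size sharply.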
## Storage Architecture

```mermaid
graph TD
    subgraph "R2 Storage"
        GB["Gradients Bucket<br/>(comms.bucket)"]
        DB["Dataset Bucket<br/>(R2DatasetLoader)"]
        AB["Aggregator Bucket<br/>(comms.load_aggregation)"]
    end
    subgraph "Data Flow"
        M["Miner.run()"]
        V["Validator.run()"]
        A["Aggregator"]
    end
    M -->|"comms.put()<br/>gradient-{window}-{uid}.pt"| GB
    DB -->|"R2DatasetLoader.next_pages()<br/>data_{page_number}.parquet"| M
    GB -->|"comms.gather()<br/>gradient-{window}-{uid}.pt"| M
    V -->|"comms.put_peer_list()<br/>peers_{window}.json"| GB
    GB -->|"comms.gather()<br/>gradient-{window}-{uid}.pt"| V
    DB -->|"R2DatasetLoader.next_pages()<br/>data_{page_number}.parquet"| V
    A -->|"comms.put()<br/>aggregation-{window}.pt"| AB
    AB -->|"comms.load_aggregation()<br/>aggregation-{window}.pt"| V
    AB -->|"comms.load_aggregation()<br/>aggregation-{window}.pt"| M
```
Sources: src/tplr/comms.py:174-220 , neurons/miner.py:339-350 , neurons/validator.py:832-858
## System Configuration

Templar uses a configuration system with parameters defined in `hparams.json`. Key parameters include:
| Parameter | Description | Default Value |
|---|---|---|
| `topk_compression` | Compression ratio for gradients | 32 |
| `blocks_per_window` | Blockchain blocks per training window | 7 |
| `pages_per_window` | Dataset pages to process per window | 6 |
| `batch_size` | Training batch size | 6 |
| `learning_rate` | Base learning rate | 4e-4 |
| `checkpoint_frequency` | Windows between checkpoint saves | 100 |
| `validator_offset` | Windows validators lag behind miners | 2 |
Sources: hparams.json:1-53
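Loading these values in code is a plain JSON read. The fragment below is a minimal stand-in for the real `hparams.json` (which has many more keys), using only the parameters from the table above:

```python
import json

# Minimal stand-in for hparams.json; only the keys shown in the table.
hparams_text = """{
  "topk_compression": 32,
  "blocks_per_window": 7,
  "pages_per_window": 6,
  "batch_size": 6,
  "learning_rate": 4e-4,
  "checkpoint_frequency": 100,
  "validator_offset": 2
}"""

hparams = json.loads(hparams_text)
# In the real codebase this would be json.load(open("hparams.json")).
```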
## Incentive Mechanism

The incentive system aligns individual miner incentives with the collective goal of improving model performance:
- Gradient Evaluation: Validators compute a score based on loss improvement from each miner’s gradient
- OpenSkill Ratings: Miners are rated using the PlackettLuce model based on their contributions
- Weight Setting: Weights on the blockchain are updated based on these ratings
- Reward Distribution: Miners receive rewards proportional to their assigned weights
This mechanism encourages honest participation and quality contributions.
Sources: neurons/validator.py:356-437 , README.md:216-270
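The proportionality between contribution quality and reward can be sketched as a simple normalization step. This is a deliberate simplification: the real pipeline first rates miners with OpenSkill's PlackettLuce model and only then sets on-chain weights, whereas here scores map directly to weights.

```python
def scores_to_weights(scores):
    """Turn loss-improvement scores into normalized on-chain weights.

    Illustrative reduction of the real flow: negative scores
    (gradients that worsened the model) earn zero weight, and the
    rest are normalized so rewards are proportional to contribution.
    """
    clipped = {uid: max(s, 0.0) for uid, s in scores.items()}
    total = sum(clipped.values())
    if total == 0:
        return {uid: 0.0 for uid in scores}
    return {uid: s / total for uid, s in clipped.items()}

# uid 2 submitted a harmful gradient, so its weight drops to zero.
weights = scores_to_weights({1: 1.8, 2: -2.2, 3: 0.6})
```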
## Implementation Details

Templar is implemented in Python and relies on several key libraries:
- PyTorch: For model definition and training (LlamaForCausalLM)
- Bittensor: For blockchain integration and incentive mechanism
- Cloudflare R2: For distributed storage
- OpenSkill: For fair rating of miner contributions
- WandB/InfluxDB/Loki: For monitoring and telemetry
The implementation follows an asynchronous model, making extensive use of Python’s `asyncio` to handle concurrent operations.
Sources: pyproject.toml:11-36 , src/tplr/__init__.py:22-36
## Related Pages

- For detailed miner functionality, see Miners
- For validator operations, see Validators
- For aggregation server, see Aggregation Server
- For communication system details, see Communication System
- For system architecture, see System Architecture