Overview

Relevant Source Files

README.md
docs/miner.md
docs/validator.md
ecosystem.config.js
hparams.json
neurons/miner.py
neurons/validator.py
pyproject.toml
src/tplr/__init__.py
src/tplr/comms.py
tests/test_comms.py
uv.lock

This page introduces Templar, a decentralized training framework for large language models. Templar uses the Bittensor network to coordinate distributed training across heterogeneous compute resources connected over the internet.

Sources: pyproject.toml:5-9, README.md:38-47

What is Templar?

Templar is a system for incentivized distributed training of large language models. It connects diverse computational nodes through a carefully designed incentive mechanism, enabling collaborative training while ensuring honest participation and quality contributions. The framework implements a peer-to-peer architecture where participants contribute their computational resources to train a shared model, with rewards proportional to the quality of their contributions.

Sources: README.md:38-47, README.md:50-57

System Architecture

High-Level Architecture

graph TD
    subgraph "Bittensor Network"
        BT["Bittensor Blockchain"]
    end
    
    subgraph "Participant Nodes"
        MN["Miner<br/>(neurons/miner.py)"]
        VL["Validator<br/>(neurons/validator.py)"]
        AG["Aggregator<br/>(neurons/aggregator.py)"]
    end
    
    subgraph "Storage Layer"
        R2["Cloudflare R2 Storage"]
        subgraph "Buckets"
            GB["Gradients Bucket"]
            DB["Dataset Bucket"]
            AB["Aggregator Bucket"]
        end
    end
    
    subgraph "Monitoring"
        WB["Weights & Biases"]
        IF["InfluxDB"]
        LK["Loki Logging"]
    end
    
    MN <-->|"Set/get weights<br/>tplr.comms.ChainManager"| BT
    VL <-->|"Set/get weights<br/>tplr.comms.ChainManager"| BT
    
    MN -->|"Upload gradients<br/>comms.put()"| GB
    MN <-----|"Get datasets<br/>R2DatasetLoader"| DB
    MN <-----|"Gather peer gradients<br/>comms.gather()"| GB
    VL -->|"Upload evaluations<br/>comms.put()"| GB
    VL <-----|"Evaluate miner gradients<br/>comms.gather()"| GB
    AG <-----|"Gather & process gradients<br/>comms.gather()"| GB
    AG -->|"Store aggregated state<br/>comms.save_checkpoint()"| AB
    
    MN -.->|"Log metrics<br/>tplr.metrics.MetricsLogger"| WB
    VL -.->|"Log metrics<br/>tplr.metrics.MetricsLogger"| WB
    MN -.->|"Log metrics<br/>tplr.metrics.MetricsLogger"| IF
    VL -.->|"Log metrics<br/>tplr.metrics.MetricsLogger"| IF
    MN -.->|"Log events<br/>tplr.logger"| LK
    VL -.->|"Log events<br/>tplr.logger"| LK

Sources: neurons/miner.py:65-755, neurons/validator.py:85-775, src/tplr/comms.py:64-221, src/tplr/comms.py:383-414

Gradient Exchange Workflow

sequenceDiagram
    participant M as "Miner<br/>(neurons/miner.py)"
    participant R2 as "R2 Storage<br/>(tplr.comms.Comms)"
    participant V as "Validator<br/>(neurons/validator.py)"
    participant BT as "Bittensor<br/>(Chain)"
    
    Note over M,BT: Window N begins (current_window = N)
    
    M->>M: window_step = 0
    M->>R2: Load dataset pages for window (R2DatasetLoader.next_pages)
    M->>M: Train model on dataset batches
    M->>M: Accumulate gradients (loss.backward())
    M->>M: Compress gradients (tplr.prepare_gradient_dict)
    M->>R2: Upload compressed gradients (comms.put)
    
    V->>R2: Gather miner gradients (comms.gather)
    V->>V: Evaluate gradient quality (evaluate_model_on_batches)
    V->>V: Calculate scores (loss_before - loss_after)
    V->>V: Update OpenSkill ratings (update_openskill_ratings)
    V->>BT: Set weights on blockchain
    
    Note over M,BT: Window N+1 begins (current_window = N+1)
    
    M->>R2: Gather peer gradients (comms.gather)
    M->>M: Decompress and apply gradients
    M->>M: Update model (optimizer.step)
    M->>M: global_step += 1

Sources: neurons/miner.py:225-255, neurons/validator.py:490-514, neurons/validator.py:516-775
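
The last steps of the diagram (gather, apply, step) can be illustrated with a toy update. This is a minimal sketch, assuming peer gradients are already decompressed, combined by a plain element-wise mean, and applied with vanilla SGD. The helpers `average_gradients` and `apply_peer_gradients` are hypothetical names for illustration and are not taken from the Templar codebase, which works on PyTorch tensors and a real optimizer.

```python
from typing import Dict, List


def average_gradients(peer_grads: Dict[int, List[float]]) -> List[float]:
    """Element-wise mean of gradients gathered from peers, keyed by UID."""
    if not peer_grads:
        raise ValueError("no peer gradients gathered")
    count = len(peer_grads)
    length = len(next(iter(peer_grads.values())))
    return [sum(g[i] for g in peer_grads.values()) / count for i in range(length)]


def apply_peer_gradients(
    params: List[float], peer_grads: Dict[int, List[float]], lr: float
) -> List[float]:
    """One plain SGD step using the averaged peer gradient (assumed rule)."""
    avg = average_gradients(peer_grads)
    return [p - lr * g for p, g in zip(params, avg)]


# Window N+1: gradients uploaded during window N are gathered and applied.
params = [1.0, -2.0]
gathered = {7: [0.5, 1.0], 11: [1.5, -1.0]}  # hypothetical UIDs and values
global_step = 0

params = apply_peer_gradients(params, gathered, lr=0.1)
global_step += 1
```

Every miner that gathers the same set of peer gradients and applies the same rule ends the window with the same parameters, which is what keeps the replicas of the shared model in step.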

Core Components

Miners

Miners train the model on assigned data subsets and share their gradients with peers. They:

Load a subset of the dataset based on the current window and their UID
Perform forward and backward passes to compute gradients
Compress gradients using DCT transform and top-k selection
Upload compressed gradients to R2 storage
Gather and apply peer gradients to update their model
Progress to the next window

The Miner class in neurons/miner.py implements this functionality, with its main loop in the asynchronous run() method.

Sources: neurons/miner.py:65-229, README.md:64-117
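
The first step, selecting data from the current window and the miner's UID, can be sketched as a deterministic draw. This is an illustration only: the function `pages_for`, the seeding scheme, and the `total_pages` value are assumptions, and the actual selection logic lives in `R2DatasetLoader`.

```python
import random
from typing import List


def pages_for(uid: int, window: int, pages_per_window: int, total_pages: int) -> List[int]:
    """Pick dataset page numbers deterministically from (uid, window).

    Hypothetical scheme: a private RNG is seeded with a string built from
    the UID and window, so repeated calls return the same pages.
    """
    rng = random.Random(f"{uid}-{window}")
    return rng.sample(range(total_pages), pages_per_window)


# Same inputs always give the same assignment.
first = pages_for(uid=3, window=10, pages_per_window=6, total_pages=1000)
again = pages_for(uid=3, window=10, pages_per_window=6, total_pages=1000)
```

Because the draw depends only on public values, any other node could recompute which pages a given miner was expected to train on for a given window.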

Validators

Validators evaluate miners’ gradient contributions and set weights on the blockchain. They:

Gather gradients submitted by miners
Evaluate each miner’s contribution by measuring loss improvement
Calculate scores based on the performance improvement
Update OpenSkill ratings for miners
Set weights on the blockchain to influence reward distribution

The Validator class in neurons/validator.py implements this functionality, with evaluation logic in evaluate_model_on_batches() and weight setting in update_weights().

Sources: neurons/validator.py:85-144, neurons/validator.py:356-437, README.md:140-184
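
The loss-improvement score (`loss_before - loss_after`) can be shown on a one-parameter toy model. This is a hedged sketch, not the logic of `evaluate_model_on_batches()`: the model, the `mse_loss` and `score_gradient` helpers, and the plain SGD step are all assumptions chosen to keep the arithmetic easy to follow.

```python
from typing import List, Tuple

Batch = List[Tuple[float, float]]  # (input, target) pairs


def mse_loss(w: float, batch: Batch) -> float:
    """Mean squared error of the toy model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in batch) / len(batch)


def score_gradient(w: float, grad: float, lr: float, batch: Batch) -> float:
    """Score a miner's gradient as the loss drop after applying it once."""
    loss_before = mse_loss(w, batch)
    loss_after = mse_loss(w - lr * grad, batch)
    return loss_before - loss_after


batch = [(1.0, 2.0), (2.0, 4.0)]  # data generated by y = 2x
w = 0.0

helpful = score_gradient(w, grad=-10.0, lr=0.1, batch=batch)  # moves w toward 2
harmful = score_gradient(w, grad=10.0, lr=0.1, batch=batch)   # moves w away from 2
```

A gradient that lowers the loss gets a positive score and one that raises it gets a negative score, so the sign alone already separates useful contributions from damaging ones.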

Communication System

The communication system, implemented in the Comms class, handles data exchange between nodes:

Gradient Exchange: Efficient transfer of compressed gradients
Dataset Access: Loading training data from R2 storage
Checkpoint Management: Saving and loading model states
Blockchain Integration: Setting and getting weights on the Bittensor network

Key methods include put() for uploading data, gather() for collecting peer gradients, and s3_get_object()/s3_put_object() for R2 storage operations.

Sources: src/tplr/comms.py:64-221, src/tplr/comms.py:322-371
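
The put/gather pattern can be sketched with an in-memory stand-in for the object store. The class `InMemoryComms` and its method signatures are hypothetical and do not mirror the real `Comms` API. Only the key format `gradient-{window}-{uid}.pt` is taken from this page's storage diagram. The sketch assumes that peers whose object is missing or slow are simply skipped.

```python
import asyncio
from typing import Any, Dict, List, Optional


class InMemoryComms:
    """Hypothetical stand-in for an object store; not tplr.comms.Comms."""

    def __init__(self) -> None:
        self._objects: Dict[str, Any] = {}

    @staticmethod
    def gradient_key(window: int, uid: int) -> str:
        return f"gradient-{window}-{uid}.pt"

    async def put(self, uid: int, window: int, payload: Any) -> None:
        self._objects[self.gradient_key(window, uid)] = payload

    async def _get(self, uid: int, window: int) -> Optional[Any]:
        await asyncio.sleep(0)  # yield control, as a network call would
        return self._objects.get(self.gradient_key(window, uid))

    async def gather(self, uids: List[int], window: int, timeout: float = 1.0) -> Dict[int, Any]:
        """Fetch all peers concurrently; drop peers that are missing or time out."""

        async def fetch(uid: int) -> Optional[Any]:
            try:
                return await asyncio.wait_for(self._get(uid, window), timeout)
            except asyncio.TimeoutError:
                return None

        results = await asyncio.gather(*(fetch(uid) for uid in uids))
        return {uid: res for uid, res in zip(uids, results) if res is not None}


async def demo() -> Dict[int, Any]:
    comms = InMemoryComms()
    await comms.put(uid=1, window=5, payload=[0.1, 0.2])
    await comms.put(uid=2, window=5, payload=[0.3, 0.4])
    return await comms.gather(uids=[1, 2, 3], window=5)  # UID 3 never uploaded


gathered = asyncio.run(demo())
```

Fetching concurrently with a per-peer timeout means one unresponsive peer cannot stall the whole window, which matters when nodes are connected over the public internet.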

Gradient Compression

To reduce communication overhead, Templar uses:

DCT Transform: Converting gradients to the frequency domain
Top-K Selection: Keeping only the most significant coefficients
Momentum Tracking: Maintaining gradient momentum between updates

This compression is handled by TransformDCT and CompressDCT classes (imported in miners and validators).

Sources: neurons/miner.py:130-147, neurons/validator.py:159-175
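
The DCT and top-k steps can be sketched in pure Python on a flat list of numbers. This is a minimal illustration under stated assumptions, not the `TransformDCT`/`CompressDCT` implementation: it uses a naive O(n^2) orthonormal DCT-II, treats `k` as an absolute coefficient count, and omits momentum tracking. The function names are hypothetical.

```python
import math
from typing import List, Tuple


def _scale(k: int, n: int) -> float:
    # Orthonormal scaling, so idct(dct(x)) reproduces x.
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct(x: List[float]) -> List[float]:
    """Naive orthonormal DCT-II."""
    n = len(x)
    return [
        _scale(k, n) * sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
        for k in range(n)
    ]


def idct(c: List[float]) -> List[float]:
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(c)
    return [
        sum(_scale(k, n) * c[k] * math.cos(math.pi / n * (i + 0.5) * k) for k in range(n))
        for i in range(n)
    ]


def compress(grad: List[float], k: int) -> Tuple[List[int], List[float]]:
    """Keep the k largest-magnitude DCT coefficients as (indices, values)."""
    coeffs = dct(grad)
    top = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)[:k]
    top.sort()
    return top, [coeffs[i] for i in top]


def decompress(indices: List[int], values: List[float], n: int) -> List[float]:
    """Rebuild a length-n gradient; dropped coefficients are treated as zero."""
    coeffs = [0.0] * n
    for i, v in zip(indices, values):
        coeffs[i] = v
    return idct(coeffs)


# A constant gradient has all its energy in the first (DC) coefficient,
# so a single coefficient is enough to reconstruct it.
grad = [1.0] * 8
indices, values = compress(grad, k=1)
restored = decompress(indices, values, n=len(grad))
```

Only the index/value pairs need to be transmitted. The approach pays off when most of the gradient's energy sits in a few frequency components, and it loses detail when it does not.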

Storage Architecture

graph TD
    subgraph "R2Storage"
        GB["Gradients Bucket<br/>(comms.bucket)"]
        DB["Dataset Bucket<br/>(R2DatasetLoader)"]
        AB["Aggregator Bucket<br/>(comms.load_aggregation)"]
    end
    
    subgraph "DataFlow"
        M["Miner.run()"]
        V["Validator.run()"]
        A["Aggregator"]
    end
    
    M -->|"comms.put()<br/>gradient-{window}-{uid}.pt"| GB
    M <--->|"R2DatasetLoader.next_pages()<br/>data_{page_number}.parquet"| DB
    M <--->|"comms.gather()<br/>gradient-{window}-{uid}.pt"| GB
    
    V -->|"comms.put_peer_list()<br/>peers_{window}.json"| GB
    V <--->|"comms.gather()<br/>gradient-{window}-{uid}.pt"| GB
    V <--->|"R2DatasetLoader.next_pages()<br/>data_{page_number}.parquet"| DB
    
    A -->|"comms.put()<br/>aggregation-{window}.pt"| AB
    
    V <--->|"comms.load_aggregation()<br/>aggregation-{window}.pt"| AB
    M <--->|"comms.load_aggregation()<br/>aggregation-{window}.pt"| AB

Sources: src/tplr/comms.py:174-220, neurons/miner.py:339-350, neurons/validator.py:832-858

System Configuration

Templar uses a configuration system with parameters defined in hparams.json. Key parameters include:

Parameter	Description	Default Value
`topk_compression`	Compression ratio for gradients	32
`blocks_per_window`	Blockchain blocks per training window	7
`pages_per_window`	Dataset pages to process per window	6
`batch_size`	Training batch size	6
`learning_rate`	Base learning rate	4e-4
`checkpoint_frequency`	Windows between checkpoint saves	100
`validator_offset`	Windows validators lag behind miners	2

Sources: hparams.json:1-53
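
As a sketch of how these parameters might be consumed, the snippet below parses a JSON fragment containing only the values from the table and derives window indices from a block height. The mapping `block // blocks_per_window` and the subtraction of `validator_offset` are assumptions inferred from the parameter descriptions, and `window_for_block` is a hypothetical helper. The real `hparams.json` contains more keys than shown.

```python
import json
from types import SimpleNamespace

# Fragment holding only the parameters listed in the table above.
HPARAMS_JSON = """
{
  "topk_compression": 32,
  "blocks_per_window": 7,
  "pages_per_window": 6,
  "batch_size": 6,
  "learning_rate": 4e-4,
  "checkpoint_frequency": 100,
  "validator_offset": 2
}
"""

hparams = SimpleNamespace(**json.loads(HPARAMS_JSON))


def window_for_block(block: int, blocks_per_window: int) -> int:
    """Assumed mapping: each window spans a fixed number of chain blocks."""
    return block // blocks_per_window


current_window = window_for_block(1000, hparams.blocks_per_window)
# Assumed: validators evaluate a window that trails the miners' window.
eval_window = current_window - hparams.validator_offset
# Assumed: a checkpoint is saved whenever the window index hits the frequency.
save_checkpoint = current_window % hparams.checkpoint_frequency == 0
```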

Incentive Mechanism

The incentive system aligns individual miner incentives with the collective goal of improving model performance:

Gradient Evaluation: Validators compute a score based on loss improvement from each miner’s gradient
OpenSkill Ratings: Miners are rated using OpenSkill's Plackett-Luce model based on their contributions
Weight Setting: Weights on the blockchain are updated based on these ratings
Reward Distribution: Miners receive rewards proportional to their assigned weights

This mechanism encourages honest participation and quality contributions.

Sources: neurons/validator.py:356-437, README.md:216-270
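
The step from ratings to weights can be sketched without the rating update itself. The snippet assumes each miner already has a rating expressed as a mean `mu` and uncertainty `sigma`, uses the conservative estimate `mu - 3 * sigma` (OpenSkill's default ordinal), clips negatives to zero, and normalizes. The clipping and normalization rule and the name `ratings_to_weights` are assumptions for illustration, not the validator's actual `update_weights()` logic.

```python
from typing import Dict, Tuple


def ordinal(mu: float, sigma: float, z: float = 3.0) -> float:
    """Conservative skill estimate: mean minus z standard deviations."""
    return mu - z * sigma


def ratings_to_weights(ratings: Dict[int, Tuple[float, float]]) -> Dict[int, float]:
    """Map {uid: (mu, sigma)} to weights that sum to 1 (assumed rule)."""
    scores = {uid: max(ordinal(mu, sigma), 0.0) for uid, (mu, sigma) in ratings.items()}
    total = sum(scores.values())
    if total == 0.0:
        return {uid: 0.0 for uid in scores}
    return {uid: score / total for uid, score in scores.items()}


# Hypothetical ratings: UID 3 has a decent mean but high uncertainty.
ratings = {1: (30.0, 2.0), 2: (25.0, 5.0), 3: (20.0, 8.0)}
weights = ratings_to_weights(ratings)
```

Using a conservative estimate means a miner with few evaluated contributions (large `sigma`) earns little weight until its rating has been confirmed over more windows.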

Implementation Details

Templar is implemented in Python and relies on several key libraries and services:

PyTorch: For model definition and training (the model class is LlamaForCausalLM, from Hugging Face Transformers)
Bittensor: For blockchain integration and incentive mechanism
Cloudflare R2: For distributed storage
OpenSkill: For fair rating of miner contributions
WandB/InfluxDB/Loki: For monitoring and telemetry

The implementation follows an asynchronous model, with extensive use of Python’s asyncio for handling concurrent operations.

Sources: pyproject.toml:11-36, src/tplr/__init__.py:22-36

For detailed miner functionality, see Miners
For validator operations, see Validators
For aggregation server, see Aggregation Server
For communication system details, see Communication System
For system architecture, see System Architecture