Gradient Processing

This page explains how miners in the Templar framework process and share gradients, including momentum, weight decay, and the compression techniques used to efficiently distribute updates across the network. For information about how validators evaluate these gradients, see Weight Setting.

Gradient processing is a critical component of the decentralized training process in Templar. Miners compute gradients during local training, process them through a series of transformations, and then share these compressed representations with validators and other miners via the Cloudflare R2 storage system.

flowchart TD
    subgraph "Miner Gradient Processing"
        A["Local Training Loop"] --> B["Gradient Computation"]
        B --> C["Weight Decay Application"]
        C --> D["Momentum Update"]
        D --> E["DCT Transform"]
        E --> F["Top-K Compression"]
        F --> G["Upload to R2 Storage"]
    end
    
    subgraph "Validator Processing"
        H["Download from R2 Storage"] --> I["Decompress Gradients"]
        I --> J["Evaluate Improvements"]
        J --> K["Set Weights on Chain"]
    end
    
    G --> H

Sources: neurons/miner.py:399-402, neurons/validator.py:827-838, src/tplr/neurons.py:40-124

The gradient processing system consists of several key components that work together to optimize, compress, and distribute model updates:

classDiagram
    class TransformDCT {
        +target_chunk: int
        +shape_dict: dict
        +f_dict: dict
        +b_dict: dict
        +encode(x): tensor
        +decode(x): tensor
    }
    
    class CompressDCT {
        +compress(x, topk): tuple
        +decompress(p, idx, val, xshape, totalk): tensor
        +batch_decompress(p, idx, val, xshape, totalk): tensor
    }
    
    class prepare_gradient_dict {
        <<function>>
    }
    
    prepare_gradient_dict --> TransformDCT: uses
    prepare_gradient_dict --> CompressDCT: uses

Sources: src/tplr/compress.py:35-178, src/tplr/neurons.py:40-124

During training, miners apply weight decay and momentum updates to gradients before compression. This process helps stabilize training and improve convergence.

Weight decay helps prevent overfitting by regularizing model parameters. It is applied directly to the parameters before updating the momentum:

flowchart LR
    A["Parameter p"] --> B["p.data.mul_(1.0 - lr * weight_decay)"]
    B --> C["Parameter p'"]

Momentum helps accelerate gradients in the relevant direction and dampens oscillations. In Templar, the momentum buffer is updated as follows:

  1. It is first scaled by a decay factor to reduce the influence of older gradients
  2. It is then incremented by the current gradient scaled by the learning rate
  3. On the first iteration only, it is set directly to the gradient scaled by the learning rate to avoid a cold start
flowchart TD
    A["First Iteration?"] -- "Yes" --> B["momentum = grad * lr"]
    A -- "No" --> C["momentum *= momentum_decay"]
    C --> D["momentum += grad * lr"]
    B --> E["Proceed to Compression"]
    D --> E
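
The two steps above can be summarized in a few lines of PyTorch. This is a minimal sketch, not the repository's code: the function name, the momentum_decay argument, and the is_first_iteration flag are illustrative, while the actual logic lives in prepare_gradient_dict (src/tplr/neurons.py:80-97).

```python
import torch

def apply_weight_decay_and_momentum(
    param: torch.nn.Parameter,
    momentum: torch.Tensor,
    lr: float,
    weight_decay: float,
    momentum_decay: float,
    is_first_iteration: bool,
) -> None:
    """Illustrative sketch of the per-parameter update described above."""
    # Decoupled weight decay: shrink the parameter itself, not the gradient.
    param.data.mul_(1.0 - lr * weight_decay)

    if is_first_iteration:
        # Cold start: seed the momentum buffer with the scaled gradient.
        momentum.copy_(param.grad * lr)
    else:
        # Decay the accumulated momentum, then add the new scaled gradient.
        momentum.mul_(momentum_decay)
        momentum.add_(param.grad, alpha=lr)
```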

Sources: src/tplr/neurons.py:80-97, neurons/miner.py:80-97

The Templar system uses Discrete Cosine Transform (DCT) based compression to efficiently share gradients across the network.

flowchart LR
    subgraph "Encoding"
        A["Original Gradient"] --> B["Split into Chunks"]
        B --> C["Apply DCT Transform"]
        C --> D["Keep Top-K Values"]
        D --> E["Compress to (indices, values)"]
    end
    
    subgraph "Decoding"
        F["(indices, values)"] --> G["Reconstruct Sparse Matrix"]
        G --> H["Apply Inverse DCT"]
        H --> I["Reassemble Chunks"]
        I --> J["Reconstructed Gradient"]
    end
    
    E -- "Transfer" --> F

The TransformDCT class handles the encoding and decoding process, while CompressDCT manages the selection of top-K components and creates compressed representations.
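
The round trip can be illustrated with a self-contained sketch that re-implements the idea using an orthonormal DCT basis and torch.topk. It is not the repository's TransformDCT/CompressDCT code; the chunk size of 64 and top-k of 32 simply mirror the defaults listed in the next section.

```python
import math
import torch

def dct_basis(n: int) -> torch.Tensor:
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = torch.arange(n).unsqueeze(1).float()          # frequency index
    i = torch.arange(n).unsqueeze(0).float()          # sample index
    basis = torch.cos(math.pi * (2 * i + 1) * k / (2 * n)) * math.sqrt(2.0 / n)
    basis[0] /= math.sqrt(2.0)                        # normalize the DC row
    return basis                                      # basis @ basis.T == identity

def compress_chunk(chunk: torch.Tensor, basis: torch.Tensor, topk: int):
    """DCT-transform a 1-D chunk and keep only the top-k coefficients."""
    coeffs = basis @ chunk                            # forward DCT
    idx = coeffs.abs().topk(topk).indices             # largest-magnitude frequencies
    vals = coeffs[idx]
    return idx, vals

def decompress_chunk(idx, vals, basis: torch.Tensor) -> torch.Tensor:
    """Rebuild a dense chunk from (indices, values) and invert the DCT."""
    coeffs = torch.zeros(basis.shape[0])
    coeffs[idx] = vals                                # sparse frequency spectrum
    return basis.t() @ coeffs                         # inverse DCT

# Round trip on a single 64-element chunk, keeping 32 coefficients.
basis = dct_basis(64)
chunk = torch.randn(64)
idx, vals = compress_chunk(chunk, basis, topk=32)
approx = decompress_chunk(idx, vals, basis)
print(torch.norm(chunk - approx) / torch.norm(chunk))  # relative reconstruction error
```

Only the indices and values cross the network; the receiver rebuilds the sparse spectrum and inverts the transform.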

Sources: src/tplr/compress.py:35-178, src/tplr/neurons.py:100-112

Two key hyperparameters control the compression process:

| Parameter | Description | Default Value |
| --- | --- | --- |
| target_chunk | Size of chunks for DCT transform | 64 |
| topk_compression | Number of DCT coefficients to keep | 32 |

The combination of these parameters allows for significant compression while preserving essential gradient information.
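
Assuming the keys in hparams.json match the parameter names above, loading them is straightforward (the file path and surrounding code are illustrative, not taken from the repository):

```python
import json

# Load the compression hyperparameters shipped with the repository.
with open("hparams.json") as f:
    hparams = json.load(f)

target_chunk = hparams["target_chunk"]          # DCT chunk size (default 64)
topk_compression = hparams["topk_compression"]  # coefficients kept (default 32)
```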

Sources: hparams.json:12-13, neurons/miner.py:132-134

The prepare_gradient_dict function is the central component that orchestrates the entire gradient processing pipeline:

flowchart TD
    A["Model Parameters & Gradients"] --> B["Apply Weight Decay"]
    B --> C["Update Momentum"]
    C --> D["Transform via DCT"]
    D --> E["Compress with Top-K Selection"]
    E --> F["Create Gradient Dictionary"]
    F --> G["Attach Metadata"]
    G --> H["Return for Communication"]

This function:

  1. Applies weight decay to model parameters
  2. Updates momentum tensors with current gradients
  3. Transforms and compresses using DCT
  4. Creates a dictionary containing compressed gradient data
  5. Attaches metadata like the current window and pages information
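
Putting these steps together, a heavily simplified version of such a function might look as follows. The helper objects, the 0.9 decay constant, and the assumed ordering of compress()'s return tuple are all illustrative; the authoritative implementation is prepare_gradient_dict in src/tplr/neurons.py:40-124.

```python
import torch

def build_gradient_dict(model, momentum, transformer, compressor,
                        lr, weight_decay, topk, window, pages):
    """Simplified sketch of the pipeline described above."""
    gradient = {}
    for name, p in model.named_parameters():
        # 1-2. Weight decay on the parameter, then the momentum update.
        p.data.mul_(1.0 - lr * weight_decay)
        momentum[name].mul_(0.9).add_(p.grad, alpha=lr)  # 0.9 stands in for momentum_decay

        # 3. DCT transform followed by top-k selection.
        encoded = transformer.encode(momentum[name])
        idx, val = compressor.compress(encoded, topk)[:2]  # assumed return ordering

        # 4. Store only the compressed representation under suffixed keys.
        gradient[name + "idxs"] = idx
        gradient[name + "vals"] = val

    # 5. Attach metadata so validators can match the upload to a training window.
    gradient["metadata"] = {"window": window, "pages": pages}
    return gradient
```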

Sources: src/tplr/neurons.py:40-124

The TransformDCT class handles the mathematical transformation of gradients using Discrete Cosine Transform. It:

  1. Initializes by generating DCT basis matrices for all parameter shapes
  2. Encodes parameters by transforming them into the frequency domain
  3. Decodes parameters by transforming from frequency domain back to spatial domain

The DCT transformation concentrates most of the signal energy in fewer coefficients, allowing for effective compression by discarding less important components.
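
A quick, self-contained illustration of this energy compaction (using SciPy's scipy.fft.dct rather than the repository's implementation): a smooth, gradient-like signal keeps almost all of its energy in a handful of coefficients.

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(0)
x = np.cumsum(rng.standard_normal(64)) / 8.0   # smooth, low-frequency-dominated signal
coeffs = dct(x, norm="ortho")                  # orthonormal DCT-II preserves total energy

# Fraction of total energy captured by the 8 largest-magnitude coefficients.
sorted_energy = np.sort(np.abs(coeffs))[::-1] ** 2
print(f"{sorted_energy[:8].sum() / sorted_energy.sum():.1%}")
```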

Sources: src/tplr/compress.py:35-120

The CompressDCT class handles the actual compression by:

  1. Taking DCT-transformed tensors and selecting the top-K components by magnitude
  2. Storing only the indices and values of these components
  3. Providing methods to reconstruct the original tensor from the compressed representation

Sources: src/tplr/compress.py:123-178

The flow of gradient data through the system illustrates how miners and validators interact:

sequenceDiagram
    participant Miner
    participant R2Storage as "R2 Gradient Bucket"
    participant Validator
    
    Miner->>Miner: Train model & compute gradients
    Miner->>Miner: Apply weight decay
    Miner->>Miner: Update momentum
    Miner->>Miner: Compress gradients (DCT + Top-K)
    Miner->>R2Storage: Upload compressed gradients
    
    Validator->>R2Storage: Download compressed gradients
    Validator->>Validator: Decompress gradients
    Validator->>Validator: Evaluate model improvements
    Validator->>Validator: Update miner scores

Each gradient upload contains:

  • Compressed parameter indices and values for each layer
  • Metadata about the current window and training pages
  • Timestamp information for proper synchronization

Sources: neurons/miner.py:415-426, neurons/validator.py:827-859

Gradients are shared via the Comms system, which handles all data exchange in Templar:

flowchart TD
    A["Miner"] --> B["prepare_gradient_dict()"]
    B --> C["Processed Gradient Dict"]
    C --> D["comms.put()"]
    D --> E["R2 Storage"]
    
    F["Validator"] --> G["comms.gather()"]
    G --> E
    G --> H["Gathered Gradients"]
    H --> I["Evaluate Improvement"]

The gradient dictionary follows a structured format with special keys:

  • Parameter names with “idxs” suffix containing compressed indices
  • Parameter names with “vals” suffix containing compressed values
  • “metadata” key with window and page information
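
For illustration, a single upload therefore looks roughly like the dictionary below. The parameter names, the window number, and the exact metadata fields are placeholders, not values taken from the repository.

```python
gradient_dict = {
    "transformer.h.0.attn.weightidxs": ...,  # top-k indices for that tensor
    "transformer.h.0.attn.weightvals": ...,  # matching DCT coefficient values
    # ...one idxs/vals pair per model parameter...
    "metadata": {
        "window": 1234,        # training window this upload belongs to
        "pages": [...],        # dataset pages used for local training
    },
}
```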

Sources: neurons/miner.py:399-426, src/tplr/comms.py:324-373

The gradient processing includes special handling for the initial training iterations:

  1. In the first iteration, the momentum is set directly to the gradient scaled by the learning rate
  2. For the first 5 iterations, the system skips subtracting transmitted gradients from the momentum

This approach helps stabilize early training and enables faster initial convergence.
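
A minimal sketch of this special-casing, with an illustrative function name and a WARMUP_STEPS constant standing in for the hard-coded value of 5:

```python
import torch

WARMUP_STEPS = 5  # number of early iterations that skip the subtraction

def update_momentum(momentum, grad, transmitted, lr, decay, step):
    if step == 1:
        momentum.copy_(grad * lr)                  # cold start: seed with the scaled gradient
    else:
        momentum.mul_(decay).add_(grad, alpha=lr)

    if step > WARMUP_STEPS:
        # After warm-up, subtract what was already transmitted so the buffer
        # only carries the residual that has not yet been shared.
        momentum.sub_(transmitted)
    return momentum
```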

Sources: src/tplr/neurons.py:91-112

The DCT-based compression technique achieves significant reduction in communication overhead:

| Aspect | Raw Size (MB) | Compressed (KB) | Compression Ratio |
| --- | --- | --- | --- |
| Large models | 100s-1000s | 10s | ~100x-1000x |
| Target parameters | Full model | topk per tensor | Proportional to topk value |
| Retained information | 100% | Significant | Based on frequency spectrum |

The compression approach prioritizes the most important gradient components by leveraging the sparsity that emerges when transforming to the frequency domain using DCT.

Sources: hparams.json:12-13, src/tplr/compress.py:100-112

After receiving and decompressing gradients from other miners, the miner applies these updates to the model:

  1. The gathered gradients are decompressed and transformed back to parameter space
  2. The decompressed gradients are assigned to the .grad attribute of model parameters
  3. The optimizer’s step() method is called to apply the updates using the configured learning rate
  4. The scheduler updates the learning rate for the next iteration
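
Assuming the gathered result is exposed as a plain dictionary with the same key scheme as the upload, and that the shape bookkeeping (xshapes, totalks) from compression time is available, the update step might be sketched as follows; see neurons/miner.py:559-600 for the actual logic.

```python
import torch

def apply_gathered_gradients(model, gathered, transformer, compressor,
                             xshapes, totalks, optimizer, scheduler):
    """Illustrative sketch of applying decompressed peer gradients."""
    for name, p in model.named_parameters():
        idx = gathered.get(name + "idxs")
        val = gathered.get(name + "vals")
        if idx is None or val is None:
            continue  # no peer sent an update for this tensor
        # Rebuild the dense gradient from the compressed representation.
        dense = compressor.batch_decompress(p, idx, val, xshapes[name], totalks[name])
        p.grad = transformer.decode(dense).to(p.device)

    optimizer.step()   # apply the update with the configured learning rate
    scheduler.step()   # advance the learning-rate schedule
```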

Sources: neurons/miner.py:559-600