Miners
This document provides a detailed explanation of miners in the Templar decentralized training framework. Miners are computational nodes responsible for training language models on assigned data and sharing their gradients with the network. For information about how validators evaluate miners’ contributions, see Validators.
Miner Architecture and Positioning
Miners are a core component of Templar’s distributed training infrastructure. They work alongside validators and the aggregation server to collaboratively train large language models across distributed nodes.
Miner Position in Templar Architecture
```mermaid
graph TD
subgraph "Bittensor Network"
BT["bt.subtensor"]
MG["metagraph"]
end
subgraph "Miner Node"
ML["LlamaForCausalLM Model"]
OP["SGD Optimizer"]
TR["TransformDCT"]
CP["CompressDCT"]
CO["Comms Module"]
SC["LR Scheduler"]
WL["Wallet"]
end
subgraph "External Components"
R2["R2 Storage"]
DS["Dataset Loader"]
CK["Checkpoints"]
PE["Peer Miners"]
VL["Validators"]
end
BT <--"Block events"--> CO
WL --"Authentication"--> BT
ML --"Forward/Backward pass"--> OP
OP --"Gradients"--> TR
TR --"Encoded gradients"--> CP
CP --"Compressed gradients"--> CO
CO --"Upload gradients"--> R2
R2 --"Peer gradients"--> CO
DS --"Training data"--> ML
CO --"Load/Save model state"--> CK
SC --"Update learning rate"--> OP
R2 --"Evaluate gradients"--> VL
VL --"Set weights"--> BT
BT --"Rewards"--> WL
```
Sources: neurons/miner.py:65-226
Miner Implementation
The miner implementation is structured around the Miner class, which coordinates training, gradient processing, and communication.
Key Components
- Model: `LlamaForCausalLM`, the foundational language model being trained
- Optimizer: `SGD` with momentum for updating model parameters
- Transformers: `TransformDCT` and `CompressDCT` for gradient compression
- Communications: `Comms` module for gradient sharing via R2 storage
- Scheduler: Learning rate scheduler combining warm-up and cosine annealing (see the sketch below)
Sources: neurons/miner.py:107-226
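The warm-up plus cosine-annealing schedule can be pictured with a small, self-contained sketch. This is illustrative only: it uses PyTorch's built-in schedulers, a stand-in model, and assumed warm-up/decay lengths rather than values taken from the Templar code.

```python
# Illustrative sketch only: warm-up followed by cosine annealing, composed
# with SequentialLR. Milestone and horizon values are assumptions, not
# Templar's actual settings.
import torch

model = torch.nn.Linear(16, 16)  # stand-in for LlamaForCausalLM
optimizer = torch.optim.SGD(model.parameters(), lr=4e-4, momentum=0.9)

warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.1, total_iters=250  # ramp LR up over 250 steps (assumed)
)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=10_000  # cosine decay horizon (assumed)
)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[250]
)

for _ in range(3):  # each training step advances the optimizer, then the schedule
    optimizer.step()
    scheduler.step()
```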
Miner Lifecycle
The following sequence diagram illustrates the complete lifecycle of a miner during operation:
```mermaid
sequenceDiagram
participant M as Miner
participant R2 as R2 Storage
participant P as Peers
participant B as Blockchain
participant AS as Aggregation Server
Note over M: Initialization
M->>B: Register with metagraph
M->>AS: Get start_window
Note over M: Synchronization
M->>R2: Load latest checkpoint
alt checkpoint found
R2->>M: Return model, optimizer, momentum
else no checkpoint
Note over M: Initialize model from scratch
end
M->>AS: Catch up with aggregator (if needed)
loop For each window
B->>M: Block listener detects new window
M->>B: Update peers (tplr.neurons.update_peers)
M->>R2: Load dataset pages (R2DatasetLoader.next_pages)
Note over M: Training
loop For each batch
Note over M: Forward pass (compute loss)
Note over M: Backward pass (compute gradients)
end
Note over M: Gradient Processing
Note over M: Apply momentum update
Note over M: Compress gradients with DCT and top-k
M->>R2: Upload compressed gradients (comms.put)
Note over M: Peer Gradient Exchange
M->>R2: Request peer gradients (comms.gather)
R2->>M: Return compressed peer gradients
Note over M: Model Update
Note over M: Decompress peer gradients
Note over M: Apply aggregated gradients
Note over M: Step optimizer and scheduler
alt global_step % checkpoint_frequency == 0
M->>R2: Save checkpoint
end
Note over M: Metrics & Logging
M->>WandB: Log metrics
M->>InfluxDB: Log system metrics
end
```
Sources: neurons/miner.py:229-755
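The outer loop in the diagram is driven by the block listener: training is organized into windows of blockchain blocks. The sketch below shows that loop schematically; the block-to-window mapping via blocks_per_window is an assumption based on the hyperparameter table later in this page, and the helper names (get_current_block, handle_window) are hypothetical, not Templar's actual API.

```python
# Schematic only: a block-driven window loop. The mapping from blocks to
# windows is an assumption; the real listener lives in neurons/miner.py.
import asyncio
from typing import Awaitable, Callable

async def window_loop(
    get_current_block: Callable[[], int],              # hypothetical block source
    handle_window: Callable[[int], Awaitable[None]],   # hypothetical per-window work
    blocks_per_window: int = 7,
) -> None:
    last_window = -1
    while True:
        window = get_current_block() // blocks_per_window  # assumed mapping
        if window != last_window:
            last_window = window
            await handle_window(window)  # load pages, train, share gradients
        await asyncio.sleep(1)           # poll for new blocks
```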
Training Process
The miner’s primary responsibility is to train the language model on assigned data and share the resulting gradients. Here’s how the training process works:
Dataset Assignment
For each window, miners receive specific dataset pages:
```python
pages = await tplr.r2_dataset.R2DatasetLoader.next_pages(
    offset=step_window * self.hparams.pages_per_window,
    n_pages=self.hparams.pages_per_window,
    seed=self.uid,  # Each miner gets unique data based on UID
)
```

The data assignment is deterministic: miners with the same UID will always receive the same pages for a given window number.
Sources: neurons/miner.py:339-355
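Determinism comes entirely from the (seed, offset) pair. The toy sketch below is not R2DatasetLoader's actual page-selection logic; it only demonstrates the property that the same miner UID and window always map to the same page selection.

```python
# Toy illustration of deterministic page assignment (not the real
# R2DatasetLoader.next_pages logic): same seed and offset -> same pages.
import random

def pick_pages(seed: int, offset: int, n_pages: int, total_pages: int = 10_000) -> list[int]:
    rng = random.Random(f"{seed}-{offset}")  # fixed for a given miner UID and window
    return [rng.randrange(total_pages) for _ in range(n_pages)]

assert pick_pages(seed=42, offset=36, n_pages=6) == pick_pages(seed=42, offset=36, n_pages=6)
```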
Gradient Computation and Accumulation
Miners process batches of data, compute loss, and accumulate gradients:
```python
for i, batch in enumerate(loader):
    input_ids = torch.tensor(batch, dtype=torch.long).to(self.model.device)
    tokens_this_batch = input_ids.numel()
    window_tokens += tokens_this_batch
    labels = input_ids.clone()
    labels = torch.where(labels == self.tokenizer.pad_token_id, -100, labels)

    with autocast(device_type=self.model.device.type, dtype=torch.bfloat16):
        outputs = self.model(input_ids=input_ids, labels=labels)

    total_loss += outputs.loss.item()
    outputs.loss.backward()
    n_batches += 1
```

Sources: neurons/miner.py:360-384
Gradient Processing and Sharing
After computing gradients, miners process and share them with the network:
Gradient Processing Flow
```mermaid
flowchart TD
GR["Raw Gradients"] --> MU["momentum = γ*momentum + η*gradient"]
MU --> EN["transformer.encode() - DCT Transform"]
EN --> CP["compressor.compress() - Top-K Selection"]
CP --> UP["comms.put() - Upload to R2"]
UP --> PR["comms.gather() - Get Peer Gradients"]
PR --> DC["compressor.batch_decompress() - Reconstruct"]
DC --> AG["transformer.decode() - Inverse DCT"]
AG --> MO["p.grad.copy_(new_grad) - Apply Gradients"]
MO --> SG["p.grad.sign_() - Use Only Direction"]
SG --> OP["optimizer.step() - Update Model"]
OP --> SC["scheduler.step() - Update LR"]
```
Sources: neurons/miner.py:399-402, neurons/miner.py:560-601
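The last four boxes of the flow (apply, sign, optimizer step, scheduler step) amount to a sign-SGD style update on the aggregated peer gradient. A hedged sketch of that step is shown below; `aggregated_grads` is an illustrative stand-in, and the real code in neurons/miner.py works on named parameters via the transformer/compressor pair.

```python
# Hedged sketch of the final update step in the flow above; not the exact
# Templar code. `aggregated_grads` stands in for the decompressed, aggregated
# peer gradients, one tensor per model parameter.
import torch

def apply_aggregated_gradients(model, aggregated_grads, optimizer, scheduler):
    for p, new_grad in zip(model.parameters(), aggregated_grads):
        p.grad = new_grad.to(p.device)  # p.grad.copy_(new_grad) in the flow
        p.grad.sign_()                  # keep only the gradient direction
    optimizer.step()                    # update model parameters
    scheduler.step()                    # advance the learning-rate schedule
```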
Compression Techniques
Miners use two key techniques to compress gradients efficiently:
- DCT Transformation: Converts gradients to frequency domain using Discrete Cosine Transform
- Top-K Selection: Only keeps the K most significant coefficients, drastically reducing data size
This compression is essential for efficient sharing over the internet, allowing miners to exchange gradient information without prohibitive bandwidth requirements.
Sources: neurons/miner.py:131-147
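To make the top-k step concrete, the sketch below keeps only the k largest-magnitude coefficients of a tensor and reconstructs a sparse approximation. It deliberately omits the DCT stage and is not the TransformDCT/CompressDCT implementation, just an illustration of the idea.

```python
# Illustrative top-k compression/decompression (DCT stage omitted; not the
# actual TransformDCT/CompressDCT code).
import torch

def topk_compress(t: torch.Tensor, k: int):
    flat = t.flatten()
    _, idx = torch.topk(flat.abs(), k)  # positions of the k largest magnitudes
    return flat[idx], idx, t.shape      # values, indices, original shape

def topk_decompress(values, idx, shape):
    flat = torch.zeros(shape.numel(), dtype=values.dtype)
    flat[idx] = values                  # scatter kept coefficients back
    return flat.reshape(shape)

grad = torch.randn(64, 64)
vals, idx, shape = topk_compress(grad, k=32)  # 32 of 4,096 entries survive
approx = topk_decompress(vals, idx, shape)
```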
Communication System
The communication system enables miners to interact with R2 storage, validators, and other miners:
Gradient Exchange via R2
Section titled “Gradient Exchange via R2”# Upload own gradientsput_completion_time = await self.comms.put( state_dict=processed_state_dict, uid=str(self.uid), window=step_window, key="gradient", global_step=self.global_step, local=False, stale_retention=100,)
# Gather gradients from peersgather_result = await self.comms.gather( my_uid=self.uid, uids=self.comms.peers, window=step_window, key="gradient", timeout=35, device="cpu", local=False, stale_retention=100, totalks=self.totalks, time_min=time_min, time_max=time_max,)Sources: neurons/miner.py:417-427 , neurons/miner.py:489-501
Model Synchronization
Miners synchronize their model with the network using checkpoints:
- Initial Synchronization: When starting, miners load the latest checkpoint
- Catch-up Procedure: If behind, miners catch up with the aggregation server
- Periodic Checkpoints: Save model state every `checkpoint_frequency` windows
```python
if self.global_step % self.hparams.checkpoint_frequency == 0:
    asyncio.create_task(
        self.comms.save_checkpoint(
            model=self.model,
            optimizer=self.optimizer,
            scheduler=self.scheduler,
            momentum=self.momentum,
            global_step=self.global_step,
            current_window=self.current_window,
            start_window=self.start_window,
        )
    )
```

Sources: neurons/miner.py:732-747
Configuration Parameters
Miners are configured through both command-line parameters and hyperparameter settings:
Command-Line Parameters
Section titled “Command-Line Parameters”| Parameter | Description | Default |
|---|---|---|
| --netuid | Bittensor network UID | 268 |
| --device | Computing device | "cuda" |
| --debug | Enable debug logging | False |
| --trace | Enable trace logging | False |
| --test | Test mode (use all peers) | False |
| --local | Use toy model for local testing | False |
Sources: neurons/miner.py:67-106
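As a sketch of how the table above maps to code, the flags could be declared with argparse as shown below. The parser name and help strings are illustrative, and defaults are taken from the table rather than copied from neurons/miner.py.

```python
# Illustrative argparse declaration of the flags in the table above
# (defaults taken from the table; not copied from neurons/miner.py).
import argparse

parser = argparse.ArgumentParser(description="Templar miner (illustrative parser)")
parser.add_argument("--netuid", type=int, default=268, help="Bittensor network UID")
parser.add_argument("--device", type=str, default="cuda", help="Computing device")
parser.add_argument("--debug", action="store_true", help="Enable debug logging")
parser.add_argument("--trace", action="store_true", help="Enable trace logging")
parser.add_argument("--test", action="store_true", help="Test mode (use all peers)")
parser.add_argument("--local", action="store_true", help="Use toy model for local testing")
config = parser.parse_args()
```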
Hyperparameters
Key hyperparameters from hparams.json:
| Parameter | Value | Description |
|---|---|---|
| sequence_length | 2048 | Maximum sequence length for training |
| pages_per_window | 6 | Number of data pages per window |
| batch_size | 6 | Batch size for training |
| learning_rate | 4e-4 | Initial learning rate |
| blocks_per_window | 7 | Number of blockchain blocks per window |
| momentum_decay | 0.999 | Decay rate for momentum |
| topk_compression | 32 | Top-K value for gradient compression |
| target_chunk | 64 | Chunk size for DCT transform |
| checkpoint_frequency | 100 | Windows between checkpoint saves |
Sources: hparams.json:1-53
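Since these values live in hparams.json, they can be inspected directly. The snippet below just loads the file and derives the per-batch token count implied by the table; the project loads hyperparameters through its own helper, so this is purely illustrative.

```python
# Illustrative: read hparams.json directly and derive a figure from the table.
import json

with open("hparams.json") as f:
    hparams = json.load(f)

tokens_per_batch = hparams["batch_size"] * hparams["sequence_length"]  # 6 * 2048 = 12,288
print(hparams["learning_rate"], hparams["topk_compression"], tokens_per_batch)
```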
Performance Monitoring
Miners track various metrics to monitor performance:
```python
self.wandb.log(
    {
        # Training metrics
        "miner/loss": total_loss / n_batches if n_batches > 0 else 0,
        "miner/tokens_per_sec": tokens_per_sec,
        "miner/batch_duration": duration,
        "miner/total_tokens": self.total_tokens_processed,
        "miner/batch_tokens": window_tokens,
        "miner/global_step": self.global_step,
        # Resource metrics
        "miner/gpu_memory_allocated": torch.cuda.memory_allocated() / 1024**2,
        "miner/gpu_memory_cached": torch.cuda.memory_reserved() / 1024**2,
        # Network metrics
        "miner/gather_peers": len(self.comms.peers),
        "miner/effective_batch_size": len(self.comms.peers) * self.hparams.batch_size,
        # Optimization metrics
        "miner/learning_rate": self.scheduler.get_last_lr()[0],
        # Gradient statistics
        "miner/mean_grad_norm": sum(grad_norms) / len(grad_norms) if grad_norms else 0,
        "miner/max_grad_norm": max(grad_norms) if grad_norms else 0,
        "miner/min_grad_norm": min(grad_norms) if grad_norms else 0,
        "miner/grad_norm_std": torch.tensor(grad_norms).std().item() if grad_norms else 0,
        "miner/mean_weight_norm": sum(weight_norms) / len(weight_norms),
        "miner/mean_momentum_norm": sum(momentum_norms) / len(momentum_norms),
    },
    step=self.global_step,
)
```

Sources: neurons/miner.py:518-552
Hardware Requirements
To run a miner effectively, you need:
- GPU: NVIDIA H100 with 80GB VRAM recommended
- Storage: 100GB+ for model and data
- Network: Stable internet connection with good bandwidth
Sources: docs/miner.md:369-373
Integration with Validators
Miners work in tandem with validators, who:
- Gather and evaluate miners’ gradients
- Compute scores based on improvement in loss
- Set weights on the blockchain
- Determine reward distribution
For more details on validators and the evaluation process, see Validators and Weight Setting.
Sources: neurons/validator.py:489-516
Running a Miner
For detailed setup and running instructions, refer to the documentation in docs/miner.md. This includes:
- Installing dependencies
- Setting up R2 bucket credentials
- Configuring Bittensor wallet
- Running via Docker or directly with Python
- Monitoring performance and troubleshooting
Sources: docs/miner.md:32-302