R2 Storage
This page documents the Cloudflare R2 storage system used in the Templar framework for distributed data exchange. R2 Storage provides reliable object storage that enables efficient sharing of gradients, datasets, checkpoints, and aggregated model data between distributed nodes in the network.
Overview of R2 Storage in Templar
Templar uses Cloudflare R2 as its primary storage backend for several critical components of the distributed training ecosystem. R2 Storage serves as the communication medium for exchanging large volumes of data that cannot be efficiently transmitted through the blockchain directly.
```mermaid
flowchart TD
    subgraph "R2 Storage System"
        direction TB
        R2["Cloudflare R2"]
        subgraph "R2 Buckets"
            GradBucket["Gradients Bucket"]
            DataBucket["Dataset Bucket"]
            AggBucket["Aggregator Bucket"]
        end
        R2 --- GradBucket
        R2 --- DataBucket
        R2 --- AggBucket
    end
    subgraph "Network Nodes"
        Miners["Miners"]
        Validators["Validators"]
        Aggregator["Aggregation Server"]
    end
    Miners -- "Upload gradients" --> GradBucket
    Miners -- "Download datasets" --> DataBucket
    Validators -- "Download gradients" --> GradBucket
    Validators -- "Download aggregated state" --> AggBucket
    Aggregator -- "Upload aggregated state" --> AggBucket
```
Sources: src/tplr/config.py:27-135 , src/tplr/r2_dataset.py:33-45
Bucket Structure
Templar uses three primary R2 buckets, each with a distinct purpose:
- Gradients Bucket: Stores gradient updates computed by miners. These are compressed via a DCT transform to minimize storage requirements and transmission overhead.
- Dataset Bucket: Contains training data in Parquet format, organized by collections and shards. Used by miners to load training data.
- Aggregator Bucket: Stores aggregated model states collected from multiple miners' contributions.
Each bucket follows its own file organization and naming patterns, based on its purpose:
```mermaid
flowchart TD
    subgraph "Gradients Bucket Contents"
        direction TB
        GradFiles["gradient-{window}-{step}-{version}.pt\n(Compressed gradient data)"]
        CheckFiles["checkpoint-{version}-{step}.pt\n(Model checkpoints)"]
        StartFiles["start_window-{uid}-{step}.json\n(Window initializations)"]
    end
    subgraph "Dataset Bucket Contents"
        direction TB
        DataDir["HuggingFaceFW_fineweb-edu-score-2/"]
        MetadataFile["_metadata.yaml\n(Dataset configuration)"]
        ShardSizes["_shard_sizes.json\n(Row counts per shard)"]
        ParquetFiles["{config_name}/{split}/train-{shard}.parquet\n(Actual training data)"]
        DataDir --> MetadataFile
        DataDir --> ShardSizes
        DataDir --> ParquetFiles
    end
    subgraph "Aggregator Bucket Contents"
        direction TB
        AggFiles["gathers/{version}/{step}-{hash}.npz\n(Aggregated gradient data)"]
    end
```
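To make the naming patterns concrete, the sketch below assembles object keys in Python. All field values (`window`, `step`, `uid`, `version`, `hash`) are illustrative placeholders; the exact key construction in Templar may differ.

```python
# Hypothetical key construction following the patterns above.
# All field values are illustrative placeholders.
window, step, uid, version = 4210, 17, 42, "0.2.71"
content_hash = "ab12cd34"  # stands in for the {hash} field

gradient_key = f"gradient-{window}-{step}-{version}.pt"
checkpoint_key = f"checkpoint-{version}-{step}.pt"
start_window_key = f"start_window-{uid}-{step}.json"
gather_key = f"gathers/{version}/{step}-{content_hash}.npz"

print(gradient_key)  # gradient-4210-17-0.2.71.pt
print(gather_key)    # gathers/0.2.71/17-ab12cd34.npz
```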
Sources: _metadata.yaml:1-453 , _shard_sizes.json:1-467 , scripts/cleanup_bucket.py:75-81 , scripts/delete_gather.py:66-87
Configuration and Authentication
Templar accesses R2 through environment variables that provide authentication credentials and bucket information. Each bucket has separate read and write credentials to enforce proper access control.
Environment Variable Structure
```mermaid
flowchart LR
    subgraph "R2 Configuration Environment Variables"
        direction TB
        GradEnv["R2_GRADIENTS_*"]
        DataEnv["R2_DATASET_*"]
        AggEnv["R2_AGGREGATOR_*"]
        subgraph "Per-Bucket Variables"
            AccID["ACCOUNT_ID\n(R2 account identifier)"]
            BucketName["BUCKET_NAME\n(Bucket name)"]
            ReadKey["READ_ACCESS_KEY_ID\n(Read-only credentials)"]
            ReadSecret["READ_SECRET_ACCESS_KEY"]
            WriteKey["WRITE_ACCESS_KEY_ID\n(Write credentials)"]
            WriteSecret["WRITE_SECRET_ACCESS_KEY"]
        end
        GradEnv --> AccID & BucketName & ReadKey & ReadSecret & WriteKey & WriteSecret
        DataEnv --> AccID & BucketName & ReadKey & ReadSecret & WriteKey & WriteSecret
        AggEnv --> AccID & BucketName & ReadKey & ReadSecret & WriteKey & WriteSecret
    end
```
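A minimal sketch of reading one bucket's variables from the environment is shown below. The helper name `load_bucket_config` is illustrative; the actual parsing lives in src/tplr/config.py.

```python
import os

def load_bucket_config(prefix: str) -> dict:
    """Collect the per-bucket R2 variables for one prefix, e.g.
    "R2_GRADIENTS". Illustrative sketch only; see src/tplr/config.py
    for the real implementation."""
    return {
        "account_id": os.environ[f"{prefix}_ACCOUNT_ID"],
        "bucket_name": os.environ[f"{prefix}_BUCKET_NAME"],
        "read": {
            "access_key_id": os.environ[f"{prefix}_READ_ACCESS_KEY_ID"],
            "secret_access_key": os.environ[f"{prefix}_READ_SECRET_ACCESS_KEY"],
        },
        "write": {
            "access_key_id": os.environ[f"{prefix}_WRITE_ACCESS_KEY_ID"],
            "secret_access_key": os.environ[f"{prefix}_WRITE_SECRET_ACCESS_KEY"],
        },
    }
```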
Authentication Flow
```mermaid
sequenceDiagram
    participant Node as "Templar Node"
    participant Config as "BUCKET_SECRETS"
    participant S3Client as "S3 Client"
    participant R2 as "Cloudflare R2"
    Node->>Config: Request credentials for bucket type
    Config->>Node: Return credentials based on operation type (read/write)
    Node->>S3Client: Initialize client with appropriate credentials
    S3Client->>R2: Authenticate and establish connection
    Note over Node,R2: For read operations (e.g., dataset loading)
    Node->>S3Client: Request object
    S3Client->>R2: GET object with read credentials
    R2->>S3Client: Return object data
    S3Client->>Node: Deliver object data
    Note over Node,R2: For write operations (e.g., gradient uploads)
    Node->>S3Client: Upload object
    S3Client->>R2: PUT object with write credentials
    R2->>S3Client: Confirm upload
    S3Client->>Node: Return success status
```
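In code, this flow reduces to constructing an S3-compatible client against the R2 endpoint. The sketch below uses boto3 with the gradients-bucket read credentials; it is an illustration under stated assumptions, not the Templar client itself, and the object key is a placeholder.

```python
import os
import boto3

# Sketch: build a read-only client for the gradients bucket.
# R2 exposes an S3-compatible endpoint at <account_id>.r2.cloudflarestorage.com.
account_id = os.environ["R2_GRADIENTS_ACCOUNT_ID"]
s3 = boto3.client(
    "s3",
    endpoint_url=f"https://{account_id}.r2.cloudflarestorage.com",
    aws_access_key_id=os.environ["R2_GRADIENTS_READ_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["R2_GRADIENTS_READ_SECRET_ACCESS_KEY"],
)

# GET with read credentials; writes would use a second client built
# from the WRITE_* variables.
response = s3.get_object(
    Bucket=os.environ["R2_GRADIENTS_BUCKET_NAME"],
    Key="start_window-42-17.json",  # placeholder key
)
payload = response["Body"].read()
```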
Sources: src/tplr/config.py:28-134 , scripts/validate_r2_access.py:27-153
Multiple Dataset Endpoints Support
The R2 configuration system supports multiple dataset endpoints for load balancing and improved reliability. This feature enables Templar to distribute dataset access across multiple R2 locations.
```mermaid
flowchart TD
    subgraph "R2DatasetLoader Multiple Endpoint Handling"
        direction TB
        Config["BUCKET_SECRETS['dataset']['multiple']"]
        RoundRobin["Round Robin Selection (_round_robin_index)"]
        FSCache["Filesystem Cache (_fs_cache)"]
        Config --> RoundRobin
        RoundRobin --> |"Select endpoint"| FSCache
        FSCache --> |"Cached connection"| S3FileSystem
    end
    subgraph "Dataset Access"
        DataNode1["Dataset Bucket 1"]
        DataNode2["Dataset Bucket 2"]
        DataNode3["Dataset Bucket 3"]
    end
    S3FileSystem --> DataNode1
    S3FileSystem --> DataNode2
    S3FileSystem --> DataNode3
```
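The sketch below shows one way to implement thread-safe round-robin selection over the configured endpoints. The class and method names are hypothetical stand-ins for the loader's `_round_robin_index` logic.

```python
import threading

class EndpointSelector:
    """Hypothetical round-robin endpoint selector, illustrating the
    pattern behind R2DatasetLoader's _round_robin_index handling."""

    def __init__(self, endpoints: list[dict]):
        self._endpoints = endpoints
        self._index = 0
        self._lock = threading.Lock()  # thread-safe access pattern

    def next_endpoint(self) -> dict:
        with self._lock:
            endpoint = self._endpoints[self._index % len(self._endpoints)]
            self._index += 1
        return endpoint

# Usage (illustrative): selector = EndpointSelector(BUCKET_SECRETS["dataset"]["multiple"])
```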
Sources: src/tplr/r2_dataset.py:339-369 , src/tplr/config.py:89-109
R2DatasetLoader
The R2DatasetLoader class is a specialized component for loading training data from R2 storage. It's designed to efficiently load, cache, and process Parquet files containing training text data.
Dataset Loading Process
```mermaid
sequenceDiagram
    participant App as "Templar Node"
    participant Loader as "R2DatasetLoader"
    participant Cache as "Local Cache"
    participant R2 as "R2 Dataset Bucket"
    App->>Loader: Request pages with seed
    Loader->>Loader: Generate random page selection
    Loader->>Cache: Check for cached metadata
    alt Metadata not in cache
        Loader->>R2: Fetch _metadata.yaml and _shard_sizes.json
        R2->>Loader: Return metadata files
        Loader->>Cache: Store metadata
    end
    loop For each requested page
        Loader->>Loader: Compute exact shard and offset
        Loader->>Cache: Check for cached ParquetFile
        alt ParquetFile not in cache
            Loader->>R2: Open Parquet file with retries
            R2->>Loader: Return file handle
            Loader->>Cache: Cache ParquetFile
        end
        Loader->>R2: Read specific row group
        R2->>Loader: Return row data
        Loader->>Loader: Extract and tokenize text
        Loader->>Cache: Cache tokenized result
    end
    Loader->>App: Return processed text batches
```
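The key bandwidth-saving step is reading only the row group that contains the requested page. A minimal sketch of this pattern using s3fs and pyarrow follows; the path, credentials, and column name are placeholders, not the loader's actual configuration.

```python
import pyarrow.parquet as pq
import s3fs

# Sketch: open one Parquet shard on R2 and read a single row group.
fs = s3fs.S3FileSystem(
    key="<READ_ACCESS_KEY_ID>",
    secret="<READ_SECRET_ACCESS_KEY>",
    client_kwargs={"endpoint_url": "https://<account_id>.r2.cloudflarestorage.com"},
)

# Placeholder path following the {config_name}/{split}/train-{shard}.parquet pattern.
path = "dataset-bucket/default/train/train-00000.parquet"
with fs.open(path, "rb") as f:
    pf = pq.ParquetFile(f)        # the file handle can be cached and reused
    table = pf.read_row_group(0)  # fetch one row group, not the whole file
    texts = table.column("text").to_pylist()  # assumes a "text" column
```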
Performance Optimizations
The R2DatasetLoader implements numerous optimizations to efficiently handle distributed dataset access:
- Multi-level caching:
  - Filesystem instance caching
  - Parquet file caching
  - Tokenized result caching
  - Metadata caching
- Distributed load balancing:
  - Round-robin selection of multiple dataset endpoints
  - Thread-safe access patterns
- Resilient operation:
  - Retries with exponential backoff (see the sketch after this list)
  - Connection pooling
  - Error handling for transient failures
- Memory and bandwidth efficiency:
  - Reading specific row groups instead of entire files
  - Parallel tokenization and processing
  - Optimized buffer sizes
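A retry-with-exponential-backoff helper could look like the following; this is a generic sketch of the pattern, not the loader's actual retry code.

```python
import random
import time

def with_retries(fn, max_attempts: int = 5, base_delay: float = 0.5):
    """Generic exponential-backoff retry helper (illustrative only)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # transient-failure budget exhausted
            # Back off exponentially, with jitter to avoid thundering herds.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```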
Sources: src/tplr/r2_dataset.py:33-594 , tests/test_r2_loader.py:64-220
Storage Management and Maintenance
Templar includes utility scripts for maintaining R2 storage:
Bucket Maintenance Tools
- cleanup_bucket.py: Deletes temporary files such as checkpoints, gradients, and start window markers.
- delete_gather.py: Removes aggregated gradient data for specific versions.
- s3_manager.py: General-purpose R2 bucket management tool that supports:
  - Deleting objects older than X hours
  - Deleting objects with specific prefixes or suffixes
  - Wiping buckets (with confirmation prompts)
  - Using different credential sets for different buckets
```mermaid
flowchart TD
    subgraph "R2 Storage Maintenance Tools"
        direction TB
        CleanupBucket["cleanup_bucket.py\n(Clean temporary files)"]
        DeleteGather["delete_gather.py\n(Remove version-specific data)"]
        S3Manager["s3_manager.py\n(General bucket management)"]
        S3Manager -->|"--delete-old"| DeleteOld["Delete objects older than X hours"]
        S3Manager -->|"--prefix"| DeletePrefix["Delete objects with specific prefix"]
        S3Manager -->|"--suffix"| DeleteSuffix["Delete objects with specific suffix"]
        S3Manager -->|"--wipe-bucket"| Wipe["Delete ALL objects (dangerous)"]
    end
    subgraph "Environment Configuration"
        EnvVars["Environment Variables"]
    end
    EnvVars --> CleanupBucket & DeleteGather & S3Manager
```
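To illustrate the age-based cleanup these tools perform, the sketch below deletes objects older than a cutoff using a boto3 client (as built in the authentication example); it is not the code in scripts/s3_manager.py.

```python
from datetime import datetime, timedelta, timezone

def delete_older_than(s3, bucket: str, hours: int) -> None:
    """Delete every object whose LastModified predates the cutoff.
    Illustrative sketch of the --delete-old behavior described above."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        stale = [
            {"Key": obj["Key"]}
            for obj in page.get("Contents", [])
            if obj["LastModified"] < cutoff
        ]
        if stale:  # delete_objects accepts up to 1000 keys per call
            s3.delete_objects(Bucket=bucket, Delete={"Objects": stale})
```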
Sources: scripts/cleanup_bucket.py:32-114 , scripts/s3_manager.py:15-441 , scripts/delete_gather.py:31-116 , scripts/clean_versions.py:29-120
Integration with Templar Components
The R2 storage system integrates closely with the core components of the Templar framework:
```mermaid
flowchart TD
    subgraph "R2 Storage Integration"
        direction TB
        R2["Cloudflare R2"]
        GradBucket["Gradients Bucket"]
        DataBucket["Dataset Bucket"]
        AggBucket["Aggregator Bucket"]
    end
    subgraph "Miner Operations"
        MinerTrain["Model Training"]
        GradComp["Gradient Computation"]
        GradCompress["Gradient Compression"]
        DataLoad["R2DatasetLoader"]
    end
    subgraph "Validator Operations"
        GradDecomp["Gradient Decompression"]
        EvalGrad["Gradient Evaluation"]
        SetWeights["Set Weights on Chain"]
    end
    subgraph "Aggregator Operations"
        Gather["Gather Gradients"]
        Aggregate["Aggregate Updates"]
        StoreAgg["Store Aggregated State"]
    end
    DataBucket -->|"Load training data"| DataLoad
    DataLoad -->|"Tokenized text"| MinerTrain
    MinerTrain -->|"Model updates"| GradComp
    GradComp -->|"Gradient tensors"| GradCompress
    GradCompress -->|"Compressed gradients"| GradBucket
    GradBucket -->|"Download gradients"| GradDecomp
    GradDecomp -->|"Reconstructed gradients"| EvalGrad
    EvalGrad -->|"Quality score"| SetWeights
    GradBucket -->|"Download multiple gradients"| Gather
    Gather -->|"Combined gradients"| Aggregate
    Aggregate -->|"Aggregated state"| StoreAgg
    StoreAgg -->|"Upload aggregated model"| AggBucket
```
Sources: src/tplr/config.py:27-135 , src/tplr/r2_dataset.py:33-45
Security Considerations
The R2 storage system implements several security measures:
- Separate read and write credentials:
  - Read-only credentials for operations that only need to fetch data
  - Write credentials restricted to components that need to modify data
- Access validation:
  - The validate_r2_access.py script verifies access permissions
  - Tests confirm correct isolation between read and write permissions
- Environment variable management:
  - Credentials are stored in environment variables, not hard-coded
  - Required variables are checked at startup
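A startup check along these lines might look as follows; the variable list is illustrative, and the real validation lives in src/tplr/config.py and scripts/validate_r2_access.py.

```python
import os

# Illustrative startup check: fail fast if any required variable is absent.
REQUIRED_VARS = [
    "R2_GRADIENTS_ACCOUNT_ID",
    "R2_GRADIENTS_BUCKET_NAME",
    "R2_GRADIENTS_READ_ACCESS_KEY_ID",
    "R2_GRADIENTS_READ_SECRET_ACCESS_KEY",
    "R2_GRADIENTS_WRITE_ACCESS_KEY_ID",
    "R2_GRADIENTS_WRITE_SECRET_ACCESS_KEY",
]

missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing R2 environment variables: {', '.join(missing)}")
```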
Sources: scripts/validate_r2_access.py:25-153 , src/tplr/config.py:111-133
Troubleshooting and Maintenance
Common R2 storage issues and their solutions:
| Issue | Possible Cause | Solution |
|---|---|---|
| Missing data files | Incorrect bucket configuration | Verify environment variables are correctly set |
| Access denied errors | Invalid or expired credentials | Update R2 tokens and verify with validate_r2_access.py |
| Slow data loading | Network congestion or high latency | Implement additional caching or add more dataset endpoints |
| Out of storage space | Accumulated gradient or checkpoint files | Run cleanup scripts to remove old objects |
| Timeout errors | Connection issues | Increase retry attempts and backoff in config |
Sources: scripts/validate_r2_access.py:25-153 , tests/test_r2_loader.py:311-438
Related Pages
For information about how dataset loading works beyond the R2 storage layer, see Data Management.
For details on how gradient sharing occurs within the network, see Gradient Processing.
For information on checkpoint management using R2 storage, see Checkpoint Management.