R2 Storage

This page documents the Cloudflare R2 storage system used in the Templar framework for distributed data exchange. R2 Storage provides reliable object storage that enables efficient sharing of gradients, datasets, checkpoints, and aggregated model data between distributed nodes in the network.

Templar uses Cloudflare R2 as its primary storage backend for several critical components of the distributed training ecosystem. R2 serves as the communication medium for exchanging large volumes of data that cannot be transmitted efficiently through the blockchain itself.

flowchart TD
    subgraph "R2 Storage System"
        direction TB
        R2["Cloudflare R2"]
        
        subgraph "R2 Buckets"
            GradBucket["Gradients Bucket"]
            DataBucket["Dataset Bucket"]
            AggBucket["Aggregator Bucket"]
        end
        
        R2 --- GradBucket
        R2 --- DataBucket
        R2 --- AggBucket
    end
    
    subgraph "Network Nodes"
        Miners["Miners"]
        Validators["Validators"]
        Aggregator["Aggregation Server"]
    end
    
    Miners -- "Upload gradients" --> GradBucket
    Miners -- "Download datasets" --> DataBucket
    Validators -- "Download gradients" --> GradBucket
    Validators -- "Download aggregated state" --> AggBucket
    Aggregator -- "Upload aggregated state" --> AggBucket

Sources: src/tplr/config.py:27-135, src/tplr/r2_dataset.py:33-45

Templar uses three primary R2 buckets, each with distinct purposes:

  1. Gradients Bucket: Stores gradient updates computed by miners. These are compressed via a DCT transform to minimize storage requirements and transmission overhead.

  2. Dataset Bucket: Contains training data in Parquet format, organized by collections and shards. Miners use it to load training data.

  3. Aggregator Bucket: Stores aggregated model states collected from multiple miners’ contributions.
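
In code, this bucket configuration is exposed through the BUCKET_SECRETS mapping referenced in the diagrams below. A minimal sketch of its likely shape, assuming the field names mirror the per-bucket environment variables documented later on this page (only the top-level keys "gradients", "dataset", "aggregator" and dataset "multiple" appear on this page; the rest is an assumption):

# Hypothetical shape of the BUCKET_SECRETS mapping assembled by src/tplr/config.py.
# Field names here are assumptions that mirror the R2_* environment variables.
BUCKET_SECRETS = {
    "gradients": {
        "account_id": "<R2_GRADIENTS_ACCOUNT_ID>",
        "name": "<R2_GRADIENTS_BUCKET_NAME>",
        "credentials": {
            "read": {
                "access_key_id": "<R2_GRADIENTS_READ_ACCESS_KEY_ID>",
                "secret_access_key": "<R2_GRADIENTS_READ_SECRET_ACCESS_KEY>",
            },
            "write": {
                "access_key_id": "<R2_GRADIENTS_WRITE_ACCESS_KEY_ID>",
                "secret_access_key": "<R2_GRADIENTS_WRITE_SECRET_ACCESS_KEY>",
            },
        },
    },
    "dataset": {
        # same per-bucket fields as "gradients", plus a list of extra
        # endpoints used for round-robin load balancing (see below):
        "multiple": [],
    },
    "aggregator": {
        # same per-bucket fields as "gradients"
    },
}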

Each bucket follows a specific file organization and naming pattern suited to its purpose:

flowchart TD
    subgraph "Gradients Bucket Contents"
        direction TB
        GradFiles["gradient-{window}-{step}-{version}.pt\n(Compressed gradient data)"]
        CheckFiles["checkpoint-{version}-{step}.pt\n(Model checkpoints)"]
        StartFiles["start_window-{uid}-{step}.json\n(Window initializations)"]
    end
    
    subgraph "Dataset Bucket Contents"
        direction TB
        DataDir["HuggingFaceFW_fineweb-edu-score-2/"]
        MetadataFile["_metadata.yaml\n(Dataset configuration)"]
        ShardSizes["_shard_sizes.json\n(Row counts per shard)"]
        ParquetFiles["{config_name}/{split}/train-{shard}.parquet\n(Actual training data)"]
        
        DataDir --> MetadataFile
        DataDir --> ShardSizes
        DataDir --> ParquetFiles
    end
    
    subgraph "Aggregator Bucket Contents"
        direction TB
        AggFiles["gathers/{version}/{step}-{hash}.npz\n(Aggregated gradient data)"]
    end

Sources: _metadata.yaml:1-453, _shard_sizes.json:1-467, scripts/cleanup_bucket.py:75-81, scripts/delete_gather.py:66-87
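
To make these naming patterns concrete, here is an illustrative sketch of how object keys matching them could be assembled. The helper functions are hypothetical and are not part of the Templar codebase:

# Hypothetical helpers illustrating the object-key patterns shown above.

def gradient_key(window: int, step: int, version: str) -> str:
    """Key for a compressed gradient in the gradients bucket."""
    return f"gradient-{window}-{step}-{version}.pt"

def checkpoint_key(version: str, step: int) -> str:
    """Key for a model checkpoint in the gradients bucket."""
    return f"checkpoint-{version}-{step}.pt"

def gather_key(version: str, step: int, content_hash: str) -> str:
    """Key for aggregated gradient data in the aggregator bucket."""
    return f"gathers/{version}/{step}-{content_hash}.npz"

# Example: gradient_key(42, 7, "1.0.0") -> "gradient-42-7-1.0.0.pt"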

Templar configures R2 access through environment variables that provide authentication credentials and bucket identifiers. Each bucket has separate read and write credentials to enforce proper access control.

flowchart LR
    subgraph "R2 Configuration Environment Variables"
        direction TB
        GradEnv["R2_GRADIENTS_*"]
        DataEnv["R2_DATASET_*"]
        AggEnv["R2_AGGREGATOR_*"]
        
        subgraph "Per-Bucket Variables"
            AccID["ACCOUNT_ID\n(R2 account identifier)"]
            BucketName["BUCKET_NAME\n(Bucket name)"]
            ReadKey["READ_ACCESS_KEY_ID\n(Read-only credentials)"]
            ReadSecret["READ_SECRET_ACCESS_KEY"]
            WriteKey["WRITE_ACCESS_KEY_ID\n(Write credentials)"]
            WriteSecret["WRITE_SECRET_ACCESS_KEY"]
        end
        
        GradEnv --> AccID & BucketName & ReadKey & ReadSecret & WriteKey & WriteSecret
        DataEnv --> AccID & BucketName & ReadKey & ReadSecret & WriteKey & WriteSecret
        AggEnv --> AccID & BucketName & ReadKey & ReadSecret & WriteKey & WriteSecret
    end

sequenceDiagram
    participant Node as "Templar Node"
    participant Config as "BUCKET_SECRETS"
    participant S3Client as "S3 Client"
    participant R2 as "Cloudflare R2"
    
    Node->>Config: Request credentials for bucket type
    Config->>Node: Return credentials based on operation type (read/write)
    
    Node->>S3Client: Initialize client with appropriate credentials
    S3Client->>R2: Authenticate and establish connection
    
    Note over Node,R2: For read operations (e.g., dataset loading)
    Node->>S3Client: Request object
    S3Client->>R2: GET object with read credentials
    R2->>S3Client: Return object data
    S3Client->>Node: Deliver object data
    
    Note over Node,R2: For write operations (e.g., gradient uploads)
    Node->>S3Client: Upload object
    S3Client->>R2: PUT object with write credentials
    R2->>S3Client: Confirm upload
    S3Client->>Node: Return success status

Sources: src/tplr/config.py:28-134, scripts/validate_r2_access.py:27-153
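
The sequence above can be sketched in code as follows, assuming boto3 and the R2_GRADIENTS_* variables from the diagram. Cloudflare R2 exposes an S3-compatible endpoint at https://<account_id>.r2.cloudflarestorage.com; Templar's actual client construction lives in src/tplr/config.py and may differ in detail:

import os

import boto3

# Minimal sketch, assuming boto3 and the R2_GRADIENTS_* variables shown above.
ACCOUNT_ID = os.environ["R2_GRADIENTS_ACCOUNT_ID"]
BUCKET = os.environ["R2_GRADIENTS_BUCKET_NAME"]

def r2_client(operation: str):
    """Return an S3 client scoped to read or write credentials."""
    prefix = "READ" if operation == "read" else "WRITE"
    return boto3.client(
        "s3",
        endpoint_url=f"https://{ACCOUNT_ID}.r2.cloudflarestorage.com",
        aws_access_key_id=os.environ[f"R2_GRADIENTS_{prefix}_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ[f"R2_GRADIENTS_{prefix}_SECRET_ACCESS_KEY"],
        region_name="auto",  # R2 accepts "auto" as the region name
    )

# Read path: GET an object with the read-only credentials.
# The key below is a hypothetical example following the patterns above.
data = r2_client("read").get_object(Bucket=BUCKET, Key="start_window-1-100.json")

# Write path: PUT an object with the write credentials.
r2_client("write").put_object(Bucket=BUCKET, Key="example.txt", Body=b"hello")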

The R2 configuration system supports multiple dataset endpoints for load balancing and improved reliability. This feature enables Templar to distribute dataset access across multiple R2 locations.

flowchart TD
    subgraph "R2DatasetLoader Multiple Endpoint Handling"
        direction TB
        Config["BUCKET_SECRETS['dataset']['multiple']"]
        RoundRobin["Round Robin Selection (_round_robin_index)"]
        FSCache["Filesystem Cache (_fs_cache)"]
        
        Config --> RoundRobin
        RoundRobin --> |"Select endpoint"| FSCache
        FSCache --> |"Cached connection"| S3FileSystem
    end
    
    subgraph "Dataset Access"
        DataNode1["Dataset Bucket 1"]
        DataNode2["Dataset Bucket 2"]
        DataNode3["Dataset Bucket 3"]
    end
    
    S3FileSystem --> DataNode1
    S3FileSystem --> DataNode2
    S3FileSystem --> DataNode3

Sources: src/tplr/r2_dataset.py:339-369, src/tplr/config.py:89-109
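
A minimal sketch of the round-robin pattern the diagram describes, assuming s3fs. The names _round_robin_index and _fs_cache come from the diagram; the surrounding code, and the flat shape of each endpoint entry, are illustrative assumptions rather than Templar's own implementation:

import threading

import s3fs

# Round-robin selection over the endpoints in BUCKET_SECRETS["dataset"]["multiple"],
# with one cached S3FileSystem per endpoint. Each entry is assumed to carry an
# account id and read credentials.
_lock = threading.Lock()
_round_robin_index = 0
_fs_cache: dict[str, s3fs.S3FileSystem] = {}

def next_filesystem(endpoints: list[dict]) -> s3fs.S3FileSystem:
    """Pick the next endpoint (thread-safely) and return a cached filesystem."""
    global _round_robin_index
    with _lock:
        cfg = endpoints[_round_robin_index % len(endpoints)]
        _round_robin_index += 1
        url = f"https://{cfg['account_id']}.r2.cloudflarestorage.com"
        if url not in _fs_cache:
            _fs_cache[url] = s3fs.S3FileSystem(
                key=cfg["access_key_id"],
                secret=cfg["secret_access_key"],
                client_kwargs={"endpoint_url": url},
            )
        return _fs_cache[url]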

The R2DatasetLoader class is a specialized component for loading training data from R2 storage. It’s designed to efficiently load, cache, and process Parquet files containing training text data.

sequenceDiagram
    participant App as "Templar Node"
    participant Loader as "R2DatasetLoader"
    participant Cache as "Local Cache"
    participant R2 as "R2 Dataset Bucket"
    
    App->>Loader: Request pages with seed
    Loader->>Loader: Generate random page selection
    
    Loader->>Cache: Check for cached metadata
    alt Metadata not in cache
        Loader->>R2: Fetch _metadata.yaml and _shard_sizes.json
        R2->>Loader: Return metadata files
        Loader->>Cache: Store metadata
    end
    
    loop For each requested page
        Loader->>Loader: Compute exact shard and offset
        Loader->>Cache: Check for cached ParquetFile
        
        alt ParquetFile not in cache
            Loader->>R2: Open Parquet file with retries
            R2->>Loader: Return file handle
            Loader->>Cache: Cache ParquetFile
        end
        
        Loader->>R2: Read specific row group
        R2->>Loader: Return row data
        Loader->>Loader: Extract and tokenize text
        Loader->>Cache: Cache tokenized result
    end
    
    Loader->>App: Return processed text batches

The R2DatasetLoader implements numerous optimizations to efficiently handle distributed dataset access:

  1. Multi-level caching:

    • Filesystem instance caching
    • Parquet file caching
    • Tokenized result caching
    • Metadata caching
  2. Distributed load balancing:

    • Round-robin selection of multiple dataset endpoints
    • Thread-safe access patterns
  3. Resilient operation (see the sketch below):

    • Retries with exponential backoff
    • Connection pooling
    • Error handling for transient failures
  4. Memory and bandwidth efficiency:

    • Read specific row groups instead of entire files
    • Parallel tokenization and processing
    • Optimized buffer sizes

Sources: src/tplr/r2_dataset.py:33-594, tests/test_r2_loader.py:64-220
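
To illustrate points 3 and 4 above, here is a hedged sketch of reading a single row group from a remote Parquet shard with exponential-backoff retries, assuming pyarrow and an s3fs filesystem like the one sketched earlier. The real logic in src/tplr/r2_dataset.py is considerably more involved, and the "text" column name is an assumption:

import time

import pyarrow.parquet as pq

def read_row_group(fs, path: str, row_group: int, retries: int = 3):
    """Read one row group from a Parquet shard, retrying transient errors."""
    delay = 1.0
    for attempt in range(retries):
        try:
            with fs.open(path, "rb") as f:
                # Reading one row group avoids downloading the entire shard.
                return pq.ParquetFile(f).read_row_group(row_group, columns=["text"])
        except OSError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
            delay *= 2  # exponential backoff between retries

# Example (hypothetical path):
# table = read_row_group(fs, "dataset-bucket/config/train/train-0000.parquet", 0)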

Templar includes utility scripts for maintaining R2 storage:

  1. cleanup_bucket.py: Deletes temporary files like checkpoints, gradients, and start window markers.

  2. delete_gather.py: Removes aggregated gradient data from specific versions.

  3. s3_manager.py: General-purpose R2 bucket management tool with features for:

    • Deleting objects older than X hours
    • Deleting objects with specific prefixes or suffixes
    • Wiping buckets (with confirmation prompts)
    • Supporting different credential sets for different buckets

flowchart TD
    subgraph "R2 Storage Maintenance Tools"
        direction TB
        CleanupBucket["cleanup_bucket.py\n(Clean temporary files)"]
        DeleteGather["delete_gather.py\n(Remove version-specific data)"]
        S3Manager["s3_manager.py\n(General bucket management)"]
        
        S3Manager -->|"--delete-old"| DeleteOld["Delete objects older than X hours"]
        S3Manager -->|"--prefix"| DeletePrefix["Delete objects with specific prefix"]
        S3Manager -->|"--suffix"| DeleteSuffix["Delete objects with specific suffix"]
        S3Manager -->|"--wipe-bucket"| Wipe["Delete ALL objects (dangerous)"]
    end
    
    subgraph "Environment Configuration"
        EnvVars["Environment Variables"]
    end
    
    EnvVars --> CleanupBucket & DeleteGather & S3Manager

Sources: scripts/cleanup_bucket.py:32-114, scripts/s3_manager.py:15-441, scripts/delete_gather.py:31-116, scripts/clean_versions.py:29-120
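
As an illustration of the --delete-old mode, here is a hedged sketch using boto3 pagination; it is written in the spirit of s3_manager.py but is not that script's actual implementation:

from datetime import datetime, timedelta, timezone

# "Delete objects older than X hours": list the bucket page by page and
# batch-delete every object whose LastModified precedes the cutoff.
def delete_older_than(client, bucket: str, hours: int) -> int:
    cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
    deleted = 0
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        stale = [
            {"Key": obj["Key"]}
            for obj in page.get("Contents", [])
            if obj["LastModified"] < cutoff
        ]
        if stale:
            # delete_objects accepts up to 1000 keys; one listing page fits.
            client.delete_objects(Bucket=bucket, Delete={"Objects": stale})
            deleted += len(stale)
    return deleted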

The R2 storage system integrates closely with the core components of the Templar framework:

flowchart TD
    subgraph "R2 Storage Integration"
        direction TB
        R2["Cloudflare R2"]
        GradBucket["Gradients Bucket"]
        DataBucket["Dataset Bucket"] 
        AggBucket["Aggregator Bucket"]
    end
    
    subgraph "Miner Operations"
        MinerTrain["Model Training"]
        GradComp["Gradient Computation"]
        GradCompress["Gradient Compression"]
        DataLoad["R2DatasetLoader"]
    end
    
    subgraph "Validator Operations"
        GradDecomp["Gradient Decompression"]
        EvalGrad["Gradient Evaluation"]
        SetWeights["Set Weights on Chain"]
    end
    
    subgraph "Aggregator Operations"
        Gather["Gather Gradients"]
        Aggregate["Aggregate Updates"]
        StoreAgg["Store Aggregated State"]
    end
    
    DataBucket -->|"Load training data"| DataLoad
    DataLoad -->|"Tokenized text"| MinerTrain
    MinerTrain -->|"Model updates"| GradComp
    GradComp -->|"Gradient tensors"| GradCompress
    GradCompress -->|"Compressed gradients"| GradBucket
    
    GradBucket -->|"Download gradients"| GradDecomp
    GradDecomp -->|"Reconstructed gradients"| EvalGrad
    EvalGrad -->|"Quality score"| SetWeights
    
    GradBucket -->|"Download multiple gradients"| Gather
    Gather -->|"Combined gradients"| Aggregate
    Aggregate -->|"Aggregated state"| StoreAgg
    StoreAgg -->|"Upload aggregated model"| AggBucket

Sources: src/tplr/config.py:27-135, src/tplr/r2_dataset.py:33-45

The R2 storage system implements several security measures:

  1. Separate read and write credentials:

    • Read-only credentials for operations that only need to fetch data
    • Write credentials carefully restricted to components that need to modify data
  2. Access validation:

    • The validate_r2_access.py script verifies access permissions
    • Tests for correct isolation between read and write permissions
  3. Environment variable management:

    • Credentials stored in environment variables, not hard-coded
    • Required variables checked at startup

Sources: scripts/validate_r2_access.py:25-153, src/tplr/config.py:111-133
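
A minimal sketch of the startup check described above, using the R2_GRADIENTS_* variable names from the configuration section; the real validation lives in src/tplr/config.py and scripts/validate_r2_access.py:

import os

# Fail fast at startup if any required R2 credential variable is unset.
REQUIRED = [
    "R2_GRADIENTS_ACCOUNT_ID",
    "R2_GRADIENTS_BUCKET_NAME",
    "R2_GRADIENTS_READ_ACCESS_KEY_ID",
    "R2_GRADIENTS_READ_SECRET_ACCESS_KEY",
    "R2_GRADIENTS_WRITE_ACCESS_KEY_ID",
    "R2_GRADIENTS_WRITE_SECRET_ACCESS_KEY",
]

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing required R2 environment variables: {missing}")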

Common R2 storage issues and their solutions:

| Issue | Possible Cause | Solution |
| --- | --- | --- |
| Missing data files | Incorrect bucket configuration | Verify environment variables are correctly set |
| Access denied errors | Invalid or expired credentials | Update R2 tokens and verify with validate_r2_access.py |
| Slow data loading | Network congestion or high latency | Implement additional caching or add more dataset endpoints |
| Out of storage space | Accumulated gradient or checkpoint files | Run cleanup scripts to remove old objects |
| Timeout errors | Connection issues | Increase retry attempts and backoff in the configuration |

Sources: scripts/validate_r2_access.py:25-153, tests/test_r2_loader.py:311-438

For information about how dataset loading works beyond just the R2 storage aspect, see Data Management.

For details on how gradient sharing occurs within the network, see Gradient Processing.

For information on checkpoint management using R2 storage, see Checkpoint Management.