Experiment Tracking

This document describes how the Templar framework handles experiment tracking and monitoring using Weights & Biases (WandB) integration. It covers run initialization, version tracking, metrics logging, and hyperparameter management. For information about metrics logging to InfluxDB, see Metrics Logging, and for visualization with dashboards, see Dashboards.

Templar integrates deeply with Weights & Biases to provide comprehensive experiment tracking capabilities. This integration allows miners and validators to log model training metrics, hyperparameters, and version information in a structured manner, enabling detailed analysis of training progress across distributed nodes.

  • Run Management: Automatic creation and resumption of experiment runs
  • Version Tracking: Tracking metrics across different code versions
  • Hyperparameter Logging: Comprehensive tracking of model configurations
  • Metrics Namespacing: Versioned metrics organization for easy comparison

Sources: src/tplr/wandb.py:20-125

The experiment tracking system integrates with other Templar components as shown below:

flowchart TD
    subgraph "Experiment Tracking"
        WB["WandB Run"]
        VI["Version Information"]
        HT["Hyperparameter Tracking"]
        ML["Metrics Logging"]
    end

    subgraph "Templar Core"
        MN["Miner Node"]
        VL["Validator Node"]
        HP["Hyperparameters"]
        LG["Logging System"]
    end

    MN --> |"initialize_wandb()"| WB
    VL --> |"initialize_wandb()"| WB
    HP --> |"create_namespace()"| HT
    WB --> |"run.log()"| ML
    VI <--> |"version_history"| WB
    LG --> |"logger.info()"| MN
    LG --> |"logger.info()"| VL

Sources: src/tplr/wandb.py:20-125, src/tplr/hparams.py:62-104

The initialize_wandb function handles WandB run creation and resumption. It automatically detects existing runs based on a persistent run ID stored in the file system.

sequenceDiagram
    participant Node as "Miner/Validator"
    participant WandB as "initialize_wandb()"
    participant FSIO as "File System"
    participant WAPI as "WandB API"
    
    Node->>WandB: "Call with run_prefix, uid, config"
    WandB->>FSIO: "Check for run_id file"
    
    alt Run ID exists
        FSIO->>WandB: "Return stored run_id"
        WandB->>WAPI: "Verify run exists"
        
        alt Run found in WandB
            WAPI->>WandB: "Confirm run"
            WandB->>Node: "Resume existing run"
        else Run not found
            WAPI->>WandB: "Run not found"
            WandB->>FSIO: "Delete invalid run_id file"
            WandB->>WAPI: "Create new run"
            WandB->>FSIO: "Store new run_id"
        end
    else No run ID
        WandB->>WAPI: "Create new run"
        WandB->>FSIO: "Store new run_id"
    end
    
    WandB->>Node: "Return configured run"

Sources: src/tplr/wandb.py:20-45, src/tplr/wandb.py:120-124
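
A minimal sketch of the run-ID persistence half of this flow, with an on-disk layout and helper names invented for illustration (the real logic lives in src/tplr/wandb.py:20-45):

```python
import os
import wandb

def get_or_create_run_id(entity: str, project: str, run_name: str) -> str | None:
    """Return a stored run ID if the corresponding run still exists in
    WandB; otherwise discard the stale ID file and return None so the
    caller creates a fresh run. File path and names are illustrative."""
    run_id_file = os.path.join("wandb", f"run_id_{run_name}.txt")

    if os.path.exists(run_id_file):
        with open(run_id_file) as f:
            run_id = f.read().strip()
        try:
            # Verify the run exists server-side before resuming it.
            wandb.Api().run(f"{entity}/{project}/{run_id}")
            return run_id
        except Exception:
            # Run was deleted or the ID is invalid: start over.
            os.remove(run_id_file)
    return None

def store_run_id(run_name: str, run_id: str) -> None:
    """Persist the run ID so a restarted node resumes the same run."""
    run_id_file = os.path.join("wandb", f"run_id_{run_name}.txt")
    os.makedirs(os.path.dirname(run_id_file), exist_ok=True)
    with open(run_id_file, "w") as f:
        f.write(run_id)
```

With a verified ID in hand, initialization calls wandb.init with resume="must"; otherwise it creates a fresh run with resume="never" and stores the new run's ID.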

Templar tracks the software version used in each experiment run, making it possible to compare metrics across code versions and understand the impact of code changes.

  1. Version History: Each run maintains a history of all versions that contributed to it
  2. Current Version: The active code version is always tracked
  3. Versioned Metrics: All metrics are automatically prefixed with the version that logged them

flowchart LR
    subgraph "Version Management"
        VH["version_history\nArray"]
        CV["current_version\nProperty"]
    end
    
    subgraph "Metrics Prefixing"
        OM["Original Metrics\n{loss: 0.1, acc: 0.9}"]
        MM["Modified Metrics\n{v0.1.1/loss: 0.1,\nv0.1.1/acc: 0.9,\nlatest/loss: 0.1,\nlatest/acc: 0.9}"]
        VM["Versioned Metrics\nCollection"]
    end
    
    I["initialize_wandb()"] --> |"Track versions"| VH
    I --> |"Set current"| CV
    OM --> |"log_with_version()"| MM
    MM --> |"Store in WandB"| VM
    VH --> |"Reference for\nmetrics analysis"| VM

Sources: src/tplr/wandb.py:64-68, src/tplr/wandb.py:92-117
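
The bookkeeping itself is compact: on each (re)initialization the current code version is appended to a version_history list kept in the run's config. A hedged sketch of that update, assuming tplr exposes __version__ and with the helper name invented for illustration (see src/tplr/wandb.py:64-68 for the real code):

```python
from tplr import __version__  # the active code version, e.g. "0.1.1"

def track_version(run) -> None:
    """Record the current code version in the run's version history."""
    history = list(run.config.get("version_history", []))
    if __version__ not in history:
        history.append(__version__)  # each contributing version, once
    run.config.update(
        {"version_history": history, "current_version": __version__},
        allow_val_change=True,  # permit config updates on resumed runs
    )
```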

The WandB integration includes a custom logging wrapper that automatically adds version information to all metrics. This provides clear separation between metrics logged by different code versions.

  1. Original metrics are captured during training
  2. log_with_version transforms metrics by adding version prefixes
  3. Both version-specific (v{__version__}/metric) and latest (latest/metric) paths are maintained
  4. Step counting is handled automatically or can be explicitly provided

Sources: src/tplr/wandb.py:92-117
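
A sketch of such a wrapper, monkey-patching run.log so every call is version-prefixed; the internal step counter and names are illustrative (see src/tplr/wandb.py:92-117):

```python
def wrap_run_log(run, version: str):
    """Replace run.log with a version-aware wrapper: {"loss": 0.1}
    becomes {"v0.1.1/loss": 0.1, "latest/loss": 0.1} for version 0.1.1."""
    original_log = run.log
    state = {"step": 0}  # automatic step counter

    def log_with_version(metrics, step=None, **kwargs):
        if step is None:           # count steps unless one is supplied
            step = state["step"]
            state["step"] += 1
        versioned = {}
        for key, value in metrics.items():
            versioned[f"v{version}/{key}"] = value  # version-specific path
            versioned[f"latest/{key}"] = value      # always-current alias
        original_log(versioned, step=step, **kwargs)

    run.log = log_with_version
    return run
```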

Templar uses a structured approach to hyperparameter management, with defaults that can be overridden by configuration files.

flowchart TD
    subgraph "Configuration Sources"
        DF["DEFAULT_HPARAMS\nBuilt-in defaults"]
        HF["hparams.json\nProject config"]
        LH["hparams-local-run.json\nLocal overrides"]
    end
    
    subgraph "Hyperparameter Processing"
        LHP["load_hparams()\nFunction"]
        CNP["create_namespace()\nFunction"]
        NS["SimpleNamespace\nObject"]
    end
    
    subgraph "Model Configuration"
        TK["Tokenizer\nConfiguration"]
        MC["LlamaConfig\nModel structure"]
    end
    
    HF --> |"Load JSON"| LHP
    LH --> |"Optional local\noverrides"| LHP
    DF --> |"Default values"| CNP
    LHP --> |"Merged params"| CNP
    CNP --> |"Initialize"| NS
    NS --> |"Configure"| TK
    NS --> |"Configure"| MC
    NS --> |"Log to WandB"| WB["WandB Config"]

The system supports special local configurations through hparams-local-run.json, which is useful for development and testing.

Sources: src/tplr/hparams.py:26-59, src/tplr/hparams.py:107-145, hparams-local-run.json:1-9
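
A condensed sketch of that merge order (built-in defaults, then hparams.json, then the optional local overrides); the default values shown are placeholders, not the real contents of DEFAULT_HPARAMS:

```python
import json
import os
from types import SimpleNamespace

DEFAULT_HPARAMS = {
    "sequence_length": 2048,   # placeholder defaults; the real ones
    "learning_rate": 4e-4,     # live in src/tplr/hparams.py
}

def load_hparams(path="hparams.json", local_path="hparams-local-run.json"):
    params = dict(DEFAULT_HPARAMS)

    # Project configuration overrides the built-in defaults.
    with open(path) as f:
        params.update(json.load(f))

    # Optional local overrides, useful for development and testing.
    if os.path.exists(local_path):
        with open(local_path) as f:
            params.update(json.load(f))

    # create_namespace() additionally builds the tokenizer and LlamaConfig;
    # only the SimpleNamespace construction is shown here.
    return SimpleNamespace(**params)
```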

The following table shows the key configuration options used when initializing WandB runs:

| Parameter | Description                   | Default Value                              |
|-----------|-------------------------------|--------------------------------------------|
| project   | Project name                  | From config.project                        |
| entity    | Team or user account          | "tplr" (or None for private)               |
| id        | Run ID (for resuming)         | From stored run ID file                    |
| resume    | Resume policy                 | "must" if run ID exists, "never" otherwise |
| name      | Run name                      | "{run_prefix}{uid}"                        |
| group     | Grouping for related runs     | From parameter                             |
| job_type  | Type of job (miner/validator) | From parameter                             |
| tags      | Version tags                  | ["v{__version__}"]                         |

Sources: src/tplr/wandb.py:46-61
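
Mapped onto code, the table corresponds to an initialization call along these lines (a hedged sketch; the placeholder values and the exact keyword set may differ from src/tplr/wandb.py):

```python
import wandb
from types import SimpleNamespace

__version__ = "0.1.1"                        # stand-in for tplr's version
config = SimpleNamespace(project="templar")  # stand-in for loaded hparams
run_prefix, uid = "miner_", 42
run_id = None                                # from the stored run-ID file, if any

run = wandb.init(
    project=config.project,                  # project name
    entity="tplr",                           # or None for private accounts
    id=run_id,                               # run ID (for resuming)
    resume="must" if run_id else "never",    # resume policy
    name=f"{run_prefix}{uid}",               # "miner_42"
    group="miners",                          # grouping for related runs
    job_type="miner",                        # miner or validator
    tags=[f"v{__version__}"],                # version tag
    config=vars(config),                     # hyperparameters
)
```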

When running a miner or validator node, the experiment tracking system is initialized with appropriate parameters that identify the node type and purpose. The system will automatically handle run resumption if the node restarts.

sequenceDiagram
    participant MV as "Miner/Validator"
    participant HP as "Hyperparameters"
    participant WI as "WandB Integration"
    participant WB as "WandB Server"
    
    MV->>HP: "load_hparams()"
    HP->>MV: "Return config namespace"
    MV->>WI: "initialize_wandb('miner_', uid, config, 'miners', 'miner')"
    WI->>WB: "Create/resume run"
    WB->>WI: "Return run object"
    WI->>MV: "Return configured run with custom log method"
    
    loop Training Loop
        MV->>MV: "Train for one step"
        MV->>WI: "run.log({'loss': loss, 'perplexity': ppl})"
        WI->>WI: "Add version prefixes"
        WI->>WB: "Log versioned metrics"
    end
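
In code, the loop above reduces to roughly the following; the training step is a stub, and the import paths assume the modules cited throughout this page:

```python
import math

from tplr.hparams import load_hparams
from tplr.wandb import initialize_wandb

def train_one_step() -> float:
    return 0.1  # stub standing in for a real training step

hparams = load_hparams()

# Validators would pass "validator_", "validators", "validator" instead.
run = initialize_wandb("miner_", 42, hparams, "miners", "miner")

for _ in range(1_000):
    loss = train_one_step()
    # run.log is the version-aware wrapper installed at initialization,
    # so these metrics land under both v{__version__}/... and latest/... paths.
    run.log({"loss": loss, "perplexity": math.exp(loss)})
```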