Monitoring and Telemetry

This document describes the monitoring and telemetry system implemented in the Templar framework. It covers metrics collection, the logging infrastructure, and the visualization tools that provide observability into the distributed training process. For information about deploying the monitoring services, see Deployment.

Templar’s monitoring and telemetry system provides comprehensive visibility into the distributed training process through three main components:

  1. Metrics Logging - Collection and storage of time-series metrics in InfluxDB
  2. Experiment Tracking - Tracking of training runs with Weights & Biases (WandB)
  3. Structured Logging - Centralized logging with Loki for operational visibility

These components work together to provide a complete view of the training process, from model performance metrics to system resource utilization.

flowchart TD
    subgraph "Templar Nodes"
        M["Miners"]
        V["Validators"]
        A["Aggregator"]
        E["Evaluator"]
    end

    subgraph "Telemetry Infrastructure"
        I["InfluxDB"]
        W["Weights & Biases"]
        L["Loki"]
        G["Grafana"]
    end

    M -->|"Performance & System Metrics"| I
    V -->|"Performance & System Metrics"| I
    A -->|"Performance & System Metrics"| I
    E -->|"Performance & System Metrics"| I

    M -->|"Experiment Tracking"| W
    V -->|"Experiment Tracking"| W
    A -->|"Experiment Tracking"| W
    E -->|"Experiment Tracking"| W

    M -->|"Structured Logs"| L
    V -->|"Structured Logs"| L
    A -->|"Structured Logs"| L
    E -->|"Structured Logs"| L

    I -->|"Data Source"| G
    L -->|"Data Source"| G

Sources: src/tplr/metrics.py , src/tplr/wandb.py , src/tplr/logging.py

Templar uses a custom MetricsLogger class to collect and store metrics in InfluxDB. This system provides insights into training performance, system resource utilization, and gradient processing efficiency.

The metrics logging system is designed to be asynchronous and non-blocking to minimize impact on the training process. It uses a queue-based architecture to buffer metrics before sending them to InfluxDB.

sequenceDiagram
    participant Node as "Templar Node"
    participant Logger as "MetricsLogger"
    participant Queue as "Async Queue"
    participant Thread as "Background Thread"
    participant InfluxDB as "InfluxDB"

    Node->>Logger: log(measurement, tags, fields)
    Logger->>Logger: Process fields & add metadata
    Logger->>Queue: Enqueue Point
    Note over Queue,Thread: Non-blocking operation
    Thread->>Queue: Dequeue Point
    Thread->>InfluxDB: Write point (retry if needed)

Sources: src/tplr/metrics.py:182-226
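
The non-blocking hand-off shown above can be illustrated with a minimal sketch. This is not the actual MetricsLogger code; the connection details, bucket name, and helper names are placeholders.

import queue
import threading

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# Placeholder connection details
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="tplr")
write_api = client.write_api(write_options=SYNCHRONOUS)
point_queue: queue.Queue = queue.Queue()

def _writer_loop() -> None:
    # Background thread: drain the queue and write each point to InfluxDB
    while True:
        point = point_queue.get()
        try:
            write_api.write(bucket="tplr", record=point)
        except Exception as exc:
            # The real logger retries failed writes; here we just report them
            print(f"metrics write failed: {exc}")
        finally:
            point_queue.task_done()

threading.Thread(target=_writer_loop, daemon=True).start()

# The training loop only builds a Point and enqueues it (non-blocking)
point = Point("training_step").tag("uid", "42").field("loss", 0.75)
point_queue.put(point)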

The MetricsLogger class provides a simple interface for logging metrics while handling the complexity of asynchronous communication with InfluxDB.

Key features:

  • Automatic collection of system and GPU metrics
  • Statistical processing of list values (mean, min, max, median)
  • Tagging with runtime information (version, node role, etc.)
  • Configurable batching and retry logic
classDiagram
    class MetricsLogger {
        +client: InfluxDBClient
        +write_api: WriteApi
        +prefix: str
        +uid: str
        +role: str
        +version: str
        +runtime_id: str
        +log(measurement, tags, fields)
        +process_value(value)
        -_add_system_metrics(fields)
        -_add_gpu_metrics(tags, fields)
        -_add_standard_tags(point)
        -_add_config_tags(point)
        -_write_point(point)
    }

Sources: src/tplr/metrics.py:82-321
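
As an illustration of the statistical processing listed above, a helper in the spirit of process_value might reduce list-valued fields to summary statistics. This is a hedged sketch, not the code in src/tplr/metrics.py:

import statistics

def process_value(value):
    # Reduce a list of numbers to summary statistics; pass scalars through unchanged
    if isinstance(value, (list, tuple)) and value:
        return {
            "mean": statistics.fmean(value),
            "min": min(value),
            "max": max(value),
            "median": statistics.median(value),
        }
    return value

process_value([0.5, 1.0, 1.5])  # {"mean": 1.0, "min": 0.5, "max": 1.5, "median": 1.0}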

The MetricsLogger can be configured through environment variables and constructor parameters:

| Parameter | Environment Variable | Default | Description |
| --- | --- | --- | --- |
| host | INFLUXDB_HOST | AWS InfluxDB endpoint | InfluxDB server hostname |
| port | INFLUXDB_PORT | 8086 | InfluxDB server port |
| database | INFLUXDB_DATABASE | "tplr" | InfluxDB bucket/database name |
| token | INFLUXDB_TOKEN | Fallback token | Authentication token |
| org | INFLUXDB_ORG | "tplr" | InfluxDB organization |
| prefix | - | "" | Prefix for all metrics |
| role | - | "" | Node role (miner, validator, etc.) |

Sources: src/tplr/metrics.py:45-59 , src/tplr/metrics.py:88-105
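
A hedged sketch of how these settings might be resolved, with environment variables taking precedence over the defaults (the placeholder values and the import path are assumptions):

import os

from tplr.metrics import MetricsLogger  # import path assumed

# Environment variables override the built-in defaults
influx_config = {
    "host": os.environ.get("INFLUXDB_HOST", "influxdb.example.com"),  # placeholder default
    "port": int(os.environ.get("INFLUXDB_PORT", "8086")),
    "database": os.environ.get("INFLUXDB_DATABASE", "tplr"),
    "org": os.environ.get("INFLUXDB_ORG", "tplr"),
    "token": os.environ.get("INFLUXDB_TOKEN", ""),  # the real code falls back to a bundled token
}

metrics_logger = MetricsLogger(prefix="validator", role="validator", **influx_config)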

When requested via the with_system_metrics and with_gpu_metrics flags, the metrics logging system automatically collects system and GPU metrics:

System Metrics:

  • CPU usage percentage
  • Memory usage (used and total)

GPU Metrics:

  • GPU memory allocated
  • GPU memory cached/reserved
  • GPU memory segments
  • Total GPU memory used
# Example usage
metrics_logger.log(
    measurement="training_step",
    tags={"uid": 42},
    fields={"loss": 0.75, "learning_rate": 0.001},
    with_system_metrics=True,
    with_gpu_metrics=True,
)

Sources: src/tplr/metrics.py:324-359
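
Under the hood, fields like these can be gathered with standard tooling. The following sketch uses psutil and torch; the field names are illustrative rather than the exact ones emitted by src/tplr/metrics.py:

import psutil
import torch

def add_system_metrics(fields: dict) -> None:
    # CPU and memory usage collected via psutil
    fields["sys_cpu_usage"] = psutil.cpu_percent()
    mem = psutil.virtual_memory()
    fields["sys_mem_used"] = mem.used
    fields["sys_mem_total"] = mem.total

def add_gpu_metrics(fields: dict) -> None:
    # GPU memory statistics collected via torch, if a GPU is present
    if torch.cuda.is_available():
        fields["gpu_mem_allocated"] = torch.cuda.memory_allocated()
        fields["gpu_mem_cached"] = torch.cuda.memory_reserved()
        fields["gpu_mem_total"] = torch.cuda.get_device_properties(0).total_memory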

Templar integrates with Weights & Biases (WandB) for experiment tracking and visualization. This integration provides a way to track training progress, compare different runs, and visualize model performance.

The WandB integration is designed to handle versioning and run resumption, ensuring that training runs are properly tracked even if they span multiple sessions or software versions.

flowchart TD
    subgraph "Templar Node"
        Init["initialize_wandb()"]
        Log["run.log()"]
    end

    subgraph "WandB"
        API["WandB API"]
        RunObj["Run Object"]
        Storage["Cloud Storage"]
    end

    Init -->|"Create/Resume Run"| API
    API -->|"Return Run Object"| RunObj
    Log -->|"Log metrics with version prefix"| RunObj
    RunObj -->|"Store metrics & artifacts"| Storage

Sources: src/tplr/wandb.py:20-125

A unique feature of Templar's WandB integration is version tracking. Metrics are automatically prefixed with the current software version, making it easy to compare performance across software versions. For example, a run on version 0.1.0 logs each metric under both a version-specific key and a "latest" alias:

v0.1.0/loss
v0.1.0/step
latest/loss
latest/step

The system also maintains a version history in the run configuration, enabling tracking of when a run was executed with different versions of the software.

Sources: src/tplr/wandb.py:62-68 , src/tplr/wandb.py:104-116
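
A minimal sketch of this prefixing (the version string and helper name are illustrative; run is the object returned by initialize_wandb):

__version__ = "0.1.0"  # illustrative version string

def log_with_version(run, metrics: dict) -> None:
    # Log each metric under a version-specific key and a "latest" alias
    prefixed = {}
    for key, value in metrics.items():
        prefixed[f"v{__version__}/{key}"] = value
        prefixed[f"latest/{key}"] = value
    run.log(prefixed)

log_with_version(run, {"loss": 0.75, "step": 100})  # produces v0.1.0/loss, latest/loss, ...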

The integration supports run resumption by maintaining a record of run IDs and checking for existing runs when initializing WandB. This allows for seamless continuation of training across multiple sessions.

# Run ID is stored in a file
run_id_file = os.path.join(wandb_dir, f"wandb_run_id_{run_prefix}{uid}.txt")

Sources: src/tplr/wandb.py:28-43 , src/tplr/wandb.py:121-123
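
Resumption can then follow the usual wandb pattern. This is a sketch under assumed values for wandb_dir, run_prefix, uid, and the project name:

import os
import wandb

wandb_dir = "wandb"                 # assumed directory
run_prefix, uid = "miner_", "123"   # assumed identifiers
run_id_file = os.path.join(wandb_dir, f"wandb_run_id_{run_prefix}{uid}.txt")

# Reuse a previously stored run ID if one exists
run_id = None
if os.path.exists(run_id_file):
    with open(run_id_file) as f:
        run_id = f.read().strip()

run = wandb.init(project="templar", id=run_id, resume="allow")  # project name assumed

# Persist the run ID so a later session resumes the same run
with open(run_id_file, "w") as f:
    f.write(run.id)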

Templar uses a structured logging system with Loki integration for centralized log collection and analysis. This system provides operational visibility into the distributed training process.

The logging system is built on Python’s standard logging module with enhancements for structured logging and Loki integration. It uses a queue-based architecture for asynchronous log transmission.

flowchart TD
    subgraph "Templar Node"
        Log["logger.info()"]
        QH["QueueHandler"]
        Rich["RichHandler"]
    end

    subgraph "Background Processing"
        QL["QueueListener"]
        LH["LokiHandler"]
    end

    subgraph "Infrastructure"
        Loki["Loki Server"]
        Console["Console Output"]
    end

    Log -->|"Log Record"| QH
    Log -->|"Log Record"| Rich
    QH -->|"Enqueue"| QL
    QL -->|"Dequeue"| LH
    Rich -->|"Format & Display"| Console
    LH -->|"HTTP POST"| Loki

Sources: src/tplr/logging.py:149-287
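
The queue-based hand-off can be reproduced with the standard library; the Loki handler below is a stand-in for the real one, and the logger name is an assumption:

import logging
import queue
from logging.handlers import QueueHandler, QueueListener

from rich.logging import RichHandler

log_queue: queue.Queue = queue.Queue(-1)

# Stand-in for the handler that ships records to the Loki server over HTTP
loki_handler = logging.NullHandler()

# Console output goes straight to RichHandler; Loki delivery goes through the queue
logger = logging.getLogger("templar")
logger.setLevel(logging.INFO)
logger.addHandler(RichHandler())
logger.addHandler(QueueHandler(log_queue))

# A background listener drains the queue and forwards records to the Loki handler
listener = QueueListener(log_queue, loki_handler, respect_handler_level=True)
listener.start()

logger.info("Training step completed")  # displayed immediately, shipped asynchronously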

The logging system supports context-based logging, allowing additional metadata to be attached to log messages. This metadata can include information like trace IDs, batch identifiers, or custom metrics.

# Example context-based logging
log_with_context(
    'info',
    'Processing batch',
    batch_size=32,
    batch_id='abc123',
)

Sources: src/tplr/logging.py:290-309

The logging system includes custom filters to reduce noise and focus on relevant information. For example, it filters out common subtensor warnings that are not actionable.

import logging

class NoSubtensorWarning(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # Return False if the record contains the undesired subtensor warning
        return (
            "Verify your local subtensor is running on port" not in record.getMessage()
        )

Sources: src/tplr/logging.py:91-96
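
One way to apply such a filter (the target logger name is an assumption):

# Suppress the warning on all records passing through this logger
logging.getLogger("templar").addFilter(NoSubtensorWarning())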

Templar uses Grafana for dashboard visualization of metrics stored in InfluxDB and logs stored in Loki. This provides a unified view of system performance and health.

Grafana is configured to connect to InfluxDB and Loki as data sources. The configuration is managed through Ansible for automated deployment.

flowchart TD
    subgraph "Data Sources"
        I["InfluxDB"]
        L["Loki"]
    end

    subgraph "Grafana"
        G["Grafana Server"]
        D1["Training Metrics Dashboard"]
        D2["System Metrics Dashboard"]
        D3["Log Analytics Dashboard"]
    end

    I -->|"Time Series Data"| G
    L -->|"Log Data"| G
    G --- D1
    G --- D2
    G --- D3

Sources: telemetry/ansible/host_vars/grafana_prod.yml

The monitoring system collects a wide range of metrics that can be visualized in Grafana:

| Category | Metrics |
| --- | --- |
| Training | Loss, learning rate, gradient norms |
| System | CPU usage, memory usage |
| GPU | Memory allocated, memory cached |
| Network | Active peers, sync status |
| Performance | Gradient compression ratio, training throughput |

Sources: src/tplr/metrics.py:182-226 , tests/test_metrics_logger.py:315-331

The monitoring and telemetry system is integrated throughout the Templar ecosystem, providing visibility into all components of the distributed training process.

flowchart TD
    subgraph "Templar Framework"
        M["Miner"]
        V["Validator"]
        A["Aggregator"]
        C["Comms"]
        E["Evaluator"]
    end

    subgraph "Metrics Collection"
        ML["MetricsLogger"]
        W["WandB Integration"]
        Log["Structured Logger"]
    end

    subgraph "Infrastructure"
        I["InfluxDB"]
        WB["Weights & Biases"]
        L["Loki"]
        G["Grafana"]
    end

    M -->|"Training Metrics"| ML
    V -->|"Validation Metrics"| ML
    A -->|"Aggregation Metrics"| ML
    C -->|"Communication Metrics"| ML
    E -->|"Evaluation Metrics"| ML

    M -->|"Experiment Data"| W
    V -->|"Experiment Data"| W
    A -->|"Experiment Data"| W
    E -->|"Experiment Data"| W

    M -->|"Operational Logs"| Log
    V -->|"Operational Logs"| Log
    A -->|"Operational Logs"| Log
    C -->|"Operational Logs"| Log
    E -->|"Operational Logs"| Log

    ML -->|"Time Series Data"| I
    W -->|"Run Data"| WB
    Log -->|"Structured Logs"| L

    I -->|"Data Source"| G
    L -->|"Data Source"| G

Sources: src/tplr/metrics.py , src/tplr/wandb.py , src/tplr/logging.py

# Initialize MetricsLogger
metrics_logger = MetricsLogger(
    prefix="miner",
    uid="123",
    role="miner",
    group="finney",
    job_type="training",
)

# Log training metrics
metrics_logger.log(
    measurement="training_step",
    tags={"batch_id": batch_id},
    fields={
        "loss": loss.item(),
        "learning_rate": current_lr,
        "gradient_norm": gradient_norm,
    },
    with_system_metrics=True,
    with_gpu_metrics=True,
)

Sources: src/tplr/metrics.py:88-105 , src/tplr/metrics.py:182-226

# Initialize W&B
run = initialize_wandb(
    run_prefix="miner_",
    uid="123",
    config=config,
    group="finney",
    job_type="training",
)

# Log metrics to W&B
run.log({
    "loss": loss.item(),
    "learning_rate": current_lr,
    "gradient_norm": gradient_norm,
})

Sources: src/tplr/wandb.py:20-125

# Setup Loki logger
logger = setup_loki_logger(
    service="miner",
    uid="123",
    version="0.1.0",
    environment="finney",
)

# Log with context
logger.log_with_context(
    "info",
    "Completed training step",
    step=current_step,
    loss=loss.item(),
    duration=step_duration,
)

Sources: src/tplr/logging.py:149-287 , src/tplr/logging.py:290-309

If metrics are not appearing in Grafana dashboards:

  1. Check that the INFLUXDB_TOKEN environment variable is set correctly
  2. Verify that the InfluxDB host is reachable from the Templar nodes (a quick connectivity check is sketched after this list)
  3. Check the console logs for any errors related to metrics logging
  4. Verify that Grafana is properly configured to connect to InfluxDB and Loki
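
The first two checks can be automated with a quick ping, sketched here with the influxdb-client package (URL and organization are placeholders):

import os
from influxdb_client import InfluxDBClient

token = os.environ.get("INFLUXDB_TOKEN")
if not token:
    raise SystemExit("INFLUXDB_TOKEN is not set")

# ping() returns True when the InfluxDB server is reachable and responding
client = InfluxDBClient(url="http://localhost:8086", token=token, org="tplr")
print("InfluxDB reachable:", client.ping())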

If W&B integration is not working:

  1. Check for the existence of a wandb_run_id_*.txt file in the wandb directory
  2. Verify that the W&B API can be reached from the Templar nodes
  3. Check that the project name is correctly configured

Sources: src/tplr/metrics.py:254 , src/tplr/wandb.py:36-43