Metrics Logging

Relevant Source Files

src/tplr/metrics.py telemetry/ansible/host_vars/grafana_prod.yml telemetry/simulator/testing_metrics.py tests/conftest.py tests/test_metrics_logger.py

Purpose and Scope

The Metrics Logging system in Templar provides comprehensive monitoring capabilities for tracking training progress, system performance, and resource utilization across the distributed training framework. It enables real-time collection of metrics from miners, validators, and other components, with data stored in InfluxDB for subsequent analysis and visualization. For information about visualizing this data through dashboards, see Dashboards. For experiment tracking with Weights & Biases, see Experiment Tracking.

Architecture Overview

The Metrics Logging system is built around the MetricsLogger class, which collects metrics data and asynchronously sends it to an InfluxDB instance.

Data Flow Diagram

flowchart TD
    subgraph "Templar Components"
        M["Miner"]
        V["Validator"]
        E["Evaluator"]
        A["Aggregator"]
    end
    
    subgraph "MetricsLogger"
        MLC["log() Method"]
        PT["Process & Transform"]
        PP["Point Preparation"]
        AQ["Async Queue"]
        BW["Background Writer"]
    end
    
    subgraph "Storage & Visualization"
        IDB["InfluxDB"]
        GF["Grafana Dashboards"]
    end
    
    M --> MLC
    V --> MLC
    E --> MLC
    A --> MLC
    
    MLC --> PT
    PT --> PP
    PP --> AQ
    AQ --> BW
    BW --> IDB
    IDB --> GF

Sources: src/tplr/metrics.py:82-360

MetricsLogger Internal Architecture

flowchart TD
    subgraph "Public Interface"
        LOG["log(measurement, tags, fields)"]
        PV["process_value(v)"]
    end
    
    subgraph "Internal Components"
        PF["_process_fields()"]
        ASM["_add_system_metrics()"]
        AGM["_add_gpu_metrics()"]
        AT["_add_tags()"]
        AST["_add_standard_tags()"]
        ACT["_add_config_tags()"]
        Q["Queue"]
        CON["_consumer()"]
        H["_handle()"]
        WP["_write_point()"]
    end
    
    subgraph "External Dependencies"
        IDB["InfluxDBClient"]
        WA["write_api"]
        P["Point"]
    end
    
    LOG --> PF
    PF --> PV
    PF --> ASM
    PF --> AGM
    LOG --> AT
    AT --> AST
    AT --> ACT
    LOG --> Q
    Q --> CON
    CON --> H
    H --> WP
    WP --> WA
    WA --> IDB
    
    ASM --> GSM["get_system_metrics()"]
    AGM --> GGM["get_gpu_metrics()"]

Sources: src/tplr/metrics.py:182-255

Core Components

MetricsLogger Class

The MetricsLogger class is the central component of the metrics logging system. It handles initialization of the InfluxDB connection, processes data types appropriately, and manages asynchronous writes to avoid blocking the main training loop.

classDiagram
    class MetricsLogger {
        +__init__(host, port, database, token, org, prefix, uid, role, config, group, job_type)
        +process_value(v)
        +log(measurement, tags, fields, timestamp, with_system_metrics, with_gpu_metrics)
        -_run()
        -_consumer()
        -_handle(point)
        -_write_point(point)
        -_process_fields(fields)
        -_add_system_metrics(fields)
        -_add_gpu_metrics(tags, fields)
        -_add_tags(point, tags)
        -_add_standard_tags(point)
        -_add_config_tags(point)
    }

Sources: src/tplr/metrics.py:82-321

Configuration Options

The metrics logger can be configured with the following parameters:

Parameter	Description	Default
`host`	InfluxDB host	From `INFLUXDB_HOST` env var or AWS TimeStream instance
`port`	InfluxDB port	From `INFLUXDB_PORT` env var or `8086`
`database`	InfluxDB bucket	From `INFLUXDB_DATABASE` env var or `"tplr"`
`token`	Authentication token	From `INFLUXDB_TOKEN` env var or fallback
`org`	InfluxDB organization	From `INFLUXDB_ORG` env var or `"tplr"`
`prefix`	Metric name prefix	`""`
`uid`	Unique identifier	`None`
`role`	Node role	`""` (e.g., “miner”, “validator”)
`config`	Bittensor config	`None`
`group`	Group identifier	`""`
`job_type`	Job type	`""`
`max_queue_size`	Max queue size	`600_000`
`max_workers`	Thread pool size	`1`

Sources: src/tplr/metrics.py:88-105 , src/tplr/metrics.py:44-53

Data Processing

The process_value method handles different types of data with type-specific processing:

Data Type	Processing Behavior
Integer	Preserved as integer
Float	Preserved as float
List of integers	Converted to string (for peer IDs)
List of numbers	Converted to statistics object (mean, min, max, median)
String	Preserved as string
None	Converted to `0.0`
Other types	Converted to string

Sources: src/tplr/metrics.py:154-180

System and GPU Metrics Collection

The metrics logger can automatically collect and include system and GPU metrics when requested:

# System metrics collected (when with_system_metrics=True)
{
    "sys_cpu_usage": <percent>,             # CPU utilization percentage
    "sys_mem_used": <megabytes>,            # RAM used in MB
    "sys_mem_total": <megabytes>            # Total RAM in MB
}

# GPU metrics collected (when with_gpu_metrics=True)
{
    "gpu_mem_segments": <count>,            # GPU memory segments
    "gpu_mem_allocated_mb": <megabytes>,    # Allocated GPU memory
    "gpu_mem_cached_mb": <megabytes>,       # Cached GPU memory
    "gpu_mem_total_mb": <megabytes>,        # Total GPU memory
    "gpu_name": <device_name>,              # GPU device name (as tag)
    "gpu_id": <device_id>                   # GPU device ID (as tag)
}

Sources: src/tplr/metrics.py:324-359

Asynchronous Processing

The MetricsLogger uses asynchronous processing to avoid blocking the main thread during metric collection:

Each log call adds a point to an async queue
A background thread continuously processes items from the queue
Write operations are performed in a thread pool executor
Failed writes are logged but do not crash the application

Sources: src/tplr/metrics.py:227-255

Usage Examples

Basic Initialization

from tplr.metrics import MetricsLogger

# Basic initialization with role
logger = MetricsLogger(
    prefix="miner",
    uid="123",
    role="miner",
    group="training_group"
)

# With custom InfluxDB settings
logger = MetricsLogger(
    host="custom.influxdb.host",
    port=8086,
    token="your-token",
    org="your-org",
    database="your-bucket",
    prefix="validator",
    uid="456",
    role="validator"
)

Logging Metrics

# Basic metrics logging
logger.log(
    measurement="training_step",
    tags={"step": 42},
    fields={"loss": 0.75}
)

# With system and GPU metrics
logger.log(
    measurement="training_step",
    tags={"batch": 100, "epoch": 5},
    fields={"loss": 0.345, "accuracy": 0.92},
    with_system_metrics=True,
    with_gpu_metrics=True
)

# Logging metrics with lists that will be processed automatically
logger.log(
    measurement="training_step",
    tags={"step": 42},
    fields={
        "peer_ids": [1, 2, 3, 4],           # Will be stored as string
        "losses": [0.1, 0.2, 0.3, 0.4]      # Will be stored as statistics
    }
)

Sources: src/tplr/metrics.py:182-201

Integration with InfluxDB

The MetricsLogger integrates with InfluxDB via the official InfluxDB client library. It uses:

A custom write options class (MertricsLoggerWriteOptions) for configuration
The Point class for structuring metrics data
Batched writes with configurable batch sizes (defaults: 10,000 for validators, 1,000 for other roles)
The InfluxDB 2.0 API with organization and bucket concepts

Sources: src/tplr/metrics.py:62-79 , src/tplr/metrics.py:130-136

Best Practices

Use consistent measurement names: Keep measurement names consistent across related metrics for easier querying
Add contextual tags: Include tags like step, epoch, batch to segment and filter data
Include system metrics for performance-related data: Always enable with_system_metrics=True when logging performance data
Group related metrics in single calls: Combine related metrics in a single log() call rather than multiple calls
Handle exceptions: The logger is designed to handle exceptions internally, but be aware of potential connection issues
Use appropriate value types: The process_value() method handles type conversion, but using appropriate types initially improves clarity

Sources: tests/test_metrics_logger.py:316-331

Testing and Debugging

The metrics logger includes comprehensive test coverage. For troubleshooting:

Enable debug logging to see detailed information about metric writes
Use the testing_metrics.py script to verify connectivity to InfluxDB
Check the Grafana dashboards to confirm metrics are being received

Sources: tests/test_metrics_logger.py:1-342 , telemetry/simulator/testing_metrics.py:1-52