Metrics Logging

The Metrics Logging system in Templar provides comprehensive monitoring capabilities for tracking training progress, system performance, and resource utilization across the distributed training framework. It enables real-time collection of metrics from miners, validators, and other components, with data stored in InfluxDB for subsequent analysis and visualization. For information about visualizing this data through dashboards, see Dashboards. For experiment tracking with Weights & Biases, see Experiment Tracking.

The Metrics Logging system is built around the MetricsLogger class, which collects metrics data and asynchronously sends it to an InfluxDB instance.

flowchart TD
    subgraph "Templar Components"
        M["Miner"]
        V["Validator"]
        E["Evaluator"]
        A["Aggregator"]
    end
    
    subgraph "MetricsLogger"
        MLC["log() Method"]
        PT["Process & Transform"]
        PP["Point Preparation"]
        AQ["Async Queue"]
        BW["Background Writer"]
    end
    
    subgraph "Storage & Visualization"
        IDB["InfluxDB"]
        GF["Grafana Dashboards"]
    end
    
    M --> MLC
    V --> MLC
    E --> MLC
    A --> MLC
    
    MLC --> PT
    PT --> PP
    PP --> AQ
    AQ --> BW
    BW --> IDB
    IDB --> GF

Sources: src/tplr/metrics.py:82-360

flowchart TD
    subgraph "Public Interface"
        LOG["log(measurement, tags, fields)"]
        PV["process_value(v)"]
    end
    
    subgraph "Internal Components"
        PF["_process_fields()"]
        ASM["_add_system_metrics()"]
        AGM["_add_gpu_metrics()"]
        AT["_add_tags()"]
        AST["_add_standard_tags()"]
        ACT["_add_config_tags()"]
        Q["Queue"]
        CON["_consumer()"]
        H["_handle()"]
        WP["_write_point()"]
    end
    
    subgraph "External Dependencies"
        IDB["InfluxDBClient"]
        WA["write_api"]
        P["Point"]
    end
    
    LOG --> PF
    PF --> PV
    PF --> ASM
    PF --> AGM
    LOG --> AT
    AT --> AST
    AT --> ACT
    LOG --> Q
    Q --> CON
    CON --> H
    H --> WP
    WP --> WA
    WA --> IDB
    
    ASM --> GSM["get_system_metrics()"]
    AGM --> GGM["get_gpu_metrics()"]

Sources: src/tplr/metrics.py:182-255

The MetricsLogger class is the central component of the metrics logging system. It handles initialization of the InfluxDB connection, processes data types appropriately, and manages asynchronous writes to avoid blocking the main training loop.

classDiagram
    class MetricsLogger {
        +__init__(host, port, database, token, org, prefix, uid, role, config, group, job_type)
        +process_value(v)
        +log(measurement, tags, fields, timestamp, with_system_metrics, with_gpu_metrics)
        -_run()
        -_consumer()
        -_handle(point)
        -_write_point(point)
        -_process_fields(fields)
        -_add_system_metrics(fields)
        -_add_gpu_metrics(tags, fields)
        -_add_tags(point, tags)
        -_add_standard_tags(point)
        -_add_config_tags(point)
    }

Sources: src/tplr/metrics.py:82-321

The metrics logger can be configured with the following parameters:

| Parameter | Description | Default |
|-----------|-------------|---------|
| host | InfluxDB host | From INFLUXDB_HOST env var, or AWS Timestream instance |
| port | InfluxDB port | From INFLUXDB_PORT env var, or 8086 |
| database | InfluxDB bucket | From INFLUXDB_DATABASE env var, or "tplr" |
| token | Authentication token | From INFLUXDB_TOKEN env var, or fallback |
| org | InfluxDB organization | From INFLUXDB_ORG env var, or "tplr" |
| prefix | Metric name prefix | "" |
| uid | Unique identifier | None |
| role | Node role (e.g., "miner", "validator") | "" |
| config | Bittensor config | None |
| group | Group identifier | "" |
| job_type | Job type | "" |
| max_queue_size | Maximum queue size | 600_000 |
| max_workers | Thread pool size | 1 |

Sources: src/tplr/metrics.py:88-105 , src/tplr/metrics.py:44-53
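The fallback order for these defaults (explicit argument, then environment variable, then hard-coded default) can be sketched as follows. Note that `resolve_setting` is a hypothetical helper used only for illustration, not part of the Templar API:

```python
import os

def resolve_setting(explicit, env_var, default):
    """Return the first defined value among: explicit argument,
    environment variable, hard-coded default."""
    if explicit is not None:
        return explicit
    return os.environ.get(env_var, default)

# Example: port falls back to 8086 when neither an argument
# nor the INFLUXDB_PORT env var is set.
port = resolve_setting(None, "INFLUXDB_PORT", 8086)
```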

The process_value method handles different types of data with type-specific processing:

| Data Type | Processing Behavior |
|-----------|---------------------|
| Integer | Preserved as integer |
| Float | Preserved as float |
| List of integers | Converted to string (for peer IDs) |
| List of numbers | Converted to statistics object (mean, min, max, median) |
| String | Preserved as string |
| None | Converted to 0.0 |
| Other types | Converted to string |

Sources: src/tplr/metrics.py:154-180
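The table above can be sketched as a standalone function. This is a simplified reconstruction of the documented behavior, not the actual implementation from src/tplr/metrics.py:

```python
import statistics

def process_value(v):
    """Sketch of the documented type handling for metric field values."""
    if v is None:
        return 0.0                       # None becomes 0.0
    if isinstance(v, (int, float, str)):
        return v                         # scalars pass through unchanged
    if isinstance(v, list):
        if v and all(isinstance(x, int) for x in v):
            return str(v)                # integer lists (peer IDs) become strings
        if v and all(isinstance(x, (int, float)) for x in v):
            return {                     # numeric lists become summary statistics
                "mean": statistics.mean(v),
                "min": min(v),
                "max": max(v),
                "median": statistics.median(v),
            }
    return str(v)                        # everything else is stringified
```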

The metrics logger can automatically collect and include system and GPU metrics when requested:

# System metrics collected (when with_system_metrics=True)
{
    "sys_cpu_usage": <percent>,     # CPU utilization percentage
    "sys_mem_used": <megabytes>,    # RAM used in MB
    "sys_mem_total": <megabytes>    # Total RAM in MB
}

# GPU metrics collected (when with_gpu_metrics=True)
{
    "gpu_mem_segments": <count>,          # GPU memory segments
    "gpu_mem_allocated_mb": <megabytes>,  # Allocated GPU memory
    "gpu_mem_cached_mb": <megabytes>,     # Cached GPU memory
    "gpu_mem_total_mb": <megabytes>,      # Total GPU memory
    "gpu_name": <device_name>,            # GPU device name (as tag)
    "gpu_id": <device_id>                 # GPU device ID (as tag)
}

Sources: src/tplr/metrics.py:324-359
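The merge behavior can be sketched as follows. Both `build_fields` and the stand-in collector are hypothetical and shown only to illustrate how optional metrics augment the caller's fields; the real logger calls its internal get_system_metrics() and get_gpu_metrics() helpers:

```python
def build_fields(user_fields, with_system_metrics=False, system_collector=None):
    """Merge optionally collected system metrics into a copy of the user's fields."""
    fields = dict(user_fields)  # never mutate the caller's dict
    if with_system_metrics and system_collector is not None:
        fields.update(system_collector())
    return fields

# Hypothetical collector standing in for get_system_metrics()
def fake_system_collector():
    return {"sys_cpu_usage": 12.5, "sys_mem_used": 1024.0, "sys_mem_total": 8192.0}
```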

The MetricsLogger uses asynchronous processing to avoid blocking the main thread during metric collection:

  1. Each log call adds a point to an async queue
  2. A background thread continuously processes items from the queue
  3. Write operations are performed in a thread pool executor
  4. Failed writes are logged but do not crash the application

Sources: src/tplr/metrics.py:227-255
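The steps above follow a standard producer/consumer pattern, sketched here in simplified form (the real class uses an async queue and a thread pool executor, and routes failures to its logger rather than stdout):

```python
import queue
import threading

def start_writer(write_point):
    """Start a background thread that drains points from a queue.
    write_point is the function that performs the actual write."""
    q = queue.Queue()

    def consumer():
        while True:
            point = q.get()
            if point is None:          # sentinel value: shut down cleanly
                break
            try:
                write_point(point)     # failures are reported, never raised
            except Exception as exc:
                print(f"write failed: {exc}")
            finally:
                q.task_done()

    t = threading.Thread(target=consumer, daemon=True)
    t.start()
    return q, t
```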

from tplr.metrics import MetricsLogger

# Basic initialization with role
logger = MetricsLogger(
    prefix="miner",
    uid="123",
    role="miner",
    group="training_group"
)

# With custom InfluxDB settings
logger = MetricsLogger(
    host="custom.influxdb.host",
    port=8086,
    token="your-token",
    org="your-org",
    database="your-bucket",
    prefix="validator",
    uid="456",
    role="validator"
)

# Basic metrics logging
logger.log(
    measurement="training_step",
    tags={"step": 42},
    fields={"loss": 0.75}
)

# With system and GPU metrics
logger.log(
    measurement="training_step",
    tags={"batch": 100, "epoch": 5},
    fields={"loss": 0.345, "accuracy": 0.92},
    with_system_metrics=True,
    with_gpu_metrics=True
)

# Logging metrics with lists that will be processed automatically
logger.log(
    measurement="training_step",
    tags={"step": 42},
    fields={
        "peer_ids": [1, 2, 3, 4],       # Will be stored as string
        "losses": [0.1, 0.2, 0.3, 0.4]  # Will be stored as statistics
    }
)

Sources: src/tplr/metrics.py:182-201

The MetricsLogger integrates with InfluxDB via the official InfluxDB client library. It uses:

  1. A custom write options class (MertricsLoggerWriteOptions) for configuration
  2. The Point class for structuring metrics data
  3. Batched writes with configurable batch sizes (defaults: 10,000 for validators, 1,000 for other roles)
  4. The InfluxDB 2.0 API with organization and bucket concepts

Sources: src/tplr/metrics.py:62-79 , src/tplr/metrics.py:130-136
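The role-dependent batch sizing can be sketched with a hypothetical helper mirroring the documented defaults (in the actual code these values are configured through the custom write options class):

```python
def batch_size_for(role):
    """Validators use larger write batches than other roles,
    per the documented defaults (10,000 vs 1,000)."""
    return 10_000 if role == "validator" else 1_000
```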

  1. Use consistent measurement names: Keep measurement names consistent across related metrics for easier querying

  2. Add contextual tags: Include tags like step, epoch, batch to segment and filter data

  3. Include system metrics for performance-related data: Always enable with_system_metrics=True when logging performance data

  4. Group related metrics in single calls: Combine related metrics in a single log() call rather than multiple calls

  5. Handle exceptions: The logger is designed to handle exceptions internally, but be aware of potential connection issues

  6. Use appropriate value types: The process_value() method handles type conversion, but using appropriate types initially improves clarity

Sources: tests/test_metrics_logger.py:316-331

The metrics logger includes comprehensive test coverage. For troubleshooting:

  1. Enable debug logging to see detailed information about metric writes
  2. Use the testing_metrics.py script to verify connectivity to InfluxDB
  3. Check the Grafana dashboards to confirm metrics are being received

Sources: tests/test_metrics_logger.py:1-342 , telemetry/simulator/testing_metrics.py:1-52