# Monitoring and Telemetry
This document describes the monitoring and telemetry system implemented in the Templar framework. It covers the metrics collection, logging infrastructure, and visualization tools that provide observability into the distributed training process. For information about deploying the monitoring services, see Deployment.
## Overview

Templar’s monitoring and telemetry system provides comprehensive visibility into the distributed training process through three main components:
- Metrics Logging - Collection and storage of time-series metrics in InfluxDB
- Experiment Tracking - Tracking of training runs with Weights & Biases (WandB)
- Structured Logging - Centralized logging with Loki for operational visibility
These components work together to provide a complete view of the training process, from model performance metrics to system resource utilization.
```mermaid
flowchart TD
    subgraph "Templar Nodes"
        M["Miners"]
        V["Validators"]
        A["Aggregator"]
        E["Evaluator"]
    end
    subgraph "Telemetry Infrastructure"
        I["InfluxDB"]
        W["Weights & Biases"]
        L["Loki"]
        G["Grafana"]
    end
    M -->|"Performance & System Metrics"| I
    V -->|"Performance & System Metrics"| I
    A -->|"Performance & System Metrics"| I
    E -->|"Performance & System Metrics"| I
    M -->|"Experiment Tracking"| W
    V -->|"Experiment Tracking"| W
    A -->|"Experiment Tracking"| W
    E -->|"Experiment Tracking"| W
    M -->|"Structured Logs"| L
    V -->|"Structured Logs"| L
    A -->|"Structured Logs"| L
    E -->|"Structured Logs"| L
    I -->|"Data Source"| G
    L -->|"Data Source"| G
```
Sources: src/tplr/metrics.py , src/tplr/wandb.py , src/tplr/logging.py
## Metrics Logging with InfluxDB

Templar uses a custom `MetricsLogger` class to collect and store metrics in InfluxDB. This system provides insights into training performance, system resource utilization, and gradient processing efficiency.
### Architecture

The metrics logging system is designed to be asynchronous and non-blocking to minimize impact on the training process. It uses a queue-based architecture to buffer metrics before sending them to InfluxDB.
```mermaid
sequenceDiagram
    participant Node as "Templar Node"
    participant Logger as "MetricsLogger"
    participant Queue as "Async Queue"
    participant Thread as "Background Thread"
    participant InfluxDB as "InfluxDB"
    Node->>Logger: log(measurement, tags, fields)
    Logger->>Logger: Process fields & add metadata
    Logger->>Queue: Enqueue Point
    Note over Queue,Thread: Non-blocking operation
    Thread->>Queue: Dequeue Point
    Thread->>InfluxDB: Write point (retry if needed)
```
Sources: src/tplr/metrics.py:182-226
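The shape of this pattern can be sketched in a few lines. The class below is illustrative only (the name `AsyncMetricsWriter` and its field handling are invented for this example); it assumes the `influxdb_client` package and shows the queue-plus-background-thread idea, not Templar’s actual `MetricsLogger`:

```python
import queue
import threading

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS


class AsyncMetricsWriter:
    """Queue-based writer: log() returns immediately; network I/O
    happens on a background thread (a sketch, not the real class)."""

    def __init__(self, url: str, token: str, org: str, bucket: str):
        self._client = InfluxDBClient(url=url, token=token, org=org)
        self._write_api = self._client.write_api(write_options=SYNCHRONOUS)
        self._bucket = bucket
        self._queue: "queue.Queue[Point]" = queue.Queue()
        # A daemon thread drains the queue for the lifetime of the process.
        threading.Thread(target=self._drain, daemon=True).start()

    def log(self, measurement: str, tags: dict, fields: dict) -> None:
        point = Point(measurement)
        for k, v in tags.items():
            point = point.tag(k, str(v))
        for k, v in fields.items():
            point = point.field(k, v)
        self._queue.put(point)  # enqueue only; never blocks on the network

    def _drain(self) -> None:
        while True:
            point = self._queue.get()
            try:
                self._write_api.write(bucket=self._bucket, record=point)
            except Exception as exc:
                # The real logger retries; this sketch just reports and moves on.
                print(f"metrics write failed: {exc}")
```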
### MetricsLogger Class

The `MetricsLogger` class provides a simple interface for logging metrics while handling the complexity of asynchronous communication with InfluxDB.
Key features:
- Automatic collection of system and GPU metrics
- Statistical processing of list values (mean, min, max, median)
- Tagging with runtime information (version, node role, etc.)
- Configurable batching and retry logic
```mermaid
classDiagram
    class MetricsLogger {
        +client: InfluxDBClient
        +write_api: WriteApi
        +prefix: str
        +uid: str
        +role: str
        +version: str
        +runtime_id: str
        +log(measurement, tags, fields)
        +process_value(value)
        -_add_system_metrics(fields)
        -_add_gpu_metrics(tags, fields)
        -_add_standard_tags(point)
        -_add_config_tags(point)
        -_write_point(point)
    }
```
Sources: src/tplr/metrics.py:82-321
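As an illustration of the statistical processing of list values, a helper along these lines would reduce a list field to summary statistics (a sketch; the real logic lives in `process_value` in src/tplr/metrics.py and may differ in detail):

```python
import statistics


def summarize(value):
    """Reduce a list of numbers to mean/min/max/median; pass scalars through."""
    if isinstance(value, (list, tuple)) and value:
        return {
            "mean": statistics.mean(value),
            "min": min(value),
            "max": max(value),
            "median": statistics.median(value),
        }
    return value


# e.g. per-batch losses gathered over a logging window:
print(summarize([0.91, 0.85, 0.88]))  # mean/median ≈ 0.88, min 0.85, max 0.91
print(summarize(0.75))                # scalars pass through unchanged
```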
### Configuration

The `MetricsLogger` can be configured through environment variables and constructor parameters:
| Parameter | Environment Variable | Default | Description |
|---|---|---|---|
| `host` | `INFLUXDB_HOST` | AWS InfluxDB endpoint | InfluxDB server hostname |
| `port` | `INFLUXDB_PORT` | `8086` | InfluxDB server port |
| `database` | `INFLUXDB_DATABASE` | `"tplr"` | InfluxDB bucket/database name |
| `token` | `INFLUXDB_TOKEN` | Fallback token | Authentication token |
| `org` | `INFLUXDB_ORG` | `"tplr"` | InfluxDB organization |
| `prefix` | - | `""` | Prefix for all metrics |
| `role` | - | `""` | Node role (miner, validator, etc.) |
Sources: src/tplr/metrics.py:45-59 , src/tplr/metrics.py:88-105
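A plausible resolution order is: explicit constructor argument first, then the environment variable, then the documented default. The helper below is hypothetical and only illustrates that assumed precedence:

```python
import os


def resolve(arg, env_var, default):
    # Hypothetical helper: an explicit argument wins, then the environment,
    # then the documented default from the table above.
    return arg if arg is not None else os.environ.get(env_var, default)


host = resolve(None, "INFLUXDB_HOST", "localhost")  # real default: AWS endpoint
port = int(resolve(None, "INFLUXDB_PORT", "8086"))
database = resolve(None, "INFLUXDB_DATABASE", "tplr")
org = resolve(None, "INFLUXDB_ORG", "tplr")
```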
### System and GPU Metrics

The metrics logging system automatically collects system and GPU metrics when requested:
System Metrics:
- CPU usage percentage
- Memory usage (used and total)
GPU Metrics:
- GPU memory allocated
- GPU memory cached/reserved
- GPU memory segments
- Total GPU memory used
```python
# Example usage
metrics_logger.log(
    measurement="training_step",
    tags={"uid": 42},
    fields={"loss": 0.75, "learning_rate": 0.001},
    with_system_metrics=True,
    with_gpu_metrics=True,
)
```
Sources: src/tplr/metrics.py:324-359
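A plausible shape for these collectors, assuming `psutil` for host metrics and `torch.cuda` for GPU memory (the field names here are illustrative, not the exact keys written by src/tplr/metrics.py):

```python
import psutil
import torch


def add_system_metrics(fields: dict) -> None:
    # CPU and memory figures, as listed above (psutil assumed).
    fields["sys_cpu_percent"] = psutil.cpu_percent()
    mem = psutil.virtual_memory()
    fields["sys_mem_used_mb"] = mem.used / 1024**2
    fields["sys_mem_total_mb"] = mem.total / 1024**2


def add_gpu_metrics(fields: dict) -> None:
    # Allocated vs. cached/reserved memory from the torch CUDA allocator.
    if torch.cuda.is_available():
        fields["gpu_mem_allocated_mb"] = torch.cuda.memory_allocated() / 1024**2
        fields["gpu_mem_reserved_mb"] = torch.cuda.memory_reserved() / 1024**2
```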
## Experiment Tracking with Weights & Biases

Templar integrates with Weights & Biases (WandB) for experiment tracking and visualization. This integration provides a way to track training progress, compare different runs, and visualize model performance.
### Integration Architecture

The WandB integration is designed to handle versioning and run resumption, ensuring that training runs are properly tracked even if they span multiple sessions or software versions.
```mermaid
flowchart TD
    subgraph "Templar Node"
        Init["initialize_wandb()"]
        Log["run.log()"]
    end
    subgraph "WandB"
        API["WandB API"]
        RunObj["Run Object"]
        Storage["Cloud Storage"]
    end
    Init -->|"Create/Resume Run"| API
    API -->|"Return Run Object"| RunObj
    Log -->|"Log metrics with version prefix"| RunObj
    RunObj -->|"Store metrics & artifacts"| Storage
```
Sources: src/tplr/wandb.py:20-125
### Version Tracking

A unique feature of Templar’s WandB integration is version tracking. Metrics are automatically prefixed with the current software version, allowing for easier comparison of performance across different software versions.
For example, a single loss/step pair is logged under both the version namespace and the `latest` namespace:

```
v0.1.0/loss
v0.1.0/step
latest/loss
latest/step
```
The system also maintains a version history in the run configuration, enabling tracking of when a run was executed with different versions of the software.
Sources: src/tplr/wandb.py:62-68 , src/tplr/wandb.py:104-116
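A minimal sketch of the prefixing scheme, based on the metric names shown above (`run` is a WandB run object; mirroring every metric under `latest/` is an assumption drawn from those names):

```python
version = "0.1.0"
metrics = {"loss": 0.75, "step": 100}

# Log each metric under its version namespace and mirror it under
# "latest/" so dashboards can pin either view.
prefixed = {f"v{version}/{k}": v for k, v in metrics.items()}
prefixed.update({f"latest/{k}": v for k, v in metrics.items()})
run.log(prefixed)  # -> v0.1.0/loss, v0.1.0/step, latest/loss, latest/step
```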
### Run Resumption

The integration supports run resumption by maintaining a record of run IDs and checking for existing runs when initializing WandB. This allows for seamless continuation of training across multiple sessions.
```python
# Run ID is stored in a file
run_id_file = os.path.join(wandb_dir, f"wandb_run_id_{run_prefix}{uid}.txt")
```
Sources: src/tplr/wandb.py:28-43 , src/tplr/wandb.py:121-123
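The resumption flow can be sketched with the standard `wandb.init` API and `resume="allow"` (the project name and directory below are placeholders, not Templar’s configuration):

```python
import os

import wandb

wandb_dir = "wandb"                 # placeholder location
run_prefix, uid = "miner_", "123"
run_id_file = os.path.join(wandb_dir, f"wandb_run_id_{run_prefix}{uid}.txt")

if os.path.exists(run_id_file):
    # A previous session left its run ID behind: resume that run.
    with open(run_id_file) as f:
        run = wandb.init(project="templar", id=f.read().strip(), resume="allow")
else:
    # First session: start a new run and persist its ID for next time.
    os.makedirs(wandb_dir, exist_ok=True)
    run = wandb.init(project="templar")
    with open(run_id_file, "w") as f:
        f.write(run.id)
```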
## Structured Logging with Loki

Templar uses a structured logging system with Loki integration for centralized log collection and analysis. This system provides operational visibility into the distributed training process.
### Architecture

The logging system is built on Python’s standard logging module with enhancements for structured logging and Loki integration. It uses a queue-based architecture for asynchronous log transmission.
```mermaid
flowchart TD
    subgraph "Templar Node"
        Log["logger.info()"]
        QH["QueueHandler"]
        Rich["RichHandler"]
    end
    subgraph "Background Processing"
        QL["QueueListener"]
        LH["LokiHandler"]
    end
    subgraph "Infrastructure"
        Loki["Loki Server"]
        Console["Console Output"]
    end
    Log -->|"Log Record"| QH
    Log -->|"Log Record"| Rich
    QH -->|"Enqueue"| QL
    QL -->|"Dequeue"| LH
    Rich -->|"Format & Display"| Console
    LH -->|"HTTP POST"| Loki
```
Sources: src/tplr/logging.py:149-287
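The same pipeline can be reproduced with the standard library alone; in this sketch a `StreamHandler` stands in for the Loki handler:

```python
import logging
import logging.handlers
import queue

log_queue: queue.Queue = queue.Queue(-1)  # unbounded

# The handler attached to the logger only enqueues records...
logger = logging.getLogger("templar-demo")
logger.setLevel(logging.INFO)
logger.addHandler(logging.handlers.QueueHandler(log_queue))

# ...while a listener forwards them to the real sink on its own thread.
listener = logging.handlers.QueueListener(log_queue, logging.StreamHandler())
listener.start()

logger.info("this record is emitted without blocking the caller")
listener.stop()  # drains the queue before shutdown
```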
### Context-based Logging

The logging system supports context-based logging, allowing additional metadata to be attached to log messages. This metadata can include information like trace IDs, batch identifiers, or custom metrics.
```python
# Example context-based logging
log_with_context(
    'info',
    'Processing batch',
    batch_size=32,
    batch_id='abc123',
)
```
Sources: src/tplr/logging.py:290-309
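One common way to implement such a helper is to pass the context through the `extra` parameter of the standard logging API, where a structured handler can serialize it as labels or JSON fields. This is a sketch of the idea, not the exact implementation in src/tplr/logging.py:

```python
import logging

logger = logging.getLogger("templar-demo")


def log_with_context(level: str, message: str, **context) -> None:
    # The context dict rides along on the LogRecord as `record.context`,
    # where a downstream handler can pick it up.
    getattr(logger, level)(message, extra={"context": context})


log_with_context("info", "Processing batch", batch_size=32, batch_id="abc123")
```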
### Log Filtering

The logging system includes custom filters to reduce noise and focus on relevant information. For example, it filters out common subtensor warnings that are not actionable.
```python
class NoSubtensorWarning(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # Return False if the record contains the undesired subtensor warning
        return (
            "Verify your local subtensor is running on port"
            not in record.getMessage()
        )
```
Sources: src/tplr/logging.py:91-96
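To take effect, the filter has to be registered on a logger (or on an individual handler):

```python
import logging

logger = logging.getLogger("templar")
logger.addFilter(NoSubtensorWarning())  # drops the noisy subtensor warning
```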
## Dashboards and Visualization

Templar uses Grafana for dashboard visualization of metrics stored in InfluxDB and logs stored in Loki. This provides a unified view of system performance and health.
### Grafana Configuration

Grafana is configured to connect to InfluxDB and Loki as data sources. The configuration is managed through Ansible for automated deployment.
```mermaid
flowchart TD
    subgraph "Data Sources"
        I["InfluxDB"]
        L["Loki"]
    end
    subgraph "Grafana"
        G["Grafana Server"]
        D1["Training Metrics Dashboard"]
        D2["System Metrics Dashboard"]
        D3["Log Analytics Dashboard"]
    end
    I -->|"Time Series Data"| G
    L -->|"Log Data"| G
    G --- D1
    G --- D2
    G --- D3
```
Sources: telemetry/ansible/host_vars/grafana_prod.yml
### Available Metrics

The monitoring system collects a wide range of metrics that can be visualized in Grafana:
| Category | Metrics |
|---|---|
| Training | Loss, learning rate, gradient norms |
| System | CPU usage, memory usage |
| GPU | Memory allocated, memory cached |
| Network | Active peers, sync status |
| Performance | Gradient compression ratio, training throughput |
Sources: src/tplr/metrics.py:182-226 , tests/test_metrics_logger.py:315-331
## Integration in the Templar Ecosystem

The monitoring and telemetry system is integrated throughout the Templar ecosystem, providing visibility into all components of the distributed training process.
```mermaid
flowchart TD
    subgraph "Templar Framework"
        M["Miner"]
        V["Validator"]
        A["Aggregator"]
        C["Comms"]
        E["Evaluator"]
    end
    subgraph "Metrics Collection"
        ML["MetricsLogger"]
        W["WandB Integration"]
        Log["Structured Logger"]
    end
    subgraph "Infrastructure"
        I["InfluxDB"]
        WB["Weights & Biases"]
        L["Loki"]
        G["Grafana"]
    end
    M -->|"Training Metrics"| ML
    V -->|"Validation Metrics"| ML
    A -->|"Aggregation Metrics"| ML
    C -->|"Communication Metrics"| ML
    E -->|"Evaluation Metrics"| ML
    M -->|"Experiment Data"| W
    V -->|"Experiment Data"| W
    A -->|"Experiment Data"| W
    E -->|"Experiment Data"| W
    M -->|"Operational Logs"| Log
    V -->|"Operational Logs"| Log
    A -->|"Operational Logs"| Log
    C -->|"Operational Logs"| Log
    E -->|"Operational Logs"| Log
    ML -->|"Time Series Data"| I
    W -->|"Run Data"| WB
    Log -->|"Structured Logs"| L
    I -->|"Data Source"| G
    L -->|"Data Source"| G
```
Sources: src/tplr/metrics.py , src/tplr/wandb.py , src/tplr/logging.py
## Usage Examples

### Logging Training Metrics

```python
# Initialize MetricsLogger
metrics_logger = MetricsLogger(
    prefix="miner",
    uid="123",
    role="miner",
    group="finney",
    job_type="training",
)

# Log training metrics
metrics_logger.log(
    measurement="training_step",
    tags={"batch_id": batch_id},
    fields={
        "loss": loss.item(),
        "learning_rate": current_lr,
        "gradient_norm": gradient_norm,
    },
    with_system_metrics=True,
    with_gpu_metrics=True,
)
```
Sources: src/tplr/metrics.py:88-105 , src/tplr/metrics.py:182-226
### Initializing W&B for Experiment Tracking

```python
# Initialize W&B
run = initialize_wandb(
    run_prefix="miner_",
    uid="123",
    config=config,
    group="finney",
    job_type="training",
)

# Log metrics to W&B
run.log({
    "loss": loss.item(),
    "learning_rate": current_lr,
    "gradient_norm": gradient_norm,
})
```
Sources: src/tplr/wandb.py:20-125
### Setting Up Structured Logging

```python
# Setup Loki logger
logger = setup_loki_logger(
    service="miner",
    uid="123",
    version="0.1.0",
    environment="finney",
)

# Log with context
logger.log_with_context(
    "info",
    "Completed training step",
    step=current_step,
    loss=loss.item(),
    duration=step_duration,
)
```
Sources: src/tplr/logging.py:149-287 , src/tplr/logging.py:290-309
## Troubleshooting

If metrics are not appearing in Grafana dashboards:

- Check that the `INFLUXDB_TOKEN` environment variable is set correctly
- Verify that the InfluxDB host is reachable from the Templar nodes (see the probe sketch below)
- Check the console logs for any errors related to metrics logging
- Verify that Grafana is properly configured to connect to InfluxDB and Loki
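The reachability check can be scripted with a small probe. This is a generic sketch: it assumes the `/health` endpoint exposed by InfluxDB 2.x and the `requests` package, not any Templar tooling:

```python
import os

import requests

host = os.environ.get("INFLUXDB_HOST", "localhost")
port = os.environ.get("INFLUXDB_PORT", "8086")

# InfluxDB 2.x reports {"status": "pass"} from /health when the node is up.
resp = requests.get(f"http://{host}:{port}/health", timeout=5)
print(resp.status_code, resp.json().get("status"))
```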
If W&B integration is not working:

- Check for the existence of a `wandb_run_id_*.txt` file in the `wandb` directory
- Verify that the W&B API can be reached from the Templar nodes
- Check that the project name is correctly configured
Sources: src/tplr/metrics.py:254 , src/tplr/wandb.py:36-43