Dashboards

The Templar framework includes a set of Grafana monitoring dashboards that provide real-time visibility into the distributed training process. These dashboards allow operators to track validator performance, model evaluation metrics, and system logs across the network.

For information about metrics collection and logging implementation, see Metrics Logging and Experiment Tracking.

The Templar monitoring system uses Grafana dashboards powered by InfluxDB for time-series metrics and Loki for logs. These dashboards are deployed through Ansible playbooks and provide visualizations for different aspects of the Templar system.

graph TD
    subgraph "Data Collection"
        VM["Validator Metrics"]
        MM["Miner Metrics"]
        EM["Evaluator Metrics"]
        LG["System Logs"]
    end

    subgraph "Storage Layer"
        IDB[(InfluxDB)]
        LK[(Loki)]
    end

    subgraph "Visualization Layer"
        TMD["Templar Metrics Dashboard"]
        EMD["Evaluation Metrics Dashboard"]
        LD["Logs Dashboard"]
        DD["Development Dashboard"]
    end

    VM --> IDB
    MM --> IDB
    EM --> IDB
    LG --> LK

    IDB --> TMD
    IDB --> EMD
    IDB --> DD
    LK --> LD

Sources: telemetry/ansible/group_vars/all.yml.example
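In practice the data flow above amounts to each component writing tagged points into InfluxDB (and log lines into Loki) that the dashboards then query. The actual writers live in the metrics logging code (see Metrics Logging and Experiment Tracking); the sketch below only illustrates the shape of such a write using the influxdb-client Python package, and the bucket, organization, measurement, and field names are placeholders rather than values taken from the Templar code.

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(
    url="http://localhost:8086",   # grafana_influxdb_host / grafana_influxdb_port defaults
    token="<influxdb-token>",      # kept in the Ansible vault
    org="templar",                 # hypothetical organization name
)
write_api = client.write_api(write_options=SYNCHRONOUS)

# Hypothetical measurement: one point per validator window, tagged so the
# dashboards can filter by version and UID.
point = (
    Point("validator_window")
    .tag("version", "0.2.65")
    .tag("uid", "42")
    .field("loss", 2.31)
    .field("window_total_time", 41.7)
)
write_api.write(bucket="templar", record=point)  # hypothetical bucket name
client.close()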

The main metrics dashboard provides comprehensive visibility into validator performance and is defined in templar_metrics.json. It includes panels for:

  • Validator loss tracking (by training window)
  • Time-based performance metrics
  • Processing time breakdowns
  • Global state metrics

graph TD
    subgraph "Validator Window Metrics"
        VL["Loss Metrics"]
        VPT["Peer Update Time"]
        VMT["Model Update Time"]
        VWT["Window Total Time"]
    end

    subgraph "Performance Metrics"
        GT["Gather Time"]
        ET["Evaluation Time"]
        TS["Total Skipped"]
    end

    subgraph "State Metrics"
        GS["Global Step"]
        WN["Window Number"]
    end

    subgraph "Dashboard Panels"
        LP["Loss Panels"]
        TP["Timing Panels"]
        SP["State Panels"]
    end

    VL --> LP
    VPT --> TP
    VMT --> TP
    VWT --> TP
    
    GT --> TP
    ET --> TP
    TS --> TP
    
    GS --> SP
    WN --> SP

Sources: telemetry/ansible/roles/dashboards/files/templar_metrics.json:66-748

The Templar Metrics Dashboard includes:

  1. Loss Tracking: Visualizes validator loss by window with both window-axis and time-axis views
  2. Performance Monitoring: Tracks key timing metrics:
    • Peer update time
    • Model update time
    • Window total time
    • Gather time
    • Evaluation time
  3. State Monitoring: Shows global step and window number information
  4. Filtering: Supports filtering by UID and version

All metrics are tagged with version, UID, and configuration parameters, enabling precise filtering and correlation across different system components.

Sources: telemetry/ansible/roles/dashboards/files/templar_metrics.json:179-198
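The filtering described above comes down to tag predicates in each panel's Flux query. Below is a rough sketch of a loss query executed with the influxdb-client Python package; the bucket, measurement, field, and tag names are assumptions rather than values read from templar_metrics.json, and in the dashboard the uid and version literals are supplied by the $uid and $templar_version template variables.

from influxdb_client import InfluxDBClient

# Assumed schema: a "validator_window" measurement with a "loss" field,
# tagged with uid, version, and the training window number.
flux = '''
from(bucket: "templar")
  |> range(start: -6h)
  |> filter(fn: (r) => r._measurement == "validator_window" and r._field == "loss")
  |> filter(fn: (r) => r.uid == "42" and r.version == "0.2.65")
'''

with InfluxDBClient(url="http://localhost:8086", token="<token>", org="templar") as client:
    for table in client.query_api().query(flux):
        for rec in table.records:
            print(rec.values.get("window"), rec.get_time(), rec.get_value())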

The evaluation dashboard (eval-metrics.json) focuses on model benchmark performance with trend charts for various benchmark tasks:

  • Hellaswag
  • MMLU
  • PIQA
  • ARC Challenge
  • ARC Easy
  • OpenBookQA
  • Winogrande

graph TD
    subgraph "Benchmark Tasks"
        HS["Hellaswag"]
        MM["MMLU"]
        PQ["PIQA"]
        AC["ARC Challenge"]
        AE["ARC Easy"]
        OB["OpenBookQA"]
        WG["Winogrande"]
    end

    subgraph "Visualization Types"
        TC["Trend Charts"]
        LS["Latest Scores"]
    end

    subgraph "Chart Components"
        SA["Step Axis"]
        TF["Timeline Filter"]
        VF["Version Filter"]
    end

    HS --> TC
    MM --> TC
    PQ --> TC
    AC --> TC
    AE --> TC
    OB --> TC
    WG --> TC
    
    TC --> LS
    
    SA --> TC
    TF --> TC
    VF --> TC

Sources: telemetry/ansible/roles/dashboards/files/eval-metrics.json:117-142, telemetry/ansible/roles/dashboards/files/eval-metrics.json:204-236

Key features of the evaluation dashboard include:

  1. Step-based Tracking: Charts show model performance vs. training step
  2. Multi-version Comparison: Supports comparing multiple Templar versions
  3. Benchmark Aggregation: Displays all benchmark scores on a single panel
  4. Time Range Selection: Customizable time range for historical analysis

Sources: telemetry/ansible/roles/dashboards/files/eval-metrics.json:737-773
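The same pattern applies to the benchmark panels. The sketch below pulls one task's scores grouped by version, mirroring the multi-version comparison described above; the measurement, field, and tag names are assumptions, not values read from eval-metrics.json.

from influxdb_client import InfluxDBClient

# Assumed schema: a "benchmark_metrics" measurement with one field per task
# (e.g. "hellaswag"), tagged with the Templar version and global step.
flux = '''
from(bucket: "templar")
  |> range(start: -30d)
  |> filter(fn: (r) => r._measurement == "benchmark_metrics" and r._field == "hellaswag")
  |> group(columns: ["version"])
'''

with InfluxDBClient(url="http://localhost:8086", token="<token>", org="templar") as client:
    for table in client.query_api().query(flux):
        for rec in table.records:
            print(rec.values.get("version"), rec.values.get("global_step"), rec.get_value())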

The Logs Dashboard (loki_logs.json) provides a centralized view of system logs from all Templar components:

graph TD
    subgraph "Log Sources"
        VM["Validator Logs"]
        MM["Miner Logs"]
        AG["Aggregator Logs"]
        EV["Evaluator Logs"]
    end

    subgraph "Loki Storage System"
        LK[(Loki)]
        FT["Fluentd"]
        R2["R2 Log Archival"]
    end

    subgraph "Dashboard Filters"
        SF["Service Filter"]
        VF["Version Filter"]
        UF["UID Filter"]
        LL["Log Level Filter"]
        TS["Text Search"]
    end

    VM --> LK
    MM --> LK
    AG --> LK
    EV --> LK
    
    LK <--> FT
    FT --> R2
    
    LK --> SF
    SF --> VF
    VF --> UF
    UF --> LL
    LL --> TS

Sources: telemetry/ansible/roles/dashboards/files/loki_logs.json:87-97, telemetry/ansible/roles/loki/templates/loki-config.yaml.j2

Key features of the Logs Dashboard include:

  1. Service Filtering: Filter logs by service (validator, miner, etc.)
  2. UID Filtering: Focus on specific node UIDs
  3. Log Level Filtering: Filter by severity (error, warning, info, debug)
  4. Text Search: Full-text search capabilities
  5. Real-time Updates: Live streaming of new log entries
  6. Integration: Links to other dashboards for context

Sources: telemetry/ansible/roles/dashboards/files/loki_logs.json:94-97, telemetry/ansible/roles/dashboards/files/loki_logs.json:110-143
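Under the hood these filters become a LogQL stream selector plus line filters. The sketch below issues an equivalent query directly against Loki's HTTP query_range API with the requests package; the endpoint and port match the defaults listed later, but the label names (service, uid, level) are assumptions about how the log streams are labelled.

import time

import requests

# Stream selector with a full-text line filter, roughly what the dashboard
# builds from the Service/UID/Level dropdowns and the search box.
logql = '{service="validator", uid="42", level="ERROR"} |= "checkpoint"'

resp = requests.get(
    "http://localhost:3100/loki/api/v1/query_range",  # loki_http_listen_port default
    params={
        "query": logql,
        "start": int((time.time() - 3600) * 1e9),  # last hour, nanosecond timestamps
        "end": int(time.time() * 1e9),
        "limit": 100,
    },
    timeout=10,
)
resp.raise_for_status()
for stream in resp.json()["data"]["result"]:
    for ts, line in stream["values"]:
        print(ts, line)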

The Development Dashboard (templar_dev.json) provides metrics similar to those on the main dashboard, but is configured specifically for development environments:

  • Additional filtering parameters
  • Development-specific metrics
  • Customized for local testing

Sources: telemetry/ansible/roles/dashboards/files/templar_dev.json:192-197

The dashboards are deployed and configured using Ansible playbooks within the telemetry subsystem:

graph TD
    subgraph "Configuration Files"
        TM["templar_metrics.json"]
        EM["eval-metrics.json"]
        LL["loki_logs.json"]
        TD["templar_dev.json"]
    end

    subgraph "Ansible Deployment"
        AP["Ansible Playbook"]
        DT["Dashboard Templates"]
        DS["Datasource Config"]
    end

    subgraph "Grafana"
        GF["Grafana Server"]
        PR["Provisioning"]
        VA["Variables & Templates"]
    end

    TM --> DT
    EM --> DT
    LL --> DT
    TD --> DT
    
    DT --> AP
    DS --> AP
    
    AP --> PR
    PR --> GF
    GF --> VA

Sources: telemetry/ansible/group_vars/all.yml.example:6-69
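The playbook provisions these JSON files to Grafana (the Provisioning step above). As a hand-run alternative that shows what provisioning ultimately accomplishes, the same files can be imported through Grafana's HTTP dashboard API; the URL and credentials below are placeholders, with the admin password normally coming from the Ansible vault.

import json

import requests

with open("telemetry/ansible/roles/dashboards/files/templar_metrics.json") as f:
    dashboard = json.load(f)
dashboard["id"] = None  # let Grafana assign an id on import

resp = requests.post(
    "http://localhost:3000/api/dashboards/db",      # grafana_http_port default
    json={"dashboard": dashboard, "overwrite": True},
    auth=("admin", "<grafana-admin-password>"),     # placeholder credentials
    timeout=10,
)
resp.raise_for_status()
print(resp.json().get("url"))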

Key configuration parameters for the dashboard deployment include:

| Parameter | Description | Default |
|-----------|-------------|---------|
| grafana_version | Grafana version to install | 10.4.0 |
| grafana_http_port | HTTP port for Grafana | 3000 |
| grafana_auth_anonymous_enabled | Enable anonymous access | true |
| grafana_influxdb_host | InfluxDB host | localhost |
| grafana_influxdb_port | InfluxDB port | 8086 |
| loki_http_listen_port | Loki HTTP port | 3100 |
| loki_retention_period | Log retention period | 720h (30 days) |

Sources: telemetry/ansible/group_vars/all.yml.example:7-14, telemetry/ansible/group_vars/all.yml.example:95-101

Sensitive configuration values such as API tokens and credentials are stored in an Ansible vault:

  • InfluxDB tokens
  • Grafana admin credentials
  • R2 storage credentials

Sources: telemetry/ansible/group_vars/vault.yml.example

All dashboards use template variables for filtering:

  1. templar_version: Filter by specific Templar version
  2. uid: Filter by specific node UID
  3. config_netuid: Filter by network UID (for dev dashboard)
  4. log_level: Filter logs by severity level

Variables can be changed using the dropdown selectors at the top of each dashboard.

Sources: telemetry/ansible/roles/dashboards/files/loki_logs.json:111-205, telemetry/ansible/roles/dashboards/files/eval-metrics.json:784-813
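Grafana substitutes the selected values into each panel query before sending it to InfluxDB or Loki. The sketch below only illustrates that substitution step, using Python's string.Template, whose $name placeholders happen to match Grafana's variable syntax; the label names are assumptions about how the log streams are labelled.

from string import Template

# A Loki-style stream selector written with Grafana-style variable placeholders.
selector = Template('{service="$service", uid="$uid", level="$log_level"}')

# Roughly what the dashboard sends once the dropdowns are set:
print(selector.substitute(service="validator", uid="42", log_level="ERROR"))
# -> {service="validator", uid="42", level="ERROR"}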

The most important validator metrics to monitor are:

  1. Loss Metrics: Track how loss changes across training windows
  2. Timing Metrics: Monitor for processing bottlenecks (a query sketch follows this list)
  3. Skipped Updates: High skip counts may indicate problems
  4. Global Step: Shows training progress
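For the timing metrics in particular, a simple aggregation makes bottlenecks easier to spot. The sketch below averages each timing field per version over the last day, with the usual caveat that the measurement and field names are assumptions about the stored schema.

from influxdb_client import InfluxDBClient

# Mean of each timing field over the last day, grouped by version; an
# unusually large mean points at the slow stage of the window.
flux = '''
from(bucket: "templar")
  |> range(start: -1d)
  |> filter(fn: (r) => r._measurement == "validator_window")
  |> filter(fn: (r) => r._field == "gather_time" or r._field == "evaluation_time" or r._field == "window_total_time")
  |> group(columns: ["version", "_field"])
  |> mean()
'''

with InfluxDBClient(url="http://localhost:8086", token="<token>", org="templar") as client:
    for table in client.query_api().query(flux):
        for rec in table.records:
            print(rec.values["version"], rec.values["_field"], round(rec.get_value(), 2))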

Evaluation charts show model performance on standard benchmarks:

  1. Step-based Charts: Track improvement over training steps
  2. Latest Scores Panel: Quick overview of current performance
  3. Version Comparison: Compare different model versions

The logs dashboard helps with troubleshooting:

  1. Error Investigation: Filter for error/critical logs
  2. Component Isolation: Filter by service name
  3. UID Correlation: Track issues on specific nodes
  4. Text Search: Find specific error messages or patterns

General best practices for working with the dashboards:

  1. Regular Monitoring: Check dashboards regularly during training
  2. Version Tagging: Always use consistent version tagging
  3. Custom Time Ranges: Adjust time ranges for relevant views
  4. Dashboard Links: Use links between dashboards for context
  5. Variable Combinations: Combine multiple filters for precise analysis

Sources: telemetry/ansible/roles/dashboards/files/templar_metrics.json:27-64, telemetry/ansible/roles/dashboards/files/loki_logs.json:28-40