Dashboards
The Templar framework includes a comprehensive set of monitoring dashboards built with Grafana that provide real-time visibility into the distributed training process. These dashboards allow operators to track validator performance, model evaluation metrics, and system logs across the network.
For information about metrics collection and logging implementation, see Metrics Logging and Experiment Tracking.
Dashboard Architecture
The Templar monitoring system uses Grafana dashboards powered by InfluxDB for time-series metrics and Loki for logs. These dashboards are deployed through Ansible playbooks and provide visualizations for different aspects of the Templar system.
```mermaid
graph TD
    subgraph "Data Collection"
        VM["Validator Metrics"]
        MM["Miner Metrics"]
        EM["Evaluator Metrics"]
        LG["System Logs"]
    end
    subgraph "Storage Layer"
        IDB[(InfluxDB)]
        LK[(Loki)]
    end
    subgraph "Visualization Layer"
        TMD["Templar Metrics Dashboard"]
        EMD["Evaluation Metrics Dashboard"]
        LD["Logs Dashboard"]
        DD["Development Dashboard"]
    end
    VM --> IDB
    MM --> IDB
    EM --> IDB
    LG --> LK
    IDB --> TMD
    IDB --> EMD
    IDB --> DD
    LK --> LD
```
Sources: telemetry/ansible/group_vars/all.yml.example
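To make the flow above concrete, the sketch below writes a single validator metric point to InfluxDB with the Python influxdb-client package. The bucket, org, measurement, tag, and field names are illustrative assumptions, not the exact identifiers Templar uses.

```python
# Minimal sketch: writing a validator metric point to InfluxDB so the Grafana
# dashboards can chart it. Bucket/org/measurement/field names are assumptions.
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="INFLUXDB_TOKEN", org="templar")
write_api = client.write_api(write_options=SYNCHRONOUS)

point = (
    Point("validator_window")      # hypothetical measurement name
    .tag("uid", "42")              # node UID, used by the dashboard filters
    .tag("version", "0.2.60")      # Templar version tag
    .field("loss", 2.31)
    .field("window_total_time", 187.4)
)
write_api.write(bucket="templar", record=point)
client.close()
```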
Available Dashboards
Templar Metrics Dashboard
The main metrics dashboard provides comprehensive visibility into validator performance and is defined in templar_metrics.json. It includes panels for:
- Validator loss tracking (by training window)
- Time-based performance metrics
- Processing time breakdowns
- Global state metrics
```mermaid
graph TD
    subgraph "Validator Window Metrics"
        VL["Loss Metrics"]
        VPT["Peer Update Time"]
        VMT["Model Update Time"]
        VWT["Window Total Time"]
    end
    subgraph "Performance Metrics"
        GT["Gather Time"]
        ET["Evaluation Time"]
        TS["Total Skipped"]
    end
    subgraph "State Metrics"
        GS["Global Step"]
        WN["Window Number"]
    end
    subgraph "Dashboard Panels"
        LP["Loss Panels"]
        TP["Timing Panels"]
        SP["State Panels"]
    end
    VL --> LP
    VPT --> TP
    VMT --> TP
    VWT --> TP
    GT --> TP
    ET --> TP
    TS --> TP
    GS --> SP
    WN --> SP
```
Sources: telemetry/ansible/roles/dashboards/files/templar_metrics.json:66-748
Key Features
The Templar Metrics Dashboard includes:
- Loss Tracking: Visualizes validator loss by window with both window-axis and time-axis views
- Performance Monitoring: Tracks key timing metrics:
- Peer update time
- Model update time
- Window total time
- Gather time
- Evaluation time
- State Monitoring: Shows global step and window number information
- Filtering: Supports filtering by UID and version
All metrics are tagged with version, UID, and configuration parameters, enabling precise filtering and correlation across different system components.
Sources: telemetry/ansible/roles/dashboards/files/templar_metrics.json:179-198
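Because every point carries these tags, a dashboard panel can narrow its query to one node and one release. The snippet below is a minimal sketch of such a filter, expressed as a Flux query issued through the Python client; the bucket, measurement, and field names are assumptions for illustration.

```python
# Sketch: fetching validator loss for one UID and version via a Flux query.
# Bucket, measurement, and field names are assumptions for illustration.
from influxdb_client import InfluxDBClient

flux = '''
from(bucket: "templar")
  |> range(start: -24h)
  |> filter(fn: (r) => r._measurement == "validator_window")
  |> filter(fn: (r) => r._field == "loss")
  |> filter(fn: (r) => r.uid == "42" and r.version == "0.2.60")
'''

with InfluxDBClient(url="http://localhost:8086", token="INFLUXDB_TOKEN", org="templar") as client:
    for table in client.query_api().query(flux):
        for record in table.records:
            print(record.get_time(), record.get_value())
```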
Evaluation Metrics Dashboard
The evaluation dashboard (eval-metrics.json) focuses on model benchmark performance, with trend charts for the following benchmark tasks:
- Hellaswag
- MMLU
- PIQA
- ARC Challenge
- ARC Easy
- OpenBookQA
- Winogrande
```mermaid
graph TD
    subgraph "Benchmark Tasks"
        HS["Hellaswag"]
        MM["MMLU"]
        PQ["PIQA"]
        AC["ARC Challenge"]
        AE["ARC Easy"]
        OB["OpenBookQA"]
        WG["Winogrande"]
    end
    subgraph "Visualization Types"
        TC["Trend Charts"]
        LS["Latest Scores"]
    end
    subgraph "Chart Components"
        SA["Step Axis"]
        TF["Timeline Filter"]
        VF["Version Filter"]
    end
    HS --> TC
    MM --> TC
    PQ --> TC
    AC --> TC
    AE --> TC
    OB --> TC
    WG --> TC
    TC --> LS
    SA --> TC
    TF --> TC
    VF --> TC
```
Sources: telemetry/ansible/roles/dashboards/files/eval-metrics.json:117-142, telemetry/ansible/roles/dashboards/files/eval-metrics.json:204-236
Key Features
- Step-based Tracking: Charts show model performance vs. training step
- Multi-version Comparison: Supports comparing multiple Templar versions
- Benchmark Aggregation: Displays all benchmark scores on a single panel
- Time Range Selection: Customizable time range for historical analysis
Sources: telemetry/ansible/roles/dashboards/files/eval-metrics.json:737-773
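The step-axis trend charts consume benchmark scores written to InfluxDB as tagged points. The sketch below shows what such a write could look like; the measurement name, field names, and the global_step tag are assumptions, not Templar's actual schema.

```python
# Sketch: recording benchmark scores so the step-axis trend charts can plot
# them. Measurement/field names and the "global_step" tag are assumptions.
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

scores = {"hellaswag": 0.41, "mmlu": 0.29, "piqa": 0.68, "winogrande": 0.55}

with InfluxDBClient(url="http://localhost:8086", token="INFLUXDB_TOKEN", org="templar") as client:
    write_api = client.write_api(write_options=SYNCHRONOUS)
    point = Point("benchmark_scores").tag("version", "0.2.60").tag("global_step", "12000")
    for task, score in scores.items():
        point = point.field(task, score)   # one field per benchmark task
    write_api.write(bucket="templar", record=point)
```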
Logs Dashboard
The Logs Dashboard (loki_logs.json) provides a centralized view of system logs from all Templar components:
```mermaid
graph TD
    subgraph "Log Sources"
        VM["Validator Logs"]
        MM["Miner Logs"]
        AG["Aggregator Logs"]
        EV["Evaluator Logs"]
    end
    subgraph "Loki Storage System"
        LK[(Loki)]
        FT["Fluentd"]
        R2["R2 Log Archival"]
    end
    subgraph "Dashboard Filters"
        SF["Service Filter"]
        VF["Version Filter"]
        UF["UID Filter"]
        LL["Log Level Filter"]
        TS["Text Search"]
    end
    VM --> LK
    MM --> LK
    AG --> LK
    EV --> LK
    LK <--> FT
    FT --> R2
    LK --> SF
    SF --> VF
    VF --> UF
    UF --> LL
    LL --> TS
```
Sources: telemetry/ansible/roles/dashboards/files/loki_logs.json:87-97, telemetry/ansible/roles/loki/templates/loki-config.yaml.j2
Key Features
- Service Filtering: Filter logs by service (validator, miner, etc.)
- UID Filtering: Focus on specific node UIDs
- Log Level Filtering: Filter by severity (error, warning, info, debug)
- Text Search: Full-text search capabilities
- Real-time Updates: Live streaming of new log entries
- Integration: Links to other dashboards for context
Sources: telemetry/ansible/roles/dashboards/files/loki_logs.json:94-97, telemetry/ansible/roles/dashboards/files/loki_logs.json:110-143
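Under the hood, these filters compose a LogQL query against Loki. The sketch below issues an equivalent query directly against Loki's HTTP API; the label names (service, uid) are assumptions for illustration.

```python
# Sketch: the kind of LogQL query the logs dashboard filters build up,
# issued directly against Loki's query_range API. Label names are assumptions.
import time
import requests

logql = '{service="validator", uid="42"} |= "error"'  # label selector + text search

resp = requests.get(
    "http://localhost:3100/loki/api/v1/query_range",
    params={
        "query": logql,
        "start": int((time.time() - 3600) * 1e9),  # last hour, nanosecond epoch
        "end": int(time.time() * 1e9),
        "limit": 100,
    },
    timeout=10,
)
resp.raise_for_status()
for stream in resp.json()["data"]["result"]:
    for ts, line in stream["values"]:
        print(ts, line)
```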
Development Dashboard
The Development Dashboard (templar_dev.json) provides similar metrics to the main dashboard but is configured specifically for development environments:
- Additional filtering parameters
- Development-specific metrics
- Customized for local testing
Sources: telemetry/ansible/roles/dashboards/files/templar_dev.json:192-197
Dashboard Deployment and Configuration
The dashboards are deployed and configured using Ansible playbooks within the telemetry subsystem:
```mermaid
graph TD
    subgraph "Configuration Files"
        TM["templar_metrics.json"]
        EM["eval-metrics.json"]
        LL["loki_logs.json"]
        TD["templar_dev.json"]
    end
    subgraph "Ansible Deployment"
        AP["Ansible Playbook"]
        DT["Dashboard Templates"]
        DS["Datasource Config"]
    end
    subgraph "Grafana"
        GF["Grafana Server"]
        PR["Provisioning"]
        VA["Variables & Templates"]
    end
    TM --> DT
    EM --> DT
    LL --> DT
    TD --> DT
    DT --> AP
    DS --> AP
    AP --> PR
    PR --> GF
    GF --> VA
```
Sources: telemetry/ansible/group_vars/all.yml.example:6-69
Configuration Parameters
Key configuration parameters for the dashboard deployment include:
| Parameter | Description | Default |
|---|---|---|
| grafana_version | Grafana version to install | 10.4.0 |
| grafana_http_port | HTTP port for Grafana | 3000 |
| grafana_auth_anonymous_enabled | Enable anonymous access | true |
| grafana_influxdb_host | InfluxDB host | localhost |
| grafana_influxdb_port | InfluxDB port | 8086 |
| loki_http_listen_port | Loki HTTP port | 3100 |
| loki_retention_period | Log retention period | 720h (30 days) |
Sources: telemetry/ansible/group_vars/all.yml.example:7-14, telemetry/ansible/group_vars/all.yml.example:95-101
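After the playbooks have run, one quick sanity check is Grafana's health endpoint on the configured HTTP port. The sketch below assumes the default port from the table above and a locally reachable instance.

```python
# Sketch: verifying a freshly provisioned Grafana instance via its /api/health
# endpoint, using the default grafana_http_port from the table above.
import requests

resp = requests.get("http://localhost:3000/api/health", timeout=5)
resp.raise_for_status()
print(resp.json())  # typically reports the database status and Grafana version
```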
Vault Integration
Sensitive configuration values such as API tokens and credentials are stored in an Ansible vault:
- InfluxDB tokens
- Grafana admin credentials
- R2 storage credentials
Sources: telemetry/ansible/group_vars/vault.yml.example
Using the Dashboards
Dashboard Variables
All dashboards use template variables for filtering:
- templar_version: Filter by specific Templar version
- uid: Filter by specific node UID
- config_netuid: Filter by network UID (for dev dashboard)
- log_level: Filter logs by severity level
Variables can be changed using the dropdown selectors at the top of each dashboard.
Sources: telemetry/ansible/roles/dashboards/files/loki_logs.json:111-205, telemetry/ansible/roles/dashboards/files/eval-metrics.json:784-813
Interpreting Key Metrics
Validator Performance
The most important validator metrics to monitor are:
- Loss Metrics: Track how loss changes across training windows
- Timing Metrics: Monitor for processing bottlenecks
- Skipped Updates: High skip counts may indicate problems
- Global Step: Shows training progress
Evaluation Benchmarks
Evaluation charts show model performance on standard benchmarks:
- Step-based Charts: Track improvement over training steps
- Latest Scores Panel: Quick overview of current performance
- Version Comparison: Compare different model versions
Log Analysis
The logs dashboard helps with troubleshooting:
- Error Investigation: Filter for error/critical logs
- Component Isolation: Filter by service name
- UID Correlation: Track issues on specific nodes
- Text Search: Find specific error messages or patterns
Best Practices
- Regular Monitoring: Check dashboards regularly during training
- Version Tagging: Always use consistent version tagging
- Custom Time Ranges: Adjust time ranges for relevant views
- Dashboard Links: Use links between dashboards for context
- Variable Combinations: Combine multiple filters for precise analysis
Sources: telemetry/ansible/roles/dashboards/files/templar_metrics.json:27-64, telemetry/ansible/roles/dashboards/files/loki_logs.json:28-40