PIKA: Center-Wide and Job-Aware Cluster Monitoring
Discover the Performance of Your HPC System
- Non-intrusive data acquisition
- Automatic job analysis
- Powerful interactive visualization
- Long-term data storage
- Up to 10k nodes with open-source software stack
Live and Post-Mortem Visualization
PIKA Monitoring Infrastructure
To enable comprehensive analysis on common HPC systems, we identified a set of metrics that covers compute performance, I/O utilization, and network traffic. The metrics were selected for their availability on most systems and their value for accurately assessing job performance. They are collected on each compute node, and a subsequent analysis of the collected data enables the characterization and tagging of jobs.
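The characterization and tagging step can be illustrated with a minimal sketch. The thresholds and tag names below are hypothetical, chosen only to show the idea of deriving coarse tags from aggregated per-job metrics; they are not PIKA's actual classification rules.

```python
# Hypothetical job-tagging sketch: derive coarse performance tags from
# aggregated per-job metrics. Thresholds and tag names are illustrative.

def tag_job(metrics: dict) -> list:
    """Return a list of tags for a job given its mean runtime metrics."""
    tags = []
    # Low IPC often hints at stalls (memory, I/O, or synchronization).
    if metrics.get("ipc", 0.0) < 0.5:
        tags.append("low_ipc")
    # High sustained memory bandwidth suggests a memory-bound phase.
    if metrics.get("mem_bw_gbs", 0.0) > 80.0:
        tags.append("memory_bound")
    # A GPU that was allocated but barely used is worth flagging.
    if "gpu_util" in metrics and metrics["gpu_util"] < 0.05:
        tags.append("gpu_idle")
    return tags

job = {"ipc": 0.3, "mem_bw_gbs": 95.0, "gpu_util": 0.01}
print(tag_job(job))  # → ['low_ipc', 'memory_bound', 'gpu_idle']
```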
The proposed job monitoring and analysis infrastructure is divided into four layers: collection, storage, analysis, and visualization. Data collection distinguishes between runtime metrics and metadata. Runtime metrics are acquired directly on the compute nodes, which is a critical task because their collection must not noticeably perturb the running jobs. Metadata can be gathered from multiple sources provided by the batch system. All collected job data is stored for post-mortem analysis and visualization.
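The separation of runtime metrics (per node) and batch-system metadata (per job) can be sketched as a join on the job id before storage. The record layout and field names below are assumptions for illustration, not PIKA's storage schema.

```python
# Hypothetical sketch: join per-node runtime metrics with batch-system
# metadata by job id for post-mortem storage. Field names are illustrative.

runtime_metrics = [  # collected on the compute nodes
    {"job_id": 4711, "node": "node001", "ipc": 1.2, "mem_bw_gbs": 40.0},
    {"job_id": 4711, "node": "node002", "ipc": 1.1, "mem_bw_gbs": 38.5},
]

job_metadata = {  # gathered from the batch system's accounting
    4711: {"user": "alice", "partition": "haswell", "walltime_s": 3600},
}

def merge_for_storage(metrics, metadata):
    """Attach job metadata to each metric record for post-mortem analysis."""
    return [{**m, **metadata.get(m["job_id"], {})} for m in metrics]

records = merge_for_storage(runtime_metrics, job_metadata)
print(records[0]["user"], records[0]["node"])  # → alice node001
```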
Monitored Metrics              | Data Source
-------------------------------|--------------------------
Instructions per Cycle (IPC)   |
FLOPS (SP Normalized)          |
Main Memory Bandwidth          |
Main Memory Utilization        |
File I/O Bandwidth & Metadata  | proc & sysfs filesystems
GPU Memory Utilization         |
GPU Power Consumption          |
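As an example of the proc/sysfs data source listed above, a metric such as main memory utilization can be sampled with plain file reads, keeping the overhead on the compute node negligible. This is a generic Linux sketch, not PIKA's collector.

```python
# Sketch: derive main memory utilization from /proc/meminfo-style data (Linux).
# A plain file read per sample keeps the collection overhead negligible.

def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style 'Key: value kB' lines into a dict of kB."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info

def memory_utilization(meminfo: dict) -> float:
    """Fraction of main memory in use (1 - available/total)."""
    return 1.0 - meminfo["MemAvailable"] / meminfo["MemTotal"]

# Example input mimicking two lines of /proc/meminfo:
sample = "MemTotal:       65536000 kB\nMemAvailable:   49152000 kB"
print(round(memory_utilization(parse_meminfo(sample)), 2))  # → 0.25
```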
Please send us a message if you have any questions!
R. Dietrich, F. Winkler, A. Knüpfer and W. Nagel, “PIKA: Center-Wide and Job-Aware Cluster Monitoring,” 2020 IEEE International Conference on Cluster Computing (CLUSTER), Kobe, Japan, 2020, pp. 424-432. DOI: 10.1109/CLUSTER49012.2020.00061