PIKA: Center-Wide and Job-Aware Cluster Monitoring

Discover the Performance of Your HPC System

  • Non-intrusive data acquisition
  • Automatic job analysis
  • Powerful interactive visualization
  • Long-term data storage
  • Up to 10k nodes with an open-source software stack

Live and Post-Mortem Visualization

PIKA Monitoring Infrastructure

 

To enable a comprehensive analysis on all common HPC systems, we identified a set of metrics that covers compute performance, I/O utilization, and network traffic. The metrics were selected for their availability on most systems and for their value in accurately analyzing job performance. They are collected on each compute node. A subsequent analysis of the collected data enables the characterization and tagging of jobs.
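As a rough illustration of what tagging a job from its collected metrics could look like, the following Python sketch assigns tags from per-job averages. The `JobSummary` structure, the tag names, and all thresholds are hypothetical and chosen only for illustration; they are not taken from PIKA.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobSummary:
    """Per-job averages over the whole runtime (hypothetical structure)."""
    ipc: float              # mean instructions per cycle
    cpu_usage: float        # mean CPU usage, 0.0 to 1.0
    mem_bw_fraction: float  # mean memory bandwidth relative to peak, 0.0 to 1.0


# Illustrative thresholds only (assumptions, not PIKA's values).
IDLE_CPU = 0.05
LOW_IPC = 0.5
HIGH_MEM_BW = 0.7


def tag_job(job: JobSummary) -> List[str]:
    """Return a list of tags describing the job's dominant behavior."""
    # A job that barely uses the CPU is tagged idle and not analyzed further.
    if job.cpu_usage < IDLE_CPU:
        return ["idle"]
    tags = []
    if job.mem_bw_fraction >= HIGH_MEM_BW:
        # Low IPC is expected when memory bandwidth is saturated.
        tags.append("memory-bound")
    elif job.ipc < LOW_IPC:
        tags.append("low-ipc")
    if not tags:
        tags.append("no-findings")
    return tags
```

A call such as `tag_job(JobSummary(ipc=0.3, cpu_usage=0.9, mem_bw_fraction=0.8))` yields `["memory-bound"]` under these assumed thresholds.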

The proposed job monitoring and analysis infrastructure is divided into four layers: collection, storage, analysis, and visualization. Data collection distinguishes between runtime metrics and job metadata. Runtime metrics are acquired directly on the compute nodes, so their collection must not noticeably perturb the running jobs; keeping the overhead low is therefore a critical requirement. Metadata can be gathered from multiple sources provided by the batch system. All collected job data is stored for post-mortem analysis and visualization.
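The split between runtime metrics and metadata can be sketched as follows: node-level samples carry no job information, and they become job-aware only when matched against the node list and time range reported by the batch system. This is a minimal Python sketch; the `Sample` and `JobMeta` types and the `samples_for_job` function are hypothetical names, and the matching assumes that nodes are allocated exclusively to one job at a time.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Sample:
    """One runtime measurement taken on a compute node (hypothetical type)."""
    node: str
    timestamp: int  # seconds since the epoch
    metric: str
    value: float


@dataclass(frozen=True)
class JobMeta:
    """Job metadata as it might be reported by the batch system (hypothetical type)."""
    job_id: str
    nodes: FrozenSet[str]
    start: int  # seconds since the epoch
    end: int    # seconds since the epoch


def samples_for_job(job: JobMeta, samples: Iterable[Sample]) -> List[Sample]:
    """Select the samples recorded on the job's nodes during its runtime."""
    return [
        s for s in samples
        if s.node in job.nodes and job.start <= s.timestamp <= job.end
    ]
```

Keeping the join outside the collector means the node-side component never has to query the batch system, which fits the requirement that collection stay lightweight.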

Monitored Metrics                   Data Source

Instructions per Cycle (IPC)        LIKWID
FLOPS (SP Normalized)               LIKWID
Main Memory Bandwidth               LIKWID
Power Consumption                   LIKWID

CPU Usage                           proc & sysfs filesystems
Main Memory Utilization             proc & sysfs filesystems
Network Bandwidth                   proc & sysfs filesystems
File I/O Bandwidth & Metadata       proc & sysfs filesystems

GPU Usage                           NVML
GPU Memory Utilization              NVML
GPU Power Consumption               NVML
GPU Temperature                     NVML
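To make the proc-based metrics more concrete, the following Python sketch derives CPU usage from two snapshots of the aggregate `cpu` line in `/proc/stat` on Linux. It is an illustration of the general technique, not PIKA's collector; the function names `parse_cpu_line` and `cpu_usage` are hypothetical, and counting `iowait` as idle time is an assumption of this sketch.

```python
from typing import Tuple


def parse_cpu_line(stat_text: str) -> Tuple[int, int]:
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # Fields: user nice system idle iowait irq softirq steal.
            # Guest time is already included in user time, so it is skipped.
            values = [int(v) for v in fields[1:9]]
            values += [0] * (8 - len(values))  # older kernels report fewer fields
            idle = values[3] + values[4]       # idle + iowait (assumption)
            total = sum(values)
            return total - idle, total
    raise ValueError("no aggregate 'cpu' line found")


def cpu_usage(prev: Tuple[int, int], cur: Tuple[int, int]) -> float:
    """Fraction of non-idle time between two (busy, total) snapshots."""
    busy = cur[0] - prev[0]
    total = cur[1] - prev[1]
    return busy / total if total > 0 else 0.0
```

In practice, the two snapshots would come from reading `/proc/stat` twice, separated by the sampling interval, and passing both file contents through `parse_cpu_line`.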

More Information

Please send us a message if you have any questions!

Contact

Frank Winkler

Publications

R. Dietrich, F. Winkler, A. Knüpfer and W. Nagel, “PIKA: Center-Wide and Job-Aware Cluster Monitoring,” 2020 IEEE International Conference on Cluster Computing (CLUSTER), Kobe, Japan, 2020, pp. 424-432. DOI: 10.1109/CLUSTER49012.2020.00061

This work was funded by the Deutsche Forschungsgemeinschaft (DFG) under Grant Number NA711/15-1 (ProPE).