
How Discovery Works

The collector runs as a container on each Linux host that needs to be monitored and exposes a Prometheus-compatible endpoint (/metrics) from which the data is collected.
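
As a point of reference, here is a minimal sketch of what a Prometheus-compatible /metrics endpoint looks like, using the Python prometheus_client library. The metric name, label, and port below are assumptions chosen for illustration, not the collector's actual metric set.

```python
# Minimal sketch of a Prometheus-compatible /metrics endpoint.
from prometheus_client import start_http_server, Gauge
import random
import time

# Hypothetical gauge standing in for a discovered GPU metric.
gpu_utilization = Gauge(
    "example_gpu_utilization_percent",
    "Illustrative GPU utilization gauge",
    ["gpu_index"],
)

if __name__ == "__main__":
    # Serve /metrics on port 9400 (the port is an assumption for this example).
    start_http_server(9400)
    while True:
        # In the real collector this value would come from DCGM / nvidia-smi.
        gpu_utilization.labels(gpu_index="0").set(random.uniform(0, 100))
        time.sleep(15)
```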

The discovery of entities, relationships, and interconnectivity between components runs at collector start-up and then every thirty minutes thereafter. The discovered data is cached and returned on every poll of the collector to ensure accurate modeling on each cycle. Subsequent discoveries are performed in the background and do not block calls to /metrics.
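
The cache-and-refresh behavior described above can be sketched as a background loop that periodically refreshes a shared cache while scrape handling only reads from it. All function and variable names here are illustrative assumptions, not the collector's internals.

```python
# Sketch: periodic discovery refreshes a cache; /metrics polls never block on it.
import threading
import time

DISCOVERY_INTERVAL_SECONDS = 30 * 60  # discovery re-runs every thirty minutes

_cache_lock = threading.Lock()
_discovery_cache = {}  # last successful discovery result


def run_discovery():
    """Placeholder for entity/relationship discovery (e.g. via DCGM)."""
    return {"entities": [], "relationships": [], "discovered_at": time.time()}


def discovery_loop():
    global _discovery_cache
    while True:
        result = run_discovery()
        with _cache_lock:
            _discovery_cache = result
        time.sleep(DISCOVERY_INTERVAL_SECONDS)


def handle_metrics_poll():
    # Called on every /metrics scrape; only reads the cached data.
    with _cache_lock:
        return dict(_discovery_cache)


if __name__ == "__main__":
    threading.Thread(target=discovery_loop, daemon=True).start()
    time.sleep(1)  # give the first discovery pass a moment to complete
    print(handle_metrics_poll())
```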

To facilitate discovery, nv-hostengine must be running on the host machine and reachable from the running container. Using the DCGM API, the collector connects to nv-hostengine to scrape various metrics and topology data. Additionally, the nvidia-smi command is used to scrape performance and configuration data.
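
As an example of the nvidia-smi path, the sketch below shells out to nvidia-smi and parses its CSV output. The query fields shown are standard nvidia-smi options, but the exact fields the collector requests are an assumption, and the DCGM API calls against nv-hostengine are not shown here.

```python
# Sketch: pull per-GPU performance/configuration data via nvidia-smi CSV output.
import csv
import io
import subprocess


def _num(value):
    """Return a float, or None when nvidia-smi reports e.g. '[N/A]'."""
    try:
        return float(value)
    except ValueError:
        return None


def query_gpus():
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,name,utilization.gpu,memory.used,power.draw",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    gpus = []
    for row in csv.reader(io.StringIO(out)):
        index, name, util, mem_used, power = [field.strip() for field in row]
        gpus.append(
            {
                "index": int(index),
                "name": name,
                "utilization_pct": _num(util),
                "memory_used_mib": _num(mem_used),
                "power_draw_w": _num(power),
            }
        )
    return gpus
```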

The discovery of Programs is the only exception to this discovery logic. Program discovery runs continuously in the background: every 10 seconds the GPUs are scraped to determine which programs are running. Once a program is detected on a GPU, its process ID is cached and checked every 10 seconds to capture CPU, memory, GPU, and energy utilization. This data is then aggregated and reported to IO in 1-minute aggregations. To map a program back to a specific pod and container name / program name, the collector requires that the host's /proc directory be mounted into the container.
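
A hedged sketch of that program-discovery step: list the compute processes running on the GPUs with nvidia-smi, then read each PID's cgroup entry from the mounted host /proc, which is what makes pod/container mapping possible. The /host/proc mount path, the query fields, and the cgroup handling below are assumptions for illustration; actual pod and container resolution depends on the container runtime and cgroup layout.

```python
# Sketch: detect GPU processes and map them back through the host /proc mount.
import subprocess


def list_gpu_processes():
    """List compute processes currently running on the GPUs."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,process_name,used_memory",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    procs = []
    for line in out.splitlines():
        if not line.strip():
            continue
        pid, name, mem = [field.strip() for field in line.split(",")]
        # used_memory can come back as '[N/A]' in some configurations.
        mem_mib = float(mem) if mem.replace(".", "", 1).isdigit() else None
        procs.append({"pid": int(pid), "name": name, "gpu_memory_mib": mem_mib})
    return procs


def cgroup_entry(pid, proc_root="/host/proc"):
    """Read the PID's cgroup file from the mounted host /proc (path assumed)."""
    try:
        with open(f"{proc_root}/{pid}/cgroup") as fh:
            return fh.read().strip()
    except FileNotFoundError:
        return None  # the process exited between scrapes


if __name__ == "__main__":
    for proc in list_gpu_processes():
        # The cgroup entry typically embeds the pod UID / container ID,
        # which is what allows mapping back to a pod and container name.
        print(proc["pid"], proc["name"], cgroup_entry(proc["pid"]))
```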