AIFO Collector Deployment
The AI Factory Observability (AIFO) Collector is a component within the Virtana platform designed to collect GPU metrics from NVIDIA hardware in Kubernetes environments. It works by connecting to the NV-HostEngine running on target GPU hosts and exposing the collected metrics for the Virtana AIFO Translator to scrape.
Once scraped, these real-time metrics feed the Virtana platform's monitoring pipeline. This lets you monitor GPU utilization and performance across your Kubernetes clusters, identify bottlenecks or underutilized resources in AI/ML training and inference workloads, and integrate GPU observability into your broader Virtana infrastructure monitoring.
This guide covers installing and upgrading the Virtana AIFO Collector in a Kubernetes environment.
Prerequisites
Before deploying, add the Virtana Helm repository to your local Helm configuration and verify that the latest chart version is available.
Enter the following command to add the Virtana Helm repository:
helm repo add virtana-repo https://virtana.gitlab.io/helm-charts
Enter the following command to check the latest available version:
helm search repo virtana-repo/virtana-io-south
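If you script the deployment, the version can be read from the search output instead of being copied by hand. The following is a minimal sketch, assuming the default tabular output of helm search repo (chart name in the first column, chart version in the second). The helper name latest_chart_version is hypothetical and not part of Helm.

```shell
# Hypothetical helper: print the chart version of virtana-repo/virtana-io-south
# from `helm search repo` tabular output supplied on stdin.
latest_chart_version() {
  # Match the exact chart name in column 1 and print column 2 (CHART VERSION).
  awk -v chart="virtana-repo/virtana-io-south" '$1 == chart { print $2; exit }'
}

# Typical usage (requires helm and network access to the repository):
#   helm repo update    # refresh the local index so new chart versions are visible
#   LATEST_VERSION="$(helm search repo virtana-repo/virtana-io-south | latest_chart_version)"
#   echo "Latest chart version: ${LATEST_VERSION}"
```

The captured value can then be passed to --version in the deployment commands that follow.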
Deployment
Choose the deployment path that matches your environment. Use the default path if the standard ports work for your setup, or the custom path if you need to override them.
Use the default path when NV-HostEngine listens on port 5555 and you want the collector to expose metrics on port 9400. No configuration file is needed.
helm upgrade --install virtana-io-south virtana-repo/virtana-io-south \
  --namespace virtana-io-south \
  --create-namespace \
  --version <LATEST_VERSION>
Helm will create the virtana-io-south namespace automatically and deploy the collector with all default settings. The command also handles future upgrades. Re-run it with a newer --version value whenever you want to upgrade.
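After the install or an upgrade, it can help to confirm that the release is recorded as deployed and that the collector pods are running. The sketch below uses only standard helm and kubectl commands. The function name verify_collector is hypothetical, and the release and namespace defaults are taken from the install command above.

```shell
# Sketch: confirm the Helm release exists and list the pods in its namespace.
verify_collector() {
  local namespace="${1:-virtana-io-south}"
  local release="${2:-virtana-io-south}"

  # Show the release status recorded by Helm; stop if the release is missing.
  helm status "$release" --namespace "$namespace" || return 1

  # List the pods in the namespace, including the node each pod runs on.
  kubectl get pods --namespace "$namespace" -o wide
}

# Usage:
#   verify_collector                      # default namespace and release
#   verify_collector my-namespace my-rel  # custom names (illustrative)
```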
Use the custom path if you need to change the default ports. Create a vio_south_values_override.yaml file in your working directory with the content below, then update the values to match your environment.
virtana-aidc-nvidia-gw:
  ports:
    host_port: 9400
  env:
    NVHENGINE_PORT: "5555"
    DCGM_TOPOLOGY_ENABLED: "true"

| Field | Description |
|---|---|
| host_port | Port exposed on the GPU host node. The AIFO translator must be able to reach this port to scrape metrics. |
| NVHENGINE_PORT | Port on which the NV-HostEngine listens. |
| DCGM_TOPOLOGY_ENABLED | Enables DCGM topology reporting. Leave as "true". |
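The override file can also be generated from a script, which is convenient when the ports differ per environment. This is a sketch only: the function name write_override_file is hypothetical, and the YAML nesting (ports and env under virtana-aidc-nvidia-gw) is an assumption based on the field list in this guide, so compare the result with the chart's own values before deploying.

```shell
# Sketch: write vio_south_values_override.yaml with the chosen ports.
# Arguments: [file] [host_port] [nv_hostengine_port]; defaults mirror this guide.
write_override_file() {
  local file="${1:-./vio_south_values_override.yaml}"
  local host_port="${2:-9400}"
  local nvh_port="${3:-5555}"

  # Assumed nesting: ports and env sit under the virtana-aidc-nvidia-gw key.
  cat > "$file" <<EOF
virtana-aidc-nvidia-gw:
  ports:
    host_port: ${host_port}
  env:
    NVHENGINE_PORT: "${nvh_port}"
    DCGM_TOPOLOGY_ENABLED: "true"
EOF
}

# Usage:
#   write_override_file                                        # defaults
#   write_override_file ./vio_south_values_override.yaml 9500 5556
```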
Use the following command to deploy with the override file:
helm upgrade --install virtana-io-south virtana-repo/virtana-io-south \
  --namespace virtana-io-south \
  --create-namespace \
  --version <LATEST_VERSION> \
  -f ./vio_south_values_override.yaml
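Because the AIFO translator must reach the host port to scrape metrics, a quick reachability check from a machine on the same network can catch firewall or port mismatches early. The sketch below is hedged: the function name check_metrics is hypothetical, and the /metrics path is an assumption (a common convention for scrape endpoints) that this guide does not confirm.

```shell
# Sketch: fetch the first lines of the collector's metrics endpoint.
# Arguments: <gpu_host> [port]; the port defaults to 9400 as in this guide.
check_metrics() {
  local host="$1"
  local port="${2:-9400}"
  local body

  if [ -z "$host" ]; then
    echo "usage: check_metrics <gpu_host> [port]" >&2
    return 2
  fi

  # -f fails on HTTP errors, -sS stays quiet but reports failures,
  # --max-time avoids hanging on a filtered port.
  # The /metrics path is an assumption, not confirmed by this guide.
  body="$(curl -fsS --max-time 5 "http://${host}:${port}/metrics")" || return 1

  # Print only the first few lines as a sanity check.
  printf '%s\n' "$body" | head -n 5
}

# Usage:
#   check_metrics <GPU_HOST>         # default port 9400
#   check_metrics <GPU_HOST> 9500    # custom host_port from the override file
```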