AIFO Collector Deployment

The AI Factory Observability (AIFO) Collector is a component within the Virtana platform designed to collect GPU metrics from NVIDIA hardware in Kubernetes environments. It works by connecting to the NV-HostEngine running on target GPU hosts and exposing the collected metrics for the Virtana AIFO Translator to scrape.

It gathers real-time GPU metrics from NVIDIA hosts and feeds them into the Virtana platform's monitoring pipeline. This lets you monitor GPU utilization and performance across your Kubernetes clusters, identify bottlenecks or underutilized resources in AI/ML training and inference workloads, and integrate GPU observability into your broader Virtana infrastructure monitoring.

This guide covers installing and upgrading the Virtana AIFO Collector in a Kubernetes environment.

Prerequisites

Before deploying, add the Virtana Helm repository to your local Helm configuration and verify that the latest chart version is available.

Enter the following command to add the Virtana Helm repository:

helm repo add virtana-repo https://virtana.gitlab.io/helm-charts

Enter the following command to check the latest available version:

helm search repo virtana-repo/virtana-io-south
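If you want to capture the version in a script rather than read it from the table, the sketch below may help. It assumes the standard `helm search repo` table layout (a header row followed by one row per chart, with CHART VERSION in the second column); `chart_version_from_search` is a hypothetical helper name, not part of Helm or the Virtana chart.

```shell
# Hypothetical helper: print the CHART VERSION column of the first result
# row from `helm search repo` table output read on stdin.
# Assumes row 1 is the header and the chart name contains no spaces.
chart_version_from_search() {
  awk 'NR == 2 { print $2 }'
}

# Usage (requires helm and network access to the repository):
#   helm repo update
#   helm search repo virtana-repo/virtana-io-south | chart_version_from_search
```

Running `helm repo update` first refreshes the local index, so the search reflects the charts currently published in the repository.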

Deployment

Choose the deployment path that matches your environment. Use the default path if the standard ports work for your setup, or the custom path if you need to override them.

Deploy with default values

Use this option when NV-HostEngine listens on port 5555 and the collector can expose metrics on port 9400. No configuration file is needed.

helm upgrade --install virtana-io-south virtana-repo/virtana-io-south \
  --namespace virtana-io-south \
  --create-namespace \
  --version <LATEST_VERSION>

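Helm creates the virtana-io-south namespace if it does not already exist and deploys the collector with all default settings. Replace <LATEST_VERSION> with the chart version reported by the helm search command in Prerequisites. The same command handles future upgrades: re-run it with a newer --version value whenever you want to upgrade.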
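After the install completes, you can confirm that the release and its pods are present. The following is a minimal sketch that assumes kubectl is configured for the target cluster; `verify_collector` is a hypothetical wrapper name, and the release and namespace names are taken from the install command above.

```shell
# Hypothetical wrapper: show the Helm release status and list the pods
# in the collector namespace used by the install command.
verify_collector() {
  helm status virtana-io-south --namespace virtana-io-south
  kubectl get pods --namespace virtana-io-south
}

# Usage (requires helm and kubectl with access to the cluster):
#   verify_collector
```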

Deploy with custom port configuration

If you need to change the default ports, create a values override file named vio_south_values_override.yaml in your working directory with the following content, then update the values to match your environment.

virtana-aidc-nvidia-gw:
  ports:
    host_port: 9400        

  env:
    NVHENGINE_PORT: "5555"        
    DCGM_TOPOLOGY_ENABLED: "true"

Table 97.

Field	Description
`host_port`	Port exposed on the GPU host node. The AIFO Translator must be able to reach this port to scrape metrics.
`NVHENGINE_PORT`	Port on which `nv-hostengine` listens on the GPU host. Change this if you started NV-HostEngine on a non-default port.
`DCGM_TOPOLOGY_ENABLED`	Enables DCGM topology reporting. Leave as `true` unless advised otherwise.
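One way to create the override file from the command line is a shell heredoc. The sketch below writes the same content shown above with the documented default values; `write_override_file` is a hypothetical helper name, and you would still edit the values afterward to match your environment.

```shell
# Hypothetical helper: write the override file into the given directory
# (defaults to the current directory). Values are the documented defaults.
write_override_file() {
  target_dir="${1:-.}"
  cat > "${target_dir}/vio_south_values_override.yaml" <<'EOF'
virtana-aidc-nvidia-gw:
  ports:
    host_port: 9400

  env:
    NVHENGINE_PORT: "5555"
    DCGM_TOPOLOGY_ENABLED: "true"
EOF
}

# Usage: write_override_file .
```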

Use the following command to deploy with the override file:

helm upgrade --install virtana-io-south virtana-repo/virtana-io-south \
  --namespace virtana-io-south \
  --create-namespace \
  --version <LATEST_VERSION> \
  -f ./vio_south_values_override.yaml
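Because the AIFO Translator has to reach host_port on the GPU host, a quick TCP check from the Translator's network can help confirm that the port is open after deployment. This is an illustrative sketch only: `port_reachable` is a hypothetical helper, it relies on bash's /dev/tcp redirection and the coreutils `timeout` command, and <GPU_HOST> stands in for the address of your GPU node.

```shell
# Hypothetical helper: return 0 if a TCP connection to host:port succeeds
# within 3 seconds, non-zero otherwise.
# Assumes bash (for /dev/tcp) and the coreutils `timeout` command.
port_reachable() {
  local host="$1" port="$2"
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Usage (9400 is the default host_port; substitute your override value):
#   port_reachable <GPU_HOST> 9400 && echo "port open" || echo "port not reachable"
```

A successful connection only shows that something is listening on the port; it does not confirm that metrics are being served.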
