
What is Virtana AIFO?

Virtana AI Factory Observability (AIFO) is a full-stack observability platform built to monitor your AI applications and every layer that supports them. Unlike traditional monitoring tools, Virtana AIFO answers questions about why AI systems behave as they do, not just whether they are running.

This guide introduces the three layers that make up Virtana AIFO and shows you how to deploy each one. If you’re new to Virtana AIFO, start with this chapter to understand what each layer covers, then go to the chapter for the layer you want to deploy.

You can use Virtana AIFO across one or more layers depending on your needs. Each layer can be observed independently or together as a complete stack. The more data sources you connect, the more comprehensive your observability coverage becomes.

Caution

For the most complete and accurate analysis, deploy all three layers. Connecting the full stack breaks down the data silos that limit traditional observability tools and lets Virtana AIFO correlate signals across layers.

[Figure: Virtana AIFO layer stack]

Virtana AIFO observes three independent layers: AI applications (Layer 1), orchestration (Layer 2), and infrastructure (Layer 3). Each layer can be deployed independently.

Virtana AIFO layers

Virtana AIFO organizes observability into three layers. The following table summarizes what each layer covers and the key questions it helps you answer.

Table 241. Virtana AIFO observability layers

| Layer | What it covers | Key questions it helps you answer |
| --- | --- | --- |
| Layer 1: AI applications | Traditional application signals, such as latency, errors, and response time. Agentic signals for applications that use AI agents, including call chains, token usage, model API behavior, and cost per request. | Why is this response slow? Which agent failed? What did this request cost? |
| Layer 2: AI orchestration | Container health; workload placement; replica status; scheduling queues | Is the container restarting? Is the host under memory pressure? Is the job queued? |
| Layer 3: AI infrastructure | GPU health and performance; network interconnects; storage throughput | Am I using my GPUs efficiently? Are they throttled? Is my NVLink network configured correctly? Is network or storage congestion limiting training throughput? |



Note

You don’t need to deploy all three layers at once. Virtana AIFO is designed for incremental adoption. Start with the layer that addresses your most pressing observability gap, and add others as your needs grow.

Layer 1: AI applications

Layer 1 (AI applications) provides visibility into the behavior of AI applications and the AI agents they use to handle requests. This layer captures what happens inside your AI application: which agents were called, in what order, how long each took, and what each call cost.

Layer 1 covers both inferencing and training workloads at the application level. For infrastructure-level visibility, see Layer 3: AI infrastructure.

What Virtana AIFO monitors at Layer 1

At the application layer, Virtana AIFO captures traditional signals such as latency, response time, and errors, as well as agentic signals such as token consumption, cost, and time to first token. The following table lists some of the key signals Virtana AIFO observes.

| Signal | Description | Why it matters |
| --- | --- | --- |
| Trace spans per call | A record of every agent call made during a request, in order | Reveals the full call chain so you can identify exactly where time was spent |
| Latency per hop | Time taken by each individual agent call | Pinpoints slow agents within a multi-agent chain |
| Token consumption | Tokens sent and received at each model API call | Identifies agents sending oversized prompts that increase cost and parse time |
| Cost per call path | Total token cost across a request | Enables cost attribution per request, user, or workflow |
| Model API errors | Error rates returned by model API calls | Surfaces reliability issues with specific models or endpoints |
| Time to first token | Delay between a model API call and the first token in the response | Indicates model-side latency separate from network or agent overhead |
| Agent handoff latency | Time spent passing context between agents in a chain | Identifies bottlenecks at agent boundaries, not inside agents |
| Total request cost | End-to-end cost for a complete user-facing request | Enables financial tracking and anomaly detection at the request level |

Inferencing application monitoring

An inferencing application is any system that uses a trained AI model to produce an output in response to an input. Inferencing is the most common form of AI adoption: a business application delegates tasks to AI agents, which call models hosted externally as SaaS APIs or run internally on cloud or on-premises GPU infrastructure.

Inferencing requests tend to be short and frequent. A single user action can trigger dozens of agent calls within seconds. Virtana AIFO captures each call as a trace span and correlates them into a complete picture: which agents were called, in what order, how long each took, and what it cost.
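
To make the idea of correlating spans into a single picture concrete, the following is a minimal Python sketch. The `Span` record, its field names, and the per-token prices are hypothetical values chosen for illustration; they are not Virtana AIFO's data model or API, and real prices depend on the model you call.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """One agent call within a request (hypothetical shape, not a Virtana AIFO schema)."""
    agent: str
    start_ms: float
    end_ms: float
    tokens_in: int
    tokens_out: int


# Assumed example prices in USD per 1,000 tokens.
PRICE_IN_PER_1K = 0.5
PRICE_OUT_PER_1K = 1.5


def span_cost(span: Span) -> float:
    """Token cost of a single agent call."""
    return (span.tokens_in * PRICE_IN_PER_1K + span.tokens_out * PRICE_OUT_PER_1K) / 1000


def summarize_request(spans: list[Span]) -> dict:
    """Order the spans of one request and derive per-hop latency and total cost.

    Assumes the request contains at least one span.
    """
    ordered = sorted(spans, key=lambda s: s.start_ms)
    latencies = [(s.agent, s.end_ms - s.start_ms) for s in ordered]
    return {
        "call_chain": [s.agent for s in ordered],
        "latency_ms": latencies,
        "slowest_agent": max(latencies, key=lambda pair: pair[1])[0],
        "total_cost_usd": sum(span_cost(s) for s in ordered),
    }


# Spans can arrive out of order; the summary sorts them by start time.
spans = [
    Span("retriever", 40.0, 160.0, 200, 50),
    Span("planner", 0.0, 40.0, 1000, 200),
    Span("writer", 160.0, 660.0, 2000, 1000),
]
summary = summarize_request(spans)
```

In this made-up request, the summary lists the call chain in order, reports latency per hop, and identifies `writer` as the slowest agent, which mirrors the questions Layer 1 is meant to answer.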

[Figure: Virtana AIFO Layer 1 diagram]

Training application monitoring

A training application builds a specialized AI model by processing large volumes of data through an iterative, compute-intensive learning process. Virtana AIFO monitors training applications at Layer 1 to give you visibility into job execution and progress.

Virtana AIFO gives you visibility into the following areas for training applications:

  • What jobs are running and how long they are taking: Track the progress and duration of training runs in real time.

  • Whether jobs are completing successfully or failing: Detect failures early and understand the point at which a job stopped progressing.

Note

Layer 1 training monitoring covers job state and progress. For GPU utilization, NVIDIA NVLink health, and storage throughput during training, see Layer 3: AI infrastructure.

Layer 2: AI orchestration

Layer 2 (AI orchestration) provides visibility into where AI workloads are deployed and whether the container and scheduling infrastructure supporting them is healthy. At this layer, Virtana AIFO surfaces the same class of problems that platform teams encounter in traditional environments, and presents those problems in the context of the AI agents and training jobs they affect.

When a slowdown or failure is identified at Layer 1 (AI applications), Layer 2 helps you determine the root cause: is the issue in your application code, your orchestration and deployment configuration, or your underlying infrastructure? For hardware-level analysis, see Layer 3: AI infrastructure.

What Virtana AIFO monitors at Layer 2

The following table lists some of the key signals available at Layer 2 for both inferencing agents and training jobs.

| Signal | What Virtana AIFO shows | Common use case |
| --- | --- | --- |
| Workload placement | Which host, cluster, and availability zone each workload is running on | Confirm that a slow agent is not on an overloaded host |
| Replica count and health | How many replicas of each workload are running and whether they are healthy | Identify whether autoscaling has responded correctly to load |
| Dependency health | Status of services, volumes, config maps, and secrets each workload depends on | Find misconfigured dependencies causing intermittent failures |
| Resource utilization | CPU, memory, and network consumption at the container and host level | Correlate resource pressure with application latency from Layer 1 |
| Restart history | Restart count, failure reason, and timing for each container | Distinguish transient faults from systemic instability |
| Scheduling queue | Pending state, queue depth, and blocking conditions for training jobs | Identify delays that occur before a job starts running |

Traditional infrastructure problems still apply to AI workloads

AI workloads run on the same container and scheduling infrastructure as traditional workloads. The following are common Layer 2 conditions that affect AI workload performance.

  • Container restarts. An AI agent that appears slow may be restarting frequently due to an out-of-memory (OOM) condition or a liveness probe misconfiguration. Virtana AIFO surfaces restart counts and reasons alongside agent call chain data from Layer 1.

  • Host memory pressure. A host running near its memory limit may cause containers to be throttled or evicted. GPU workloads are particularly sensitive to host memory pressure if they use host-side memory for data staging.

  • Scheduling delays. Training jobs often wait in a scheduling queue before they can run. A job that appears to be failing may in fact be queued. Virtana AIFO surfaces queue depth and pending state separately from execution state.

  • Network congestion. Container-to-container and container-to-storage network congestion can increase agent call latency and reduce training data throughput. This is visible at Layer 2 before it becomes visible at Layer 1 or Layer 3.
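
The conditions listed above can be read as a small triage procedure. The following Python sketch illustrates one possible ordering of the checks. The `WorkloadStatus` record and the restart threshold are hypothetical and are not part of Virtana AIFO; the `Pending` phase and the `OOMKilled` reason follow Kubernetes naming.

```python
from dataclasses import dataclass


@dataclass
class WorkloadStatus:
    """Hypothetical snapshot of one workload; field names are illustrative only."""
    phase: str                # "Pending", "Running", or "Failed"
    restart_count: int
    last_restart_reason: str  # for example "OOMKilled" or "Error"; empty if unknown


# Assumed threshold for "restarting frequently"; tune it for your environment.
RESTART_THRESHOLD = 3


def triage(status: WorkloadStatus) -> str:
    """Return a first guess at the Layer 2 condition affecting a workload."""
    # A queued job has not started, so it must not be reported as failing.
    if status.phase == "Pending":
        return "queued: check scheduling queue depth and blocking conditions"
    if status.restart_count >= RESTART_THRESHOLD:
        if status.last_restart_reason == "OOMKilled":
            return "restart loop: out of memory, review memory limits"
        return f"restart loop: {status.last_restart_reason or 'unknown reason'}"
    if status.phase == "Failed":
        return "failed: inspect the job's exit status"
    return "no orchestration issue detected"
```

Checking the pending state first reflects the point made above: a job that appears to be failing may simply be waiting in the scheduling queue.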

Note

Layer 2 signals are available for workloads running on Kubernetes. For workloads running on bare-metal hosts or other orchestrators, contact your Virtana account team for supported configurations.

Layer 3: AI infrastructure

Layer 3 (AI infrastructure) provides visibility into the physical and virtual hardware on which AI workloads run. This is where GPU efficiency, storage throughput, and network performance are monitored, and where some of the most impactful and least visible performance bottlenecks are found.

Layer 3 can be deployed independently of Layers 1 and 2. If your priority is hardware observability, such as monitoring GPU health for a training cluster, you can use Layer 3 without agent instrumentation. When all three layers are deployed, Layer 3 signals are correlated with application and orchestration signals from Layers 1 and 2.

Note

Layer 3 covers both inferencing and training workloads. The signals described in this section apply to any AI workload that runs on GPU infrastructure.

GPU health monitoring

Virtana AIFO monitors GPU health continuously for every device in your environment. The following signals are collected for all GPU workloads, regardless of workload type.

Hardware errors

Because elevated error rates are early indicators of device degradation, Virtana AIFO tracks the following GPU hardware error types per device:

  • Error-correction code (ECC) memory errors: Both correctable and uncorrectable

  • Xid errors: NVIDIA driver-level error events that indicate GPU faults

  • NVLink errors: Communication faults between GPUs within a host

Warning

A GPU with uncorrectable ECC errors should be removed from service immediately. Training jobs running on a degraded GPU may produce incorrect model weights without surfacing an explicit job failure.
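
As an illustration of how the ECC guidance above might be turned into a rule, consider the following Python sketch. The function name and the `correctable_limit` default are assumptions made for this example; they are not vendor thresholds or Virtana AIFO settings.

```python
def classify_gpu_memory_health(correctable: int, uncorrectable: int,
                               correctable_limit: int = 100) -> str:
    """Classify a device from its ECC error counters.

    correctable_limit is an assumed example value, not a vendor threshold.
    """
    # Any uncorrectable error means results from this device cannot be trusted.
    if uncorrectable > 0:
        return "remove from service"
    # Correctable errors are repaired by ECC, but a high count suggests degradation.
    if correctable > correctable_limit:
        return "watch: possible memory degradation"
    return "healthy"
```

The uncorrectable check comes first because a single uncorrectable error outweighs any number of correctable ones.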

Throttling state

GPU throttling occurs when a device reduces its clock speed in response to a constraint. Virtana AIFO identifies the throttling state for each device and the reason, including thermal throttling, power throttling, and clock management throttling.

Identifying the reason is essential: a thermally throttled GPU requires a cooling or placement fix, while a power-throttled GPU may require a power delivery upgrade or workload redistribution.

Temperature

When a GPU exceeds its thermal threshold, the device throttles its clock speeds to prevent damage. This throttling reduces throughput and increases latency, often without producing an explicit error or alert in standard monitoring tools.

Virtana AIFO monitors temperature per device and flags devices that are approaching throttle thresholds. This lets you identify thermal issues, such as inadequate cooling, blocked airflow, or high-density rack configurations, before they cause a measurable performance impact.
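
A minimal Python sketch of flagging devices that are approaching a throttle threshold follows. The threshold, the margin, and the device names are assumed example values; actual throttle thresholds vary by GPU model, and this is not how Virtana AIFO necessarily implements the check.

```python
def devices_near_throttle(temps_c: dict[str, float],
                          throttle_threshold_c: float,
                          margin_c: float = 5.0) -> list[str]:
    """Return the IDs of devices whose temperature is within margin_c of the threshold."""
    return sorted(
        device for device, temp in temps_c.items()
        if temp >= throttle_threshold_c - margin_c
    )


# Hypothetical readings in degrees Celsius for three devices.
readings = {"gpu0": 71.0, "gpu1": 84.5, "gpu2": 79.0}
flagged = devices_near_throttle(readings, throttle_threshold_c=87.0)
```

With these sample readings only `gpu1` falls inside the 5 °C margin, so it is flagged before it reaches the assumed throttle point.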

Warning

A GPU that is thermally throttling may appear healthy in standard CPU and memory metrics. Temperature and throttling state monitoring in Virtana AIFO Layer 3 is required to detect this condition.

Memory usage

GPU memory (VRAM) is a finite resource that AI workloads consume at scale. Virtana AIFO tracks memory consumption per device and per workload. Memory pressure, where a workload approaches or exceeds available VRAM, can cause job eviction, out-of-memory errors, and forced model offloading. Virtana AIFO surfaces memory pressure before it causes failures, giving you time to resize workloads or redistribute load across devices.

Utilization

GPU utilization measures how much of each device’s compute capacity is actively being used. Virtana AIFO tracks utilization per device, per workload, and over configurable time windows.

Low utilization on a training job usually indicates that the GPU is not being supplied with enough work, which is a common symptom of a storage or data pipeline bottleneck. High utilization on an inferencing workload may indicate that the device is approaching saturation and that latency will increase under continued load.

Tip

Correlate GPU utilization with Layer 1 agent latency data to distinguish between model-side slowness (a GPU signal) and agent-side slowness (an application signal). This correlation requires both Layer 1 and Layer 3 to be deployed.

Clock speeds

Streaming multiprocessor (SM) and memory clock speeds determine GPU compute throughput. Virtana AIFO monitors clock speeds per device and flags clock degradation, where clocks drop below expected frequencies under load without an explicit throttle event. Clock degradation under load is a sign of hardware stress, driver misconfiguration, or an underpowered device.
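
The definition of clock degradation above combines three conditions: the device is under load, its clock is below the expected frequency, and no throttle event is reported. The following Python sketch expresses that combination. The `ClockSample` record, the load cutoff, and the tolerance are hypothetical values for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ClockSample:
    """Hypothetical per-device sample; not a Virtana AIFO schema."""
    utilization_pct: float
    sm_clock_mhz: float
    throttled: bool


def is_clock_degraded(sample: ClockSample, expected_sm_clock_mhz: float,
                      load_pct: float = 80.0, tolerance: float = 0.10) -> bool:
    """True when the SM clock is low under load and no throttle event explains it.

    load_pct and tolerance are assumed example values.
    """
    under_load = sample.utilization_pct >= load_pct
    below_expected = sample.sm_clock_mhz < expected_sm_clock_mhz * (1 - tolerance)
    return under_load and below_expected and not sample.throttled
```

A sample that is below the expected clock but also reports a throttle event is deliberately excluded, because that case is already explained by the throttling state.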

GPU selection and workload fit

Virtana AIFO helps you verify that the right GPU type is assigned to each workload. Inferencing and training workloads have different GPU requirements, and using the wrong type introduces inefficiency that becomes visible when GPU utilization is correlated with job performance.

Table 242. GPU priorities by workload type

| Workload type | Typical characteristics | GPU priority |
| --- | --- | --- |
| Inferencing | Short duration, high frequency, low batch size | Low latency, predictable throughput, fast memory bandwidth |
| Training | Long duration, high throughput, large batch size | Sustained compute utilization, efficient data supply, high memory capacity |



When GPU utilization is low on a training job, the GPU is unlikely to be the bottleneck; investigate storage throughput or data pipeline configuration first. When utilization is consistently high on an inferencing deployment, the cluster may be undersized for the request volume.

GPU interconnect monitoring

For multi-GPU and multi-node training jobs, the connections between GPUs are as important as the GPUs themselves. Virtana AIFO monitors both intra-host connectivity (NVLink) and inter-host connectivity (the training network).

Training network: Inter-host GPU connectivity

For multi-node training jobs, GPUs on different hosts communicate through a dedicated high-speed training network, typically InfiniBand or RDMA over Converged Ethernet (RoCE). Network congestion or degraded links on the training network directly reduce gradient synchronization throughput and slow training.

Virtana AIFO monitors inter-host network throughput, latency, and congestion events on the training network.

Warning

Training network issues are frequently invisible to application-level monitoring. A training job that is progressing slowly may appear healthy at the GPU and CPU level while the training network is the actual bottleneck. Layer 3 training network monitoring is required to detect this condition.

Training topology awareness

Virtana AIFO surfaces the topology of your training job: which GPUs are communicating with which other GPUs, and through which path (NVLink within a host, or training network across hosts). This information is used to verify that your job placement matches your hardware topology. A gradient synchronization step that crosses a slow inter-host link takes significantly longer than one that stays within a host on NVLink.
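
To show what a placement check against topology could look like, here is a minimal Python sketch. The function names and the GPU-to-host mapping are hypothetical, and the sketch assumes that every pair of GPUs on the same host is connected by NVLink, which is not true of all hardware.

```python
def sync_path(gpu_a: str, gpu_b: str, host_of: dict[str, str]) -> str:
    """Return the path a pair of GPUs uses to communicate.

    Assumes all GPUs on one host are NVLink-connected (a simplification).
    """
    return "NVLink" if host_of[gpu_a] == host_of[gpu_b] else "training network"


def cross_host_pairs(pairs: list[tuple[str, str]],
                     host_of: dict[str, str]) -> list[tuple[str, str]]:
    """List the communicating pairs that must cross the inter-host training network."""
    return [(a, b) for a, b in pairs if sync_path(a, b, host_of) == "training network"]


# Hypothetical placement: two GPUs on hostA and one on hostB.
host_of = {"gpu0": "hostA", "gpu1": "hostA", "gpu2": "hostB"}
pairs = [("gpu0", "gpu1"), ("gpu1", "gpu2")]
slow_candidates = cross_host_pairs(pairs, host_of)
```

Pairs returned by `cross_host_pairs` are the ones whose synchronization steps depend on the training network, so they are the first candidates to review when a job's placement does not match the hardware topology.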

Storage throughput monitoring

Training workloads require a continuous supply of data to keep GPUs busy. If storage cannot deliver data fast enough, GPUs idle between batches, a condition known as I/O-bound training. Virtana AIFO monitors storage read throughput, referred to as DSK (disk) signals, and correlates it with GPU utilization. When GPU utilization drops in a regular pattern that aligns with storage read cycles, the training job is I/O-bound.
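
One simple way to express the alignment between GPU idle time and storage reads is to measure how often the two coincide. The following Python sketch does this with a plain fraction rather than a statistical correlation. The function name, the thresholds, and the sample data are assumptions for illustration, not Virtana AIFO's detection method.

```python
def io_bound_score(gpu_util_pct: list[float], read_mb_s: list[float],
                   idle_below_pct: float = 30.0,
                   reading_above_mb_s: float = 100.0) -> float:
    """Fraction of GPU-idle samples that coincide with active storage reads.

    Both series are assumed to be sampled at the same timestamps.
    The two thresholds are assumed example values.
    """
    reads_while_idle = [
        read for util, read in zip(gpu_util_pct, read_mb_s)
        if util < idle_below_pct
    ]
    if not reads_while_idle:
        return 0.0  # the GPU was never idle, so there is nothing to attribute
    active = sum(1 for read in reads_while_idle if read > reading_above_mb_s)
    return active / len(reads_while_idle)


# Hypothetical samples: GPU utilization dips whenever a storage read burst occurs.
gpu_util = [95, 96, 10, 12, 94, 97, 8, 11]
reads = [20, 15, 900, 850, 25, 10, 880, 910]
score = io_bound_score(gpu_util, reads)
```

A score near 1.0 means that almost every idle period overlaps a read burst, which is the pattern described above for I/O-bound training. A score near 0.0 suggests that the idle time has another cause.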

Tip

I/O-bound training is particularly common in on-premises deployments where training data is stored on spinning-disk NAS systems. If Virtana AIFO shows periodic GPU idle time correlated with storage reads, consider moving training data to local NVMe storage or a high-throughput parallel file system.

Infrastructure signals reference

The following table provides a complete reference for the infrastructure signals that Virtana AIFO monitors at Layer 3. The Workload column indicates whether a signal applies to model training, inferencing, or both.

Table 243. Layer 3 infrastructure signals

| Signal category | Signal | Description | Workload |
| --- | --- | --- | --- |
| GPU compute | Utilization (%) | Percentage of GPU compute capacity in use | Both |
| GPU compute | Memory usage (MB/GB) | GPU memory consumed by active workloads | Both |
| GPU compute | Temperature (°C) | Device temperature. Throttle threshold varies by GPU model. | Both |
| GPU compute | Power draw (W) | Current power consumption per device | Both |
| GPU compute | Streaming multiprocessor (SM) clock speed (MHz) | Streaming multiprocessor clock frequency | Both |
| GPU compute | Memory clock speed (MHz) | Memory interface clock frequency | Both |
| GPU compute | Throttling state | Whether the device is throttled and the reason (thermal, power, or clock) | Both |
| GPU errors | ECC errors: correctable | Single-bit memory errors corrected by ECC. High counts indicate degrading GPU memory. | Both |
| GPU errors | ECC errors: uncorrectable | Double-bit memory errors not correctable by ECC. Indicates hardware failure. | Both |
| GPU errors | Xid errors | NVIDIA driver-level error events. Specific Xid codes map to specific fault types. | Both |
| Interconnect | NVLink bandwidth (GB/s) | Data transfer rate per NVLink connection between GPUs within a host | Training |
| Interconnect | NVLink error rate | Communication errors per link. Elevated rates indicate a degraded connection. | Training |
| Interconnect | Training network throughput (GB/s) | Inter-host data transfer rate across InfiniBand or RDMA over Converged Ethernet (RoCE) fabric | Training |
| Interconnect | Training network latency (µs) | Round-trip time between GPU hosts on the training network | Training |
| Storage | Read throughput (MB/s) | Storage read rate. Correlated with GPU utilization to detect I/O-bound training. | Training |
| Host | CPU utilization (%) | Host CPU load. Relevant for data pre-processing and agent orchestration. | Both |
| Host | Available host memory | Available host RAM. Low availability can cause GPU-to-host memory spill. | Both |