Self-healing system

This use case shows how you can use Virtana Platform to implement policy-based automated recovery and build a self-healing operations model. You define alert-driven policies, orchestrate multi-step workflows, and integrate with external systems to achieve transparent, auditable, and extensible remediation.

You can use this workflow when you want to reduce manual intervention, shorten incident duration, and improve SLA compliance by automatically reacting to known failure patterns.

Scenario

An enterprise production environment experiences recurring performance issues caused by infrastructure saturation, application overload, and resource contention. Handling these incidents manually leads to delayed recovery, a higher risk of SLA violations, and inconsistent remediation steps.

To address this, the operations team implements automated recovery policies in Virtana Platform that trigger predefined actions when failure conditions occur.

Platform capabilities

Virtana Platform provides the following capabilities for automated recovery:

Policy-based alert intelligence and response automation
Support for scripts and multi-step workflows
Centralized governance and auditing
LLM-assisted policy authoring
YAML-based advanced configuration

Setup

This scenario uses Alerts and Policy Configuration. Before you follow this use case, make sure you have completed the following prerequisites.

Alert condition

In this scenario, CPU utilization exceeds 90% for five consecutive minutes. This condition affects production billing application hosts and is treated as a high-severity issue that requires immediate attention to prevent service degradation.

Table 2. Alert configuration

Field	Value
Condition	CPU utilization > 90% for 5 consecutive minutes
Scope	Production billing application hosts
Severity	High

When this condition is met, the monitoring system raises a high-severity alert that Virtana Platform evaluates against your automation policies.
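
The sustained-threshold condition in the table can be illustrated with a short Python sketch. It assumes one CPU sample per minute; the function name `sustained_breach` and its parameters are hypothetical and are not part of Virtana Platform.

```python
from typing import Iterable


def sustained_breach(cpu_by_minute: Iterable[float],
                     threshold: float = 90.0,
                     window: int = 5) -> bool:
    """Return True if CPU stays above `threshold` for `window` consecutive samples.

    Assumes one sample per minute (an assumption made for this sketch).
    """
    streak = 0
    for value in cpu_by_minute:
        # A sample at or below the threshold resets the streak.
        streak = streak + 1 if value > threshold else 0
        if streak >= window:
            return True
    return False


# Five samples above 90 in a row satisfy the condition.
breach = sustained_breach([91, 92, 95, 93, 97])
# A dip to 89 resets the streak, so this series does not.
no_breach = sustained_breach([91, 92, 89, 93, 97, 95])
```

Counting consecutive samples, rather than averaging over the window, keeps a single short spike from raising the alert.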

Automation policy configuration

Policies are created in the Governance → Alert Responses section.

You can choose one of two methods for the policy configuration:

LLM-assisted policy authoring
You can create policies by providing a natural language prompt, such as, “Create a policy to restart the VM and notify Microsoft Teams when CPU exceeds 90% for 5 minutes.” The platform automatically generates the policy logic.
This approach is useful when you want to quickly bootstrap policies without writing configuration by hand.
YAML-based policy definition
For more advanced or large-scale environments, you can manually define policies in a structured YAML format. You can express complex conditions and branching logic or reuse snippets across environments in YAML, then upload it in the Alert Responses configuration.
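
Whichever method you choose, the resulting policy carries roughly the same information: a condition, a scope, and a list of actions. The Python sketch below models that information and runs a few sanity checks before a policy is uploaded. Every field name and action identifier here is hypothetical; the schema that Virtana Platform actually accepts may differ.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical action identifiers, used only for this illustration.
KNOWN_ACTIONS = {"notify_teams", "create_ticket", "restart_vm"}


@dataclass
class Policy:
    name: str
    metric: str
    threshold_percent: float
    duration_minutes: int
    scope: str
    severity: str
    actions: List[str] = field(default_factory=list)


def validate(policy: Policy) -> List[str]:
    """Return a list of problems; an empty list means no problems were found."""
    problems = []
    if not 0 < policy.threshold_percent <= 100:
        problems.append("threshold_percent must be in (0, 100]")
    if policy.duration_minutes < 1:
        problems.append("duration_minutes must be at least 1")
    if not policy.actions:
        problems.append("at least one action is required")
    unknown = [a for a in policy.actions if a not in KNOWN_ACTIONS]
    if unknown:
        problems.append("unknown actions: " + ", ".join(unknown))
    return problems


# The policy described by the natural language prompt above.
cpu_policy = Policy(
    name="billing-cpu-recovery",
    metric="cpu_utilization",
    threshold_percent=90,
    duration_minutes=5,
    scope="production-billing-hosts",
    severity="high",
    actions=["notify_teams", "restart_vm"],
)
problems = validate(cpu_policy)
```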

Trigger

Sustained CPU overload on a billing application host exceeds the defined threshold, and the monitoring system generates a high-severity alert. The automation policy engine then evaluates this alert to determine the appropriate response.

Automated recovery workflow

To execute the use case, perform the following steps.

Step 1: Detect alert and evaluate policies

Virtana detects sustained CPU saturation and correlates it with the potential impact on service performance, in this case, possible increased response times or error rates on a production billing host.

The alert intelligence engine evaluates all active policies to identify one that matches the detected condition. Once a match is found, the system updates the status to show that the policy has been triggered.
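
The matching step can be pictured as a first-match search over the active policies. The sketch below is only an illustration of that idea: the dictionary keys, the severity ranking, and the `policy_triggered` status value are assumptions, not the engine's actual data model.

```python
from typing import Dict, List, Optional

# Hypothetical severity ordering used to compare alert and policy severities.
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3}


def matches(policy: Dict, alert: Dict) -> bool:
    """A policy matches when it is enabled and its metric, scope, and severity fit."""
    return bool(
        policy.get("enabled", False)
        and policy["metric"] == alert["metric"]
        and policy["scope"] == alert["scope"]
        and SEVERITY_RANK[alert["severity"]] >= SEVERITY_RANK[policy["min_severity"]]
    )


def evaluate(alert: Dict, policies: List[Dict]) -> Optional[Dict]:
    """Return the first matching policy and mark the alert as triggered."""
    for policy in policies:
        if matches(policy, alert):
            alert["status"] = "policy_triggered"
            alert["policy"] = policy["name"]
            return policy
    return None


policies = [
    {"name": "disk-expand", "enabled": True, "metric": "disk_used",
     "scope": "production-billing-hosts", "min_severity": "medium"},
    {"name": "old-cpu-policy", "enabled": False, "metric": "cpu_utilization",
     "scope": "production-billing-hosts", "min_severity": "low"},
    {"name": "billing-cpu-recovery", "enabled": True, "metric": "cpu_utilization",
     "scope": "production-billing-hosts", "min_severity": "high"},
]
alert = {"metric": "cpu_utilization", "scope": "production-billing-hosts",
         "severity": "high", "status": "open"}
matched = evaluate(alert, policies)
```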

Step 2: Send automated notifications

The system sends a notification, for example a Microsoft Teams message, to inform users of the issue: “High CPU detected on host‑prod‑07. Automated recovery initiated.” A webhook then automatically creates a ticket in the ITSM system so that the incident is tracked and addressed promptly.
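
To make the two outbound calls concrete, the sketch below builds the JSON bodies and the HTTP requests without sending them. The payload field names, the ticket fields, and the example URL are all hypothetical. The schema a Teams webhook or an ITSM endpoint expects depends on how that integration is configured, so check it before reusing anything here.

```python
import json
from typing import Dict
from urllib.request import Request


def teams_message(host: str) -> Dict:
    """Build a simple text payload (the 'text' key is an assumed schema)."""
    return {"text": f"High CPU detected on {host}. Automated recovery initiated."}


def ticket_payload(host: str, severity: str) -> Dict:
    """Build a ticket body with hypothetical ITSM field names."""
    return {
        "summary": f"High CPU on {host}",
        "severity": severity,
        "source": "automated-recovery-policy",
    }


def build_post(url: str, payload: Dict) -> Request:
    """Prepare a JSON POST request; nothing is sent until it is opened."""
    body = json.dumps(payload).encode("utf-8")
    return Request(url, data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")


# Placeholder URL; replace with the webhook address of your own integration.
notify_req = build_post("https://example.invalid/teams-webhook",
                        teams_message("host-prod-07"))
ticket_req = build_post("https://example.invalid/itsm/tickets",
                        ticket_payload("host-prod-07", "high"))
```

Passing a prepared request to `urllib.request.urlopen` would send it; the sketch stops short of that so it can run without network access.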

Step 3: Run orchestrated remediation workflow

The Virtana Platform orchestration engine runs a predefined workflow, which can be simple or multi-stage, depending on your environment and risk tolerance.

Example workflow:

Validate host status
Capture diagnostic snapshot
Restart the affected virtual machine
Verify service availability
Update ticket status
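
The example workflow above can be sketched as an ordered list of steps that stops at the first failure. This is a minimal illustration, not the orchestration engine itself: the step functions are stubs, and the context keys (`host_reachable`, `service_healthy`, and so on) are invented for the example.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Dict], bool]]


def run_workflow(steps: List[Step], context: Dict) -> List[Tuple[str, str]]:
    """Run steps in order and stop at the first failure.

    Returns (step name, status) pairs for every step that was attempted.
    """
    results = []
    for name, action in steps:
        try:
            ok = bool(action(context))
        except Exception as exc:  # an exception counts as a failed step
            context["error"] = f"{name}: {exc}"
            ok = False
        results.append((name, "success" if ok else "failed"))
        if not ok:
            break
    return results


# Stub actions standing in for real integrations (all hypothetical).
def validate_host(ctx: Dict) -> bool:
    return ctx.get("host_reachable", False)


def capture_snapshot(ctx: Dict) -> bool:
    ctx["snapshot"] = f"diag-{ctx['host']}"
    return True


def restart_vm(ctx: Dict) -> bool:
    ctx["restarted"] = True
    return True


def verify_service(ctx: Dict) -> bool:
    return ctx.get("restarted", False) and ctx.get("service_healthy", True)


def update_ticket(ctx: Dict) -> bool:
    ctx["ticket_status"] = "resolved"
    return True


STEPS: List[Step] = [
    ("validate_host", validate_host),
    ("capture_snapshot", capture_snapshot),
    ("restart_vm", restart_vm),
    ("verify_service", verify_service),
    ("update_ticket", update_ticket),
]

context = {"host": "host-prod-07", "host_reachable": True}
results = run_workflow(STEPS, context)
```

Capturing the diagnostic snapshot before the restart matters because a restart discards the state you would need for later root cause analysis.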

Step 4: Initiate approval workflow (optional)

For critical systems, you might use a gated workflow that requires explicit approval. Suggested steps include:

Open a change request in your change management system.
Wait for the manager's approval.
On approval, execute the remediation workflow.
Notify stakeholders of completion and outcomes.

This process ensures proper governance and compliance.
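
A gated workflow of this kind might be sketched as follows. The decision values (`pending`, `approved`, `rejected`), the function name, and the callback signatures are assumptions made for the illustration; in practice the decisions would come from polling your change management system.

```python
from typing import Callable, Iterable


def run_gated(change_id: str,
              decisions: Iterable[str],
              remediate: Callable[[], str],
              notify: Callable[[str], None]) -> str:
    """Run `remediate` only after an 'approved' decision arrives.

    `decisions` yields the result of each poll of the change request.
    """
    for decision in decisions:
        if decision == "pending":
            continue  # keep waiting for the approver
        if decision == "approved":
            outcome = remediate()
            notify(f"{change_id}: remediation finished with outcome '{outcome}'")
            return outcome
        notify(f"{change_id}: remediation skipped, request was {decision}")
        return "skipped"
    notify(f"{change_id}: no decision received, remediation not run")
    return "timed_out"


messages = []
outcome = run_gated(
    "CHG-0001",                          # hypothetical change request ID
    ["pending", "pending", "approved"],  # simulated poll results
    remediate=lambda: "success",
    notify=messages.append,
)
```

Making the default path "do nothing" means that a rejected or unanswered request never results in an unapproved change.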

Step 5: Take action based on the root cause

Depending on the issue detected, the system automatically initiates the most appropriate remediation actions. Examples of actions include:

Issue	Automated action
Disk full	Expand NetApp volume
CPU overload	Restart the VM or scale the cluster
Memory leak	Restart the affected service
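
The table above amounts to a lookup from issue type to action. The sketch below shows one way to express it. The issue and action identifiers are hypothetical, and choosing between restarting the VM and scaling the cluster by a `clustered` flag is an assumption added for the example, since the table lists both options without stating a rule.

```python
# Hypothetical identifiers mirroring the rows of the table above.
ACTIONS = {
    "disk_full": "expand_netapp_volume",
    "memory_leak": "restart_service",
}


def choose_action(issue: str, clustered: bool = False) -> str:
    """Pick a remediation action for a detected issue."""
    if issue == "cpu_overload":
        # Assumed rule: scale out when the host belongs to a cluster,
        # otherwise restart the VM.
        return "scale_cluster" if clustered else "restart_vm"
    try:
        return ACTIONS[issue]
    except KeyError:
        raise ValueError(f"no automated action defined for issue: {issue}") from None


action = choose_action("cpu_overload")
```

Raising an error for an unknown issue, instead of falling back to a default action, keeps the automation from acting on failure patterns nobody has reviewed.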

Step 6: Auditing and governance tracking

All automation activities are recorded in Reports → Executed Actions in Global View, where you can view details such as:

Which policy triggered
Which actions ran and in what sequence
Success or failure status for each step
Who approved, if the workflow is gated
Links to related alerts, tickets, or incidents

This audit trail supports governance, compliance, and continuous improvement of your automation strategy.
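
As an illustration of what one such audit entry could contain, the sketch below models the details listed above as a small record that serializes to JSON. The class and field names, the approver address, and the ticket identifiers are hypothetical and do not reflect the format of the Executed Actions report.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class StepResult:
    action: str
    status: str  # "success" or "failed"


@dataclass
class AuditRecord:
    policy: str                        # which policy triggered
    steps: List[StepResult]            # actions in execution order
    approved_by: Optional[str] = None  # set only for gated workflows
    related: List[str] = field(default_factory=list)  # alert or ticket IDs

    def failed_steps(self) -> List[str]:
        """Return the names of the actions that did not succeed."""
        return [s.action for s in self.steps if s.status == "failed"]


record = AuditRecord(
    policy="billing-cpu-recovery",
    steps=[
        StepResult("validate_host", "success"),
        StepResult("restart_vm", "success"),
        StepResult("verify_service", "failed"),
    ],
    approved_by="ops-manager@example.invalid",
    related=["ALERT-1", "TICKET-1"],
)
as_json = json.dumps(asdict(record), indent=2)
```

Keeping the steps as an ordered list preserves the execution sequence, which is what lets a reviewer see where a workflow stopped.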

Benefits

These capabilities provide several key benefits, including faster incident recovery, reduced manual intervention, consistent remediation across systems, improved SLA compliance, and strong governance with full traceability.

Technical benefits include:

Simplicity of policy definition: No-code/low-code creation, LLM-assisted configuration, and reusable templates.
Integration capability: Native ITSM and collaboration tools, webhooks, APIs, and plugin ecosystem.
Orchestration depth: Single-step scripts, multi-stage workflows, approval-based automation, and conditional branching.

Best practices

To safely adopt and scale automated recovery, consider the following practices:

Start with notification-only policies
Gradually enable auto-remediation
Test policies in staging
Maintain approval workflows for Tier-1 systems
Review reports regularly

Summary

Virtana Global View uses policy‑based automation to deliver intelligent, transparent, and auditable recovery actions. By combining AI‑assisted policy authoring, flexible integrations, and strong governance controls, the platform enables teams to move from reactive operations to reliable, self‑healing systems.
