Self-healing system
This use case shows how you can use Virtana Platform to implement policy-based automated recovery and build a self-healing operations model. You define alert-driven policies, orchestrate multi-step workflows, and integrate with external systems to achieve transparent, auditable, and extensible remediation.
You can use this workflow when you want to reduce manual intervention, shorten incident duration, and improve SLA compliance by automatically reacting to known failure patterns.
Scenario
An enterprise production environment experiences recurring performance issues caused by infrastructure saturation, application overload, and resource contention. Resolving these issues manually leads to delayed recovery, a higher risk of SLA violations, and inconsistent remediation steps.
To address this, the operations team implements automated recovery policies in Virtana Platform that trigger predefined actions when failure conditions occur.
Platform capabilities
Virtana Platform provides the following capabilities for automated recovery:
Policy-based alert intelligence and response automation
Support for scripts and multi-step workflows
Centralized governance and auditing
LLM-assisted policy authoring
YAML-based advanced configuration
Setup
This scenario uses Alerts and Policy Configuration. Before you follow this use case, make sure you have completed the following prerequisites.
Alert condition
In this scenario, CPU utilization exceeds 90% for five consecutive minutes. This condition affects production billing application hosts and is treated as a high-severity issue that requires immediate attention to prevent service degradation.
| Field | Value |
|---|---|
| Condition | CPU utilization > 90% for 5 consecutive minutes |
| Scope | Production billing application hosts |
| Severity | High |
When this condition is met, the monitoring system raises a high-severity alert that Virtana Platform evaluates against your automation policies.
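The "sustained for 5 consecutive minutes" part of the condition can be illustrated with a minimal Python sketch. It assumes one CPU sample per minute; the function name `sustained_breach` and its parameters are hypothetical and are not part of Virtana Platform.

```python
def sustained_breach(samples, threshold=90.0, window=5):
    """Return True once `window` consecutive samples exceed `threshold`.

    `samples` is assumed to hold one CPU utilization reading (in percent)
    per minute, oldest first.
    """
    streak = 0
    for value in samples:
        # Any reading at or below the threshold resets the streak.
        streak = streak + 1 if value > threshold else 0
        if streak >= window:
            return True
    return False
```

A short dip below the threshold resets the count, so brief spikes do not raise the alert.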
Automation policy configuration
Policies are created in the Governance → Alert Responses section.
You can choose one of two methods for the policy configuration:
LLM-assisted policy authoring
You can create policies by providing a natural language prompt, such as, “Create a policy to restart the VM and notify Microsoft Teams when CPU exceeds 90% for 5 minutes.” The platform automatically generates the policy logic.
This approach is useful when you want to quickly bootstrap policies without writing configuration by hand.
YAML-based policy definition
For more advanced or large-scale environments, you can manually define policies in a structured YAML format. You can express complex conditions and branching logic or reuse snippets across environments in YAML, then upload it in the Alert Responses configuration.
Trigger
Sustained CPU overload on a billing application host exceeds the defined threshold, prompting the monitoring system to generate a high-severity alert. The automation policy engine then evaluates this alert to decide the best response.
Automated recovery workflow
To execute the use case, perform the following steps.
Step 1: Detect alert and evaluate policies
Virtana Platform detects sustained CPU saturation and correlates it with the potential impact on service performance, in this case possibly increased response times or error rates on a production billing host.
The alert intelligence engine evaluates all active policies to identify one that matches the detected condition. Once a match is found, the system updates the status to show that the policy has been triggered.
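Conceptually, the policy evaluation in this step is a lookup of the first active policy whose criteria match the alert. The following sketch illustrates that idea only; the `Alert` and `Policy` classes and their fields are hypothetical and do not reflect how the alert intelligence engine is implemented.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Alert:
    metric: str      # e.g. "cpu_utilization"
    severity: str    # e.g. "high"
    scope: str       # e.g. "prod-billing"


@dataclass(frozen=True)
class Policy:
    name: str
    metric: str
    severity: str
    scope: str
    enabled: bool = True


def first_matching_policy(alert: Alert, policies: Iterable[Policy]) -> Optional[Policy]:
    """Return the first enabled policy whose criteria equal the alert's, else None."""
    for policy in policies:
        if not policy.enabled:
            continue
        if (policy.metric, policy.severity, policy.scope) == (
            alert.metric, alert.severity, alert.scope
        ):
            return policy
    return None
```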
Step 2: Send automated notifications
The system notifies users of the issue, for example with a Microsoft Teams message such as “High CPU detected on host‑prod‑07. Automated recovery initiated.” A webhook then automatically creates a ticket in the ITSM system so that the incident is tracked and addressed promptly.
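To show what such a notification and ticket request might carry, the sketch below builds two JSON bodies without sending them. It assumes the receiving chat webhook accepts a simple JSON object with a `text` field; the ticket field names (`short_description`, `severity`, `source`) are hypothetical, and a real ITSM system defines its own schema.

```python
import json


def build_chat_message(host: str, condition: str) -> str:
    """Build a JSON body for a chat webhook that accepts a plain `text` field."""
    text = f"{condition} detected on {host}. Automated recovery initiated."
    return json.dumps({"text": text})


def build_ticket_payload(host: str, condition: str, severity: str) -> str:
    """Build a JSON body for a ticket-creation webhook (hypothetical fields)."""
    return json.dumps({
        "short_description": f"{condition} on {host}",
        "severity": severity,
        "source": "automation-policy",
    })
```

Keeping payload construction separate from delivery makes the message content easy to test before a policy is connected to a live channel.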
Step 3: Run orchestrated remediation workflow
The Virtana Platform orchestration engine runs a predefined workflow, which can be simple or multi-stage, depending on your environment and risk tolerance.
Example workflow:
Validate host status
Capture diagnostic snapshot
Restart the affected virtual machine
Verify service availability
Update ticket status
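The example workflow above can be sketched as an ordered list of steps that stops at the first failure, so a host is not restarted if validation fails. This is an illustration under stated assumptions: `run_workflow`, `EXAMPLE_STEPS`, and the stubbed step functions are hypothetical and stand in for real platform actions.

```python
def run_workflow(steps, context):
    """Run (name, action) steps in order; stop at the first failing step.

    Each action receives `context` and returns a truthy value on success.
    Returns a list of (step_name, "success" | "failed") in execution order.
    """
    results = []
    for name, action in steps:
        try:
            ok = bool(action(context))
        except Exception:
            # An exception in a step is treated as a failed step.
            ok = False
        results.append((name, "success" if ok else "failed"))
        if not ok:
            break
    return results


# Stubbed steps mirroring the example workflow; real actions would call
# the relevant infrastructure and ITSM integrations.
EXAMPLE_STEPS = [
    ("validate_host_status", lambda ctx: ctx["host_reachable"]),
    ("capture_diagnostic_snapshot", lambda ctx: True),
    ("restart_vm", lambda ctx: True),
    ("verify_service_availability", lambda ctx: True),
    ("update_ticket_status", lambda ctx: True),
]
```

The returned list records which steps ran and in what order, which is the kind of information an audit trail needs.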
Step 4: Initiate approval workflow (optional)
For critical systems, you might use a gated workflow that requires explicit approval. Suggested steps include:
Open a change request in your change management system.
Wait for the manager's approval.
On approval, execute the remediation workflow.
Notify stakeholders of completion and outcomes.
This process ensures proper governance and compliance.
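The gated flow can be sketched as a function that runs remediation only after an explicit approval. The callables `request_approval`, `remediate`, and `notify` are hypothetical placeholders for your change management, orchestration, and notification integrations.

```python
def run_gated(request_approval, remediate, notify):
    """Run `remediate` only if the change request is approved.

    `request_approval` is assumed to block until a decision exists and to
    return a string such as "approved" or "rejected".
    Returns True only if remediation ran and succeeded.
    """
    decision = request_approval()
    if decision != "approved":
        notify(f"Remediation not run: change request {decision}.")
        return False
    succeeded = bool(remediate())
    notify("Remediation completed." if succeeded else "Remediation failed.")
    return succeeded
```

Stakeholders are notified on every path, including rejection, so the outcome of each change request is visible.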
Step 5: Take action based on the root cause
Depending on the issue detected, the system automatically initiates the most appropriate remediation actions. Examples of actions include:
| Issue | Automated action |
|---|---|
| Disk full | Expand NetApp volume |
| CPU overload | Restart the VM or scale the cluster |
| Memory leak | Restart the affected service |
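The mapping in the table can be expressed as a simple lookup. In this sketch the issue keys and action names are hypothetical identifiers, and the `notify_only` fallback for unrecognized issues is an assumption chosen for safety, not documented platform behavior.

```python
# Hypothetical identifiers mirroring the issue-to-action table.
REMEDIATIONS = {
    "disk_full": "expand_netapp_volume",
    "cpu_overload": "restart_vm_or_scale_cluster",
    "memory_leak": "restart_affected_service",
}


def choose_action(issue: str) -> str:
    """Return the remediation for a known issue, else fall back to notify-only."""
    return REMEDIATIONS.get(issue, "notify_only")
```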
Step 6: Auditing and governance tracking
All automation activities are recorded in Reports > Executed Actions in Global View, where you can view details such as:
Which policy triggered
Which actions ran and in what sequence
Success or failure status for each step
Who approved, if the workflow is gated
Links to related alerts, tickets, or incidents
This audit trail supports governance, compliance, and continuous improvement of your automation strategy.
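If you export or mirror this information into your own tooling, a record with the details listed above might look like the following sketch. The `AuditRecord` class and its field names are hypothetical and do not represent the platform's report schema.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass
class AuditRecord:
    policy: str                              # which policy triggered
    steps: List[Tuple[str, str]]             # (action, status) in execution order
    approved_by: Optional[str] = None        # set only for gated workflows
    links: List[str] = field(default_factory=list)  # related alerts or tickets
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def succeeded(self) -> bool:
        """True only if every recorded step reported success."""
        return all(status == "success" for _, status in self.steps)
```

Calling `asdict(record)` turns a record into a plain dictionary that can be serialized for reporting.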
Benefits
These capabilities provide several key benefits, including faster incident recovery, reduced manual intervention, consistent remediation across systems, improved SLA compliance, and strong governance with full traceability.
Technical benefits include:
Simplicity of policy definition: No-code/low-code creation, LLM-assisted configuration, and reusable templates.
Integration capability: Native ITSM and collaboration tools, webhooks, APIs, and plugin ecosystem.
Orchestration depth: Single-step scripts, multi-stage workflows, approval-based automation, and conditional branching.
Best practices
To safely adopt and scale automated recovery, consider the following practices:
Start with notification-only policies
Gradually enable auto-remediation
Test policies in staging
Maintain approval workflows for Tier-1 systems
Review reports regularly
Summary
Virtana Global View uses policy‑based automation to deliver intelligent, transparent, and auditable recovery actions. By combining AI‑assisted policy authoring, flexible integrations, and strong governance controls, the platform enables teams to move from reactive operations to reliable, self‑healing systems.