Debugging a slow application
This use case describes how you can use Virtana observability capabilities to investigate and remediate a slow application that has triggered a service level agreement (SLA) breach. It demonstrates an end-to-end workflow from alert notification to automated remediation.
You can use this workflow when a business-critical application slows down during peak usage, and you must quickly determine the root cause and restore performance.
Scenario
An enterprise billing application supports revenue-critical operations. During peak hours, users begin to experience slow response times when accessing billing services.
The objective is to identify where the slowdown originates, such as the front end, service tier, database, or storage, and to remediate it with minimal business impact.
Setup
Before you begin, confirm that the prerequisites are met and review the condition that triggers this workflow.
Prerequisites
Before you can follow this workflow, make sure the following conditions are met.
Billing service is onboarded in Virtana Global View.
Service topology is configured.
Service level agreement (SLA) and service level objective (SLO) thresholds are defined.
Notification channels are integrated.
Automated actions are configured.
Trigger
The platform detects abnormal latency for the billing service and generates an SLA breach alert.
As part of this trigger, an SLA breach event is recorded and an email notification is sent to the on-call user or team.
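The trigger condition can be illustrated with a minimal sketch. It assumes the SLA is expressed as a 95th-percentile response time threshold, which is only one possible definition. The names `SlaPolicy`, `p95`, and `is_sla_breached` are hypothetical and are not part of any Virtana API.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class SlaPolicy:
    # Assumed threshold: 95th-percentile response time in milliseconds.
    p95_latency_ms: float


def p95(samples: list) -> float:
    """Return the 95th percentile of the latency samples."""
    # quantiles(n=20) yields 19 cut points; the last one is the 95th percentile.
    return quantiles(samples, n=20, method="inclusive")[-1]


def is_sla_breached(samples: list, policy: SlaPolicy) -> bool:
    """True when the observed p95 latency exceeds the SLA threshold."""
    return p95(samples) > policy.p95_latency_ms


# Illustrative data: most requests are fast, but the slowest ones are very slow.
peak_hour = [100.0] * 18 + [900.0, 950.0]
print(is_sla_breached(peak_hour, SlaPolicy(p95_latency_ms=500.0)))  # True
```

A percentile is used here instead of an average because a small number of very slow requests can breach an SLA while the mean still looks healthy.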
Use case workflow
This use case consists of the following steps.
Step 1: Receive alert notification
You receive an on-call notification that indicates the billing service is running slowly. The notification clearly identifies the affected service, provides the current SLA status to indicate compliance impact, and specifies the severity level to help assess the urgency of the issue. Additionally, the alert includes a direct link to the corresponding incident in Virtana Global View, enabling rapid access for further investigation and remediation.
For example, the subject line might read: “Billing Service SLA Breach – Response Time Exceeded”. You can click the embedded link in the email to open the alert in Virtana Global View.
Step 2: Review SLA breach details
In the alert dashboard, you can review the current state of the service. The platform displays the SLA status, the impacted KPIs (such as latency and availability), the duration of the breach, and the affected services and entities, including downstream components and related metrics.
From this view, you can confirm that the billing service has exceeded its defined SLA thresholds.
Step 3: Analyze service dependencies using topology
To understand where the problem originates, you can open the Topology sub-tab. The topology view shows the end-to-end flow for the billing service, for example: Billing API > Payment Service > Database > Storage
Any components with abnormal behavior are visually highlighted. In a typical scenario, the Payment Service may show increased latency, the database node may exhibit high I/O wait, and the storage volume may display elevated latency.
From this dependency view, you can infer that the slowdown is likely related to backend resources, rather than the front-end billing API itself.
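The reasoning in this step can be sketched as a walk along the dependency chain. The sketch assumes a simple linear chain and a precomputed set of components flagged as abnormal. The function name and data are hypothetical and do not represent how the platform performs topology analysis.

```python
from typing import Optional

# Example dependency chain from this use case, ordered front end to back end.
CHAIN = ["Billing API", "Payment Service", "Database", "Storage"]


def deepest_anomalous(chain: list, anomalous: set) -> Optional[str]:
    """Return the abnormal component furthest downstream, or None."""
    # Upstream symptoms are often caused by downstream problems,
    # so start from the back end and work forward.
    for component in reversed(chain):
        if component in anomalous:
            return component
    return None


suspect = deepest_anomalous(CHAIN, {"Payment Service", "Database", "Storage"})
print(suspect)  # Storage
```

The furthest-downstream abnormal component is only a starting point for investigation. You still validate it against metrics, as described in the next step.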
Step 4: Investigate metrics and AI recommendations
You can open the Troubleshooting tab to view the correlated telemetry. This view surfaces information across containers, infrastructure, and network layers, including CPU, memory, and pod utilization, disk IOPS and latency, network throughput and error rates, and historical trends that can be compared against baseline performance.
AI-driven recommendations are presented in the context of the current incident. For example, a recommendation might read: “High storage latency detected on database volume. Consider migrating to a higher-performance tier or rerouting traffic.”
From this data, you can validate that storage latency is elevated on the database volume and that the storage issue is contributing significantly to the overall application slowdown.
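Comparing current values against baseline performance can be sketched as a ratio check. The metric names, sample values, and the factor of 2.0 are assumptions chosen for illustration. The platform's own baselining method is not described in this use case.

```python
def deviations(current: dict, baseline: dict, factor: float = 2.0) -> dict:
    """Return metrics whose current value is at least `factor` times baseline."""
    flagged = {}
    for name, value in current.items():
        base = baseline.get(name)
        # Skip metrics with no baseline or a zero baseline.
        if base and value / base >= factor:
            flagged[name] = round(value / base, 2)
    return flagged


# Hypothetical readings for the database host.
current = {"db_volume_latency_ms": 40.0, "cpu_pct": 55.0}
baseline = {"db_volume_latency_ms": 5.0, "cpu_pct": 50.0}
print(deviations(current, baseline))  # {'db_volume_latency_ms': 8.0}
```

In this example only the storage latency stands out, which is consistent with a storage-related root cause and not a CPU-related one.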
Step 5: Notify the operations team
To ensure cross-team coordination, you can initiate a collaboration action. From the troubleshooting view, you can select Send Notification in Execute Action.
Virtana Platform then sends an automated message to the configured Slack channel (or equivalent). For example, a message might be: “Billing Service SLA breach detected. Root cause: high DB storage latency. Investigating remediation.”
This step brings the wider operations, database, and storage teams into the loop, and provides a single, consistent summary of the incident and suspected cause.
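For readers who want to see what such a message looks like as data, the following sketch builds a JSON body with a single `text` field, which is the basic shape that Slack incoming webhooks accept. The function name is hypothetical, no request is sent, and the way Virtana Platform integrates with Slack internally is not shown here.

```python
import json


def build_incident_message(service: str, cause: str, status: str) -> str:
    """Return a JSON body containing one `text` field."""
    text = f"{service} SLA breach detected. Root cause: {cause}. {status}."
    return json.dumps({"text": text})


body = build_incident_message(
    "Billing Service",
    "high DB storage latency",
    "Investigating remediation",
)
print(body)
```

Building the message from the same three fields each time keeps the summary consistent across incidents and teams.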
Step 6: Execute remediation action – reroute traffic
Based on the AI recommendation and validated metrics, you can choose an automated remediation action. From the same view, you can select Reroute Traffic in Execute Action.
The platform then executes the predefined automation workflow, which can include:
Redirecting traffic to a secondary database replica
Shifting workload to an alternate cluster or availability zone
Activating a failover route or using a higher-performance storage tier
On completion, the platform validates the result and updates the action status, for example, "Status: Action completed successfully." This reduces manual intervention and accelerates recovery.
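One decision inside a reroute workflow is choosing where to send traffic. The following sketch assumes that each candidate target reports a health flag and a current latency, and it picks the fastest healthy target under a latency limit. The `Target` type, the function name, and the sample values are hypothetical and are not the platform's automation logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Target:
    name: str
    healthy: bool
    latency_ms: float


def pick_reroute_target(candidates: list, max_latency_ms: float) -> Optional[Target]:
    """Return the healthy target with the lowest latency, or None if none qualify."""
    eligible = [t for t in candidates if t.healthy and t.latency_ms <= max_latency_ms]
    return min(eligible, key=lambda t: t.latency_ms, default=None)


candidates = [
    Target("db-replica-a", healthy=True, latency_ms=12.0),
    Target("db-replica-b", healthy=False, latency_ms=3.0),
    Target("db-replica-c", healthy=True, latency_ms=7.0),
]
chosen = pick_reroute_target(candidates, max_latency_ms=20.0)
print(chosen.name if chosen else "no eligible target")  # db-replica-c
```

Returning `None` when no target qualifies lets the caller fall back to a manual decision instead of rerouting traffic to an unhealthy destination.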
Step 7: Monitor recovery and confirm resolution
After rerouting and remediation, you can monitor the service health in Global View. You can view the latency trends returning toward baseline, the SLA status transitioning from breached back to normal, and associated alerts moving toward a resolved state.
When conditions return within defined thresholds, the incident is automatically marked as resolved.
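A recovery check of this kind can be sketched as follows. The sketch assumes that an incident counts as recovered only after several consecutive samples fall within the threshold, which avoids resolving on a single good reading. That rule, the function name, and the default of three samples are assumptions, not documented platform behavior.

```python
def is_recovered(latencies_ms: list, threshold_ms: float, required: int = 3) -> bool:
    """True when the most recent `required` samples are all within threshold."""
    if required < 1:
        raise ValueError("required must be at least 1")
    if len(latencies_ms) < required:
        return False
    return all(value <= threshold_ms for value in latencies_ms[-required:])


# Latency falls back under a 500 ms threshold after the reroute.
print(is_recovered([900.0, 700.0, 400.0, 380.0, 350.0], threshold_ms=500.0))  # True
# A recent spike means the service has not yet stabilized.
print(is_recovered([900.0, 400.0, 700.0, 380.0, 350.0], threshold_ms=500.0))  # False
```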
Results
After this workflow completes:
Billing service performance is restored to acceptable levels.
SLA compliance is re-established.
The incident record is updated.
Root cause and remediation details are documented for future reference.
Historical data is retained to support trend analysis and continuous improvement.
Benefits
The key benefits of this use case are as follows:
| Area | Value |
|---|---|
| MTTR (mean time to repair) | Faster issue detection and resolution |
| Visibility | End-to-end service and dependency insight |
| Collaboration | Integrated Slack notifications and shared context |
| Automation | Reduced manual intervention and repeatable remediation |
| Reliability | Improved SLA compliance and more consistent performance |
Best practices
To get the most value from this workflow, consider the following practices:
Regularly review SLA and SLO thresholds.
Continuously validate and update topology data.
Maintain and refine automation runbooks.
Capture lessons learned after major incidents.
Summary
Using Virtana Platform, you can quickly identify SLA breaches, analyze service dependencies, validate root causes, collaborate in real time, and execute automated remediation actions. This helps minimize business impact and maintain consistent application performance.