Alerts Detail

The Alert Summary Page in AI Ops provides a comprehensive view of alerts generated by the system.

To access the Alert Summary Page, navigate to Monitoring > Alerts, and click any alert.

Once you click on an alert, you'll be directed to a page where you can find the following details:

  • Alert Status: Indicates whether the alert is open, closed, acknowledged, or in progress.

  • Severity: Represents the level of impact or urgency associated with the alert (e.g., critical, major, minor).

  • Priority: Determines the importance or order in which the alert should be addressed.

  • Duration: Shows the time elapsed since the alert was triggered.

  • Repeat Count: Indicates the number of times the same alert has occurred within a specified timeframe.
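
As an illustration only, the summary fields listed above can be modeled as a small record. This is a minimal sketch, not the platform's actual data model: the class names, the field names, and the integer priority scale are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Status(Enum):
    OPEN = "open"
    ACKNOWLEDGED = "acknowledged"
    IN_PROGRESS = "in progress"
    CLOSED = "closed"


class Severity(Enum):
    CRITICAL = "critical"
    MAJOR = "major"
    MINOR = "minor"


@dataclass
class AlertSummary:
    """Hypothetical container for the fields shown on the Alert Summary Page."""
    status: Status
    severity: Severity
    priority: int              # assumption: a lower number is handled first
    triggered_at: datetime
    repeat_count: int = 1

    def duration(self, now: datetime) -> timedelta:
        # Duration is the time elapsed since the alert was triggered.
        return now - self.triggered_at


alert = AlertSummary(
    status=Status.OPEN,
    severity=Severity.MAJOR,
    priority=2,
    triggered_at=datetime(2024, 1, 1, 12, 0, 0),
    repeat_count=3,
)
print(alert.duration(datetime(2024, 1, 1, 12, 45, 0)))  # 0:45:00
```

Passing the current time into `duration` explicitly keeps the example deterministic; a live view would presumably use the current clock time.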

Tabs on Alert Summary Page:

The Alert Summary Page is divided into three tabs:

  1. Overview: This tab provides a general overview of the alert, including its status, severity, priority, duration, and repeat count. It serves as a quick reference for understanding the key aspects of the alert.

  2. Troubleshooting: In this tab, you can find detailed information and insights to aid in troubleshooting the alert. This may include diagnostic data, suggested actions, and relevant contextual information to help resolve the issue efficiently.

  3. Properties: The Properties tab offers a deeper dive into the alert's properties, allowing you to access additional metadata, tags, associated resources, and any custom attributes relevant to the alert.

Overview

The Overview tab on the Alert Summary Page in AI Ops provides essential details about the alert, including what occurred, who manages the alert, unique identifiers, and specifics about the target entity associated with the alert.

To access the Overview tab, navigate to the Alert Summary Page by clicking on an alert from the alert list. Once on the Alert Summary Page, locate and click on the Overview tab to view the relevant information.

Under the Overview tab, you can find the following details about the alert and its associated target entity:

alerts_overview.png
  1. What Happened: This section provides a brief description or summary of the event or condition that triggered the alert.

    • Alert Manager: Indicates the system or tool responsible for managing and generating the alert. This helps identify the source of the alert and the associated management system.

    • Alert ID: A unique identifier assigned to the alert, which can be used for tracking, referencing, and correlating related events or actions.

  2. Target Entity Details

    • Entity Name: The name or label of the target entity associated with the alert, providing context about the affected component or resource.

    • Type: Indicates the type or category of the target entity (e.g., server, application, network device).

    • ID: A unique identifier assigned to the target entity, facilitating traceability and correlation with other system components.

    • Cluster-ID: If applicable, this represents the identifier of the cluster or group to which the target entity belongs.

    • Related Details: Additional information or attributes related to the target entity, such as its configuration, status, or relationships with other entities.

  3. Event Details: This section provides an option to view event details, which opens a dedicated page displaying information about the events related to the alert.

  4. Last Activity: The Last Activity section shows the recent actions taken on the alert. Activities are categorized by their nature and origin, with color-coded indicators for easy identification:

    • Orange: Activities executed by automated systems or assistants are highlighted in orange.

    • Blue: Actions performed by human users are denoted in blue, indicating manual intervention or response.

    • In addition to viewing activity details, users have the option to add notes to the alert, providing additional context, observations, or instructions for future reference.

  5. Related Infrastructure Details: This section shows the infrastructure components related to the alert: the node, pod, namespace, and container. This context helps users understand the alert and expedite incident resolution.

  6. Alerted Metrics: The Alerted Metrics Graph is a visual representation of the metrics associated with an alert within the AIOps platform. This graph allows users to view detailed data over time, providing insights into the performance, behaviour, or status of the monitored system or application. The Alerted Metrics Graph presents the following details:

    • Metric Data: The graph displays the specific metric(s) that triggered the alert.

    • Time Axis: The x-axis represents time, with data points plotted over a specified time range.

    • Percentage Scale: The y-axis typically represents the percentage or value of the metric being monitored.

  7. Related Alerts: A related alert is a secondary alert that is connected to or associated with a primary alert.
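
The color coding described for the Last Activity section can be sketched as a simple lookup. This is an illustrative sketch under assumptions: the `Activity` class, the `actor_type` values, and the sample log entries are hypothetical and are not taken from the platform.

```python
from dataclasses import dataclass

# Colors as described for the Last Activity section:
# orange for automated systems or assistants, blue for human users.
ACTOR_COLORS = {"automation": "orange", "user": "blue"}


@dataclass
class Activity:
    actor_type: str    # hypothetical values: "automation" or "user"
    description: str


def indicator_color(activity: Activity) -> str:
    """Return the indicator color for one activity entry."""
    return ACTOR_COLORS[activity.actor_type]


log = [
    Activity("automation", "Ran root cause analysis"),
    Activity("user", "Added note: restarted the pod"),
]
colors = [indicator_color(entry) for entry in log]
print(colors)  # ['orange', 'blue']
```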

Troubleshooting

The Troubleshooting tab contains the Root Cause Analysis Summary, which provides an overview of the alert investigation process, including the following details:

TROUBLESHOOTING.png
  1. Root Cause Analysis Summary: Root cause analysis (RCA) in AIOps (Artificial Intelligence for IT Operations) troubleshooting is the process of identifying the underlying causes of issues within an IT infrastructure. A fishbone diagram is used to identify and analyze the potential causes contributing to a problem.

    Table 4. Root Cause Analysis

    Icon | Description

    analysis_ctegory.png | Analysis Category: This categorizes the nature of the issue, providing insights into whether it's related to performance, connectivity, security, or other relevant areas.

    detected_cause.png | Detected Cause: This section outlines the specific cause or causes identified during the analysis process.

    checked_no_issue.png | Checked, No Issue Found: If no underlying issue is detected during the troubleshooting process, this status will be indicated. It suggests that the alert may have been triggered by transient factors or false positives.



  2. Show Only Detected Causes button: When this button is enabled, only the detected causes are shown, giving a clear and concise view of the underlying issues driving the alert.

  3. Detected Cause Details: This displays a comprehensive breakdown of potential causes. Each detected cause is listed along with relevant details. Clicking on an individual detected cause reveals further information. You have the option to view metric data related to the selected cause. This includes graphical representations of metrics.
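
The effect of the Show Only Detected Causes button can be pictured as a filter over the analysis results. The sketch below is an assumption-based illustration: the `RcaCheck` class, its fields, and the sample checks are hypothetical, and the platform's internal representation is not documented here.

```python
from dataclasses import dataclass


@dataclass
class RcaCheck:
    category: str    # Analysis Category, e.g. "performance"
    name: str
    detected: bool   # True = Detected Cause, False = Checked, No Issue Found


def visible_checks(checks, only_detected):
    """Return the checks to display, depending on the button state."""
    if only_detected:
        return [check for check in checks if check.detected]
    return list(checks)


checks = [
    RcaCheck("performance", "High CPU usage", True),
    RcaCheck("connectivity", "Packet loss", False),
    RcaCheck("performance", "Memory pressure", True),
]
shown = [check.name for check in visible_checks(checks, only_detected=True)]
print(shown)  # ['High CPU usage', 'Memory pressure']
```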

Note

In IPM troubleshooting, only recommendations and root cause details are presented, whereas in OC, both the root cause analysis (RCA) and its associated details are displayed.

Note

The Troubleshooting tab is accessible specifically for alerts marked with an insights icon.

Properties

In AIOps, alert properties provide detailed information about an alert, such as the metric name, rule ID, alert name, entity properties, and additional alert data. You can view:

PROPERTIES.png
  • Alert properties: The Alert Properties tab provides detailed information about the alert itself, including:

    • Manager: Indicates the system or tool responsible for managing and generating the alert.

    • Rule ID: Unique identifier associated with the rule or condition that triggered the alert.

    • Metric Name: Name of the metric or parameter being monitored that triggered the alert.

    • Duration: Duration of time the alert has been active or triggered.

    • Threshold: The threshold value or condition that, when met, triggers the alert.

  • Target Entity Tab: The Target Entity tab contains details about the entity or component affected by the alert. These details may include:

    • Entity Name: Name or identifier of the affected entity.

    • Type: Type or category of the affected entity (e.g., server, application, network device).

    • ID: Unique identifier assigned to the affected entity.

    • Cluster-ID: Identifier of the cluster or group to which the affected entity belongs.

    • Related Details: Additional information or attributes related to the affected entity.

  • Additional Alert Data Tab: The Additional Alert Data tab provides supplementary information related to the alert, such as:

    • First Occurrence: Timestamp indicating when the alert was first triggered.

    • Last Occurrence: Timestamp indicating when the alert was most recently triggered.

    • Type: Type or category of the alert (e.g., performance, availability, security).

    • Subtype: Subcategory or specific classification of the alert.

    • Severity: Severity level assigned to the alert (e.g., critical, major, minor).

    • Pod Name: Name of the pod or container associated with the alert (if applicable).
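
To make the relationship between the Metric Name and Threshold properties concrete, the sketch below checks metric samples against a threshold condition. It is a hypothetical illustration: the property keys, the comparator notation, the metric name, and the sample values are all assumptions, since the product's actual threshold format is not described here.

```python
import operator

# Assumed comparator notation; the real rule syntax may differ.
COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def breaches(value, comparator, threshold):
    """Return True when the value meets the threshold condition."""
    return COMPARATORS[comparator](value, threshold)


# Hypothetical alert properties for the example.
alert_properties = {
    "metric_name": "cpu_usage_percent",
    "comparator": ">",
    "threshold": 90.0,
}

samples = [72.5, 88.0, 93.4, 95.1]
breaching = [
    value
    for value in samples
    if breaches(value, alert_properties["comparator"], alert_properties["threshold"])
]
print(breaching)  # [93.4, 95.1]
```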

Actions (three dots)

From the Actions (three dots) menu of an alert, users can choose from the following options:

action.png
  • Related Infrastructure: Provides information about interconnected systems, dependencies, and components associated with the alert, offering a broader understanding of its impact.

  • Event List: Offers a chronological list of events and activities related to the alert, aiding in the analysis of the alert's history and contributing factors.

  • Target Entity Properties: Details the specific properties and characteristics of the entity triggering the alert, aiding in the identification and resolution of the issue.

  • Alert Activity: This represents a log or record of actions and events associated with the alert, showcasing the timeline of responses and interventions by automated systems or human users.

  • Acknowledge: When an alert is generated within the AIOps platform, it often requires attention from relevant personnel to investigate and resolve the underlying issue. The "Acknowledge" function provides a mechanism for users to formally acknowledge their awareness of the alert without necessarily resolving it immediately.
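
The Acknowledge action can be thought of as a status change that records who acknowledged the alert without resolving it. The following is a minimal sketch under assumptions: the `Alert` class, the `acknowledge` function, the sample ID and user name, and the rule that a closed alert cannot be acknowledged are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    alert_id: str
    status: str = "open"
    activity: list = field(default_factory=list)


def acknowledge(alert: Alert, user: str) -> Alert:
    """Mark the alert as acknowledged and log the action; does not resolve it."""
    if alert.status == "closed":
        # Assumption for this sketch: closed alerts cannot be acknowledged.
        raise ValueError("cannot acknowledge a closed alert")
    alert.status = "acknowledged"
    alert.activity.append(f"{user} acknowledged the alert")
    return alert


alert = acknowledge(Alert("alert-001"), "jdoe")
print(alert.status)  # acknowledged
```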