Governance
Alert governance is the set of policies and practices used to manage and optimize the alert lifecycle. It covers deduplication, correlation, enrichment, response, suppression, and related strategies that present the right alerts to IT operators at the right time, reducing noise and improving incident response efficiency.
Use Cases
Deduplication and Correlation: Alert governance deduplicates and correlates events so that repeated or related alerts are not presented as separate incidents, which minimizes redundancy.
Enrichment: Adding relevant information to alerts to enhance their context, making it easier for IT operators to understand and respond to them.
Efficient Response Mechanisms: By streamlining the alert process, alert governance reduces delays in identifying and resolving issues.
Noise Reduction: Suppression policies minimize alert noise by blocking redundant, irrelevant, or low-priority alerts when specific conditions are met, so that only actionable and meaningful alerts reach your teams.
Custom and System Policies Overview
Custom Policies: Custom policies are user-defined configurations that enable organizations to tailor alert management to their specific needs. These policies encompass various functionalities to optimize the alert lifecycle:
Deduplication: Custom policies can identify and eliminate duplicate alerts, reducing noise and focusing attention on unique incidents.
Enrichment: Enriching alerts with additional context, such as relevant metrics or historical data, helps in better understanding and prioritizing issues.
Correlation: Correlating related events and alerts to identify patterns or underlying causes, improving the accuracy and speed of root cause analysis.
Aggregation: Aggregation refers to the process of grouping similar or related alerts together into a single, consolidated alert. This helps reduce the volume of alerts by combining multiple notifications that pertain to the same issue or event.
Suppression: Suppression involves preventing certain alerts from being triggered based on predefined criteria. This reduces unnecessary noise by blocking alerts that are considered redundant, irrelevant, or low-priority.
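As a rough illustration of the deduplication idea described above, the following Python sketch collapses alerts that refer to the same entity into one record with an occurrence count. The choice of key (entity_id plus entity_type) and the count field are assumptions made for this example and do not describe the platform's internal implementation.

```python
from collections import OrderedDict

def deduplicate(alerts):
    """Collapse alerts sharing (entity_id, entity_type) into one record."""
    unique = OrderedDict()
    for alert in alerts:
        # Hypothetical dedup key, chosen only for this illustration.
        key = (alert["entity_id"], alert["entity_type"])
        if key in unique:
            unique[key]["count"] += 1  # repeat of an alert already seen
        else:
            unique[key] = {**alert, "count": 1}
    return list(unique.values())

alerts = [
    {"entity_id": "h1", "entity_type": "host", "summary": "CPU high"},
    {"entity_id": "h1", "entity_type": "host", "summary": "CPU high"},
    {"entity_id": "p7", "entity_type": "pod", "summary": "Restarting"},
]
result = deduplicate(alerts)
```

Here the two host alerts are merged into a single record, while the pod alert is kept separately.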
System Policies: System policies are pre-defined configurations provided by the platform to facilitate standardized and automated alert responses. These policies primarily focus on communication and automation:
Actions by Communication Tools:
Email: System policies can automatically send alert notifications via email to designated recipients.
Slack: Integration with Slack allows for real-time alert notifications and collaboration within Slack channels.
Teams: Microsoft Teams integration enables alert notifications and discussions within Teams channels.
HTTP Action Notification: Sends a notification via a custom HTTP request to a specified endpoint, enabling seamless integration with external systems.
Mattermost Notification: Sends real-time alerts to designated Mattermost channels, ensuring timely communication and collaboration.
Webex Notification: Sends notifications to Webex teams, facilitating instant communication within the Webex platform for efficient incident management.
Automation and Orchestration:
StackStorm: System policies can leverage StackStorm for automation and orchestration, enabling complex workflows to be triggered in response to specific alerts.
Navigating the Custom Dashboard:
You can explore policy details and user activity, and use visualizations for quick insights. The table lets you quickly assess the details of all governance policies in one centralized view. It shows each policy's purpose, type, last editor, edit date, and current state, so you can monitor, manage, and adjust governance policies as needed.
Policy Name: The unique name or identifier of the governance policy.
Priority: The priority column in the governance tab indicates the relative importance or urgency of each policy. In this system, a higher number indicates a lower priority, while a lower number signifies a higher priority.
Description: A brief description explaining the purpose or focus of the policy.
Policy Type: The type or category of the policy (e.g., Deduplication, Correlation, Enrichment).
Edited By: The user or team who last edited or modified the policy.
Edit Date: The date and timestamp when the policy was last edited or updated.
State: Indicates the current status of the policy (e.g., enabled, disabled).
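The Priority and State columns can be read together as in the Python sketch below, which lists enabled policies with the most important one (lowest number) first. The policy records and field names are hypothetical, and whether the platform evaluates policies in exactly this order is an assumption.

```python
# Hypothetical rows, shaped like the dashboard table.
policies = [
    {"name": "Enrich OC alerts", "priority": 10, "state": "enabled"},
    {"name": "Drop test alerts", "priority": 1, "state": "enabled"},
    {"name": "Old correlation", "priority": 5, "state": "disabled"},
]

def by_priority(policies):
    """Return enabled policies, most important (lowest number) first."""
    enabled = [p for p in policies if p["state"] == "enabled"]
    return sorted(enabled, key=lambda p: p["priority"])

ordered_names = [p["name"] for p in by_priority(policies)]
```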
What is YAML?
YAML is a human-readable data serialization format used to define and configure policies. It lets you express complex configurations and settings clearly and concisely, so that governance rules can be integrated into Global View environments.
Creating a New Policy
To create a new policy, follow these steps:
Click New Policy.
Upload YAML File: Select the YAML file intended for the policy and upload it using the provided upload function.
Configuring Alert Correlation Policy Using YAML
Section 1: Creating YAML-format Configuration
Duplicate the "alert-policy.yaml" File: Make a copy of the existing "alert-policy.yaml" file.
Edit the Configuration: Modify the duplicated file to configure the integration as required for your setup.
Upload the Configured File: Use the provided upload feature to submit the edited YAML file containing the policy logic.
Section 2: Adding General Information
Enter the following details:
Policy Name and Description: Provide a name and a description for the policy. For client-specific policies, include the relevant tenant organisation names. A Tenant ID is not mandatory.
    policy:
      name: Same application
      description: Correlation activated because alerts generated from the same
      precedence: 1
      category: "alert_handling"
Section 3: Add Alert Criteria
Filter the type of alerts which occur on the selected resources. If no conditions are defined in this section, all alerts on the selected resources will match this policy.
criteria: "event_provider: OpsCruise AND -status: Closed"
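To make the matching rule concrete, here is a minimal Python sketch of how a criteria string like the one above could be checked against an alert. It handles only `field: value` terms joined by `AND` and the leading `-` negation seen in the examples. The platform's actual query parser is not shown in this document, so the function below is an illustrative assumption, and it does not resolve dotted field names.

```python
def matches(criteria, alert):
    """Evaluate 'field: value AND -field: value' terms against an alert dict."""
    for term in criteria.split(" AND "):
        term = term.strip()
        negated = term.startswith("-")  # a leading '-' excludes a value
        field, _, value = term.lstrip("-").partition(":")
        expected = value.strip().strip('"')
        actual = str(alert.get(field.strip(), ""))
        # A plain term must be equal; a negated term must differ.
        if (actual == expected) == negated:
            return False
    return True

criteria = "event_provider: OpsCruise AND -status: Closed"
open_alert = {"event_provider": "OpsCruise", "status": "Open"}
closed_alert = {"event_provider": "OpsCruise", "status": "Closed"}

open_matches = matches(criteria, open_alert)
closed_matches = matches(criteria, closed_alert)
```

With these inputs the open OpsCruise alert matches the policy, and the closed one is excluded by the `-status: Closed` term.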
Section 4: Add Actions
Deduplicate - Removes duplicate alerts to minimize noise and redundancy.
    actions:
      - type: "dedup"
        criteria: "status: Open AND priority: Highest"
        query: "event_provider: Others AND entity_id: $alert.entity_id AND entity_type: $alert.entity_type"
        set:
          severity: $alert.severity
          summary: $alert.entity_name
Enrichment - Adds context and metadata to alerts for a deeper understanding.
    actions:
      - type: enrich
        criteria: "status: Open"
        set:
          priority: "Low"
          description: "Alert enrichment example"
Correlate - Identifies and links related alerts to reveal underlying patterns.
    actions:
      - type: "correlate"
        criteria: "true"
        elementType: "host"
        linkType: "parent"
        query: "status: Open AND entity_type: host AND entity_name: $alert.related_entities.host"
Aggregate - Groups related alerts into a single, comprehensive alert for simplified management.
    actions:
      - type: "aggregate"
        query: "event_provider: Others"
        criteria: "summary: \"WAN TEST\""
        parentAlert:
          fields:
            severity: "Major"
          matchFields:
            summary: "parent alert created"
            entity_type: "WAN_TEST"
            entity_id: "TEST0000006"
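The action examples above refer to fields of the incoming alert through `$alert.<field>` placeholders, including nested paths such as `$alert.related_entities.host`. The Python sketch below shows one way such placeholders could be expanded before a query is run. The `resolve` helper is hypothetical and only illustrates the substitution idea. It raises `KeyError` when a referenced field is missing.

```python
import re

def resolve(template, alert):
    """Replace $alert.<path> placeholders with values from a nested alert dict."""
    def lookup(match):
        value = alert
        for part in match.group(1).split("."):
            value = value[part]  # walk nested keys, e.g. related_entities.host
        return str(value)
    return re.sub(r"\$alert\.(\w+(?:\.\w+)*)", lookup, template)

alert = {"entity_name": "web-01", "related_entities": {"host": "node-7"}}
query = resolve(
    "status: Open AND entity_type: host AND entity_name: $alert.related_entities.host",
    alert,
)
```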
Section 5: Upload configured YAML-file with the policy
Examples:
Type: Correlation
    name: "Aditya_Test_Policy"
    description: "Correlate incoming deployment alerts with other alerts."
    criteria: "event_provider: OpsCruise AND -status: Closed AND entity_type: deployment"
    precedence: 1
    category: "alert_handling"
    actions:
      - type: "correlate"
        criteria: "summary: \"Available replicas\""
        elementType: "container"
        linkType: "child"
        query: "event_provider: OpsCruise AND status: Open AND entity_type: container AND related_entities.deployment_name: $alert.entity_name AND related_entities.namespace: $alert.related_entities.namespace"
        batchSize: 100
      - type: "correlate"
        criteria: "summary: \"Available replicas\""
        elementType: "pod"
        linkType: "child"
        query: "event_provider: OpsCruise AND status: Open AND entity_type: pod AND related_entities.deployment_name: $alert.entity_name AND related_entities.namespace: $alert.related_entities.namespace"
Type: Enrichment
    name: "Enrichment policy"
    description: "Enrich OC alerts."
    criteria: "event_provider: OpsCruise AND -status: Closed AND namespace: \"robot-shop\""
    precedence: 1
    category: "alert_handling"
    actions:
      - type: enrich
        criteria: "status: Open"
        set:
          priority: "Low"
          description: "QA team modified it for testing"
Type: Aggregate
    name: "Parent Creator Rule"
    description: "Create new parent alert if it doesn't exist or correlate incoming alerts with existing parent alert."
    criteria: "event_provider: Others AND -status: Closed AND entity_type: WAN_TEST"
    precedence: 10
    category: "alert_handling"
    actions:
      - type: "aggregate"
        query: "event_provider: Others"
        criteria: "summary: \"WAN TEST\""
        parentAlert:
          fields:
            severity: "Major"
          matchFields:
            summary: "parent alert created"
            entity_type: "WAN_TEST"
            entity_id: "TEST0000006"
Type: Action
    name: "Test_slack_action_policy"
    description: "Sample action policy"
    criteria: "event_provider: \"Virtana IO\" AND -status: Closed AND severity: Critical"
    category: "alert_response"
    precedence: 1
    actions:
      - type: "action"
        criteria: "true"
        action_type: "slack"
        action_name: "Slack_Notification"
        parameters:
          webhook_url: "https://hooks.slack.com/services/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"
          message: "Received alert with entity_name - $alert.entity_name , key - $alert.key , summary - $alert.summary and severity - $alert.severity"
Field | Mandatory | Details |
---|---|---|
name | Yes | Name of the alert policy. |
description | No | Specifics regarding the policy, its execution timeline, and the actions it entails. |
criteria | No | Top-level rules evaluated for every new alert. If they match, the policy's "actions" are carried out. |
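Before uploading a policy, it can help to check the mandatory fields locally. The Python sketch below assumes the YAML file has already been parsed into a dictionary. It checks that the policy has a `name` and that each action has a `type`, which are the fields this document's tables mark as mandatory. The `validate_policy` helper is hypothetical and is not part of the product.

```python
def validate_policy(policy):
    """Return a list of problems found in a parsed policy dict."""
    problems = []
    if not policy.get("name"):
        problems.append("policy: 'name' is mandatory")
    for index, action in enumerate(policy.get("actions", [])):
        if not action.get("type"):
            problems.append(f"actions[{index}]: 'type' is mandatory")
    return problems

good = {"name": "Enrichment policy", "actions": [{"type": "enrich"}]}
bad = {"description": "No name given", "actions": [{"criteria": "true"}]}

good_problems = validate_policy(good)
bad_problems = validate_policy(bad)
```

An empty list means the policy passed these two checks. It does not guarantee that the platform will accept the file.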
Field | Possible Values | Details |
---|---|---|
type | correlate | Mandatory field, represents type of the policy. |
description | (User can define their own values) | Short description of the correlate policy. |
criteria | User-configured criteria such as Alert Severity, Event Type, Custom Tags, location, and so on. | When this condition is satisfied, new alerts will be compared with previously received ones using the specified query criteria. |
linkType | parent/child | If the link type is 'parent,' the policy retrieves previously received alerts and connects the incoming alert to them as a parent. If the link type is 'child,' the policy fetches prior alerts and creates a child relationship with each one. |
query | (User can define their own values) | Solr query to fetch the ingested alerts. |
size | Default: 1 | Number of previously ingested alerts, fetched by the Solr query, to correlate with the incoming alert. |
Field | Possible Values | Details |
---|---|---|
type | Drop | Mandatory field, represents type of the policy. |
description | (User can define their own values) | Short description of the suppression policy. |
criteria | User-configured criteria | If these criteria are met, incoming alerts are dropped. |
Field | Possible Values | Details |
---|---|---|
type | Enrich | Mandatory field, represents type of the policy. |
description | (User can define their own values) | Short description of the enrichment policy. |
criteria | User-configured criteria | If these criteria are met, fields of the incoming alert are enriched according to the assignments configured in this policy. |
set | (User can define their own values) | Keyword followed by the names and values of the alert fields to set when the policy criteria match. |
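The effect of the `set` keyword can be pictured with the short Python sketch below, which overlays the configured field values on a copy of the alert when the action's criteria are met. The `enrich` function and the `criteria_met` flag are hypothetical stand-ins for the platform's own processing.

```python
def enrich(alert, action, criteria_met):
    """Return a copy of the alert with the action's 'set' fields applied."""
    if action.get("type") != "enrich" or not criteria_met:
        return dict(alert)  # unchanged copy
    return {**alert, **action.get("set", {})}

action = {
    "type": "enrich",
    "set": {"priority": "Low", "description": "Alert enrichment example"},
}
alert = {"status": "Open", "priority": "High", "summary": "Disk full"}
enriched = enrich(alert, action, criteria_met=True)
```

Fields named under `set` are overwritten or added, and all other fields of the alert are left as they were.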
Upload Configured YAML-file
Upload the edited YAML file containing the configured policy through the provided upload functionality.
Viewing Policy List
Access the Governance Dashboard: Navigate to the Governance Dashboard interface.
View Policies: From the dashboard, access the list of created policies.
Policy Details
Click on a Policy: Review the policy details, including its enabled or disabled status.
View Syntax: Explore the policy syntax to understand its structure and rules.
Upload Another YAML File: Replace or upload a different YAML file associated with the policy.
Modifying Policies
Upload/Replace YAML File: Click "Upload Another File". Use this option to upload a new YAML file or replace the existing one linked to the policy.
Deleting a Policy: Choose this option to permanently remove the policy from the system.
Note
To apply a policy only to a specific type of alert, add a condition with the appropriate event_provider value to the policy criteria.
To apply the policy to Virtana IO alerts, include the condition - event_provider: Infrastructure Observability
To apply the policy to OpsCruise / Application Monitoring alerts, include the condition - event_provider: OpsCruise
To apply the policy to external alerts only, include the condition - event_provider: Others
Alert Policy Timer Support
Significance of Timer Support in Alert Policies:
Timer support gives you greater control over alert actions and notifications by introducing a delay before actions are triggered. The delay is useful because, during the waiting period, the alert could be closed or its priority or severity might decrease. Without the timer, an issue that resolves quickly could still generate multiple unnecessary notifications. With a delay, only persistent alerts trigger notifications or actions, so users are not spammed with alerts that resolve on their own.
Why is Timer Support Needed?
Action/Notification Delays: The delay ensures that alerts are only acted upon if the criteria remain met after a set period. This reduces notification flooding and allows time for the system to stabilize, potentially preventing unnecessary alerts from being sent.
Scheduled Reminders: Timer support can also trigger recurring actions or notifications on a defined schedule (e.g., every 8 or 24 hours). This feature is especially helpful in sending reminders to external systems or users that the alert remains unresolved and matches the original criteria.
Configuring Timer Schedule in Alert Policies:
You can configure timers in any alert policy by defining a timer schedule within the policy configuration. This will allow you to execute actions/notifications at a delayed time or at recurring intervals.
Mandatory Inputs
Timer interval or delay time
Schedule repetition details (single or repeated execution)
Allowed Combinations: You can specify both delayed and scheduled actions within a single policy.
Workflow of a Time-Based Policy:
Identifying Timer in Policy: When a policy contains a timer, the system checks if it's the first time the action is being executed or if it is a repeat (identified by the execution header).
Calculating Next Execution Time: If the execution is new, the system calculates the next execution time based on the timer and stores the alert's details along with the policy ID.
Skipping Execution for Delay: With the timer in place, the rest of the policy execution is skipped until the next-execution-time is reached.
Scheduler Management: A scheduler runs in the background, regularly checking the database for alerts whose next-execution-time has been reached. These alerts are then processed according to the policy.
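The workflow above can be sketched in Python as follows. An in-memory list stands in for the database, the delay is assumed to be expressed in minutes, and the function names are hypothetical. The sketch only illustrates the store-then-poll pattern and is not the product's scheduler.

```python
from datetime import datetime, timedelta

pending = []  # stands in for the database of scheduled alerts

def on_alert(alert_id, policy_id, delay_minutes, now):
    """First pass: record when to act, then skip the rest of the policy."""
    pending.append({
        "alert_id": alert_id,
        "policy_id": policy_id,
        "next_execution_time": now + timedelta(minutes=delay_minutes),
    })

def scheduler_tick(now):
    """Return and remove entries whose next execution time has been reached."""
    due = [e for e in pending if e["next_execution_time"] <= now]
    pending[:] = [e for e in pending if e["next_execution_time"] > now]
    return due

start = datetime(2024, 1, 1, 12, 0)
on_alert("a-1", "p-1", delay_minutes=15, now=start)
early = scheduler_tick(start + timedelta(minutes=5))   # nothing due yet
late = scheduler_tick(start + timedelta(minutes=15))   # the alert is now due
```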
Single vs. Repeated Schedule:
Single Execution: Actions are executed once after the delay, if the alert remains valid.
Repeated Execution: Actions are executed on a recurring schedule, reminding users or external systems that the alert is still unresolved and meets the criteria.
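The difference between the two schedule types comes down to what happens after an action has run. The Python sketch below is a simplified model with hypothetical parameter names and an assumed unit of minutes. A single schedule yields no further run, while a repeated schedule yields another run as long as the alert still matches the criteria.

```python
from datetime import datetime, timedelta

def next_run(last_run, frequency_minutes, repeat, still_matches):
    """Return the next execution time, or None when no further run is due."""
    if not still_matches:  # alert closed or criteria no longer met
        return None
    if not repeat:         # single execution: nothing after the first run
        return None
    return last_run + timedelta(minutes=frequency_minutes)

ran_at = datetime(2024, 1, 1, 12, 0)
repeated = next_run(ran_at, 15, repeat=True, still_matches=True)
single = next_run(ran_at, 15, repeat=False, still_matches=True)
resolved = next_run(ran_at, 15, repeat=True, still_matches=False)
```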
What Happens When Policy is Disabled or Deleted:
Policy Disabled: All scheduled timers for that policy are halted, and no further actions or notifications are triggered.
Policy Deleted: All scheduled timers and associated actions are removed from the system, and no further alerts are processed based on the deleted policy.
Sample YAML Configuration with Timer
Single Execution Timer:
    name: Single Aggregate Policy
    description: "Test policy to check timer support in Aggregate Policy"
    criteria: "event_provider: OpsCruise AND substatus: Done"
    precedence: 1
    category: "alert_handling"
    scheduler:
      - type: "repeat"
        frequency: "2"
        executeFirst: true
    actions:
      - type: "Aggregate"
        criteria: "true"
        query: "substatus: Done"
        elementType: "container"
        parentAlert:
          fields:
            severity: "Critical"
          matchFields:
            summary: "Parent Alert Created"
            entity_type: "$alert.related_entities.namespace"
Repeated Execution Timer:
    name: "ServiceNow_create_incident_with_timer_policy"
    description: "Create servicenow incident action policy using scheduler delay"
    criteria: "event_provider: OpsCruise AND substatus: Done AND entity_type: CELLULAR_DATA_USAGE_SCHEDULE"
    category: "alert_response"
    scheduler:
      - type: "repeat"
        frequency: "15"
        executeFirst: true
    precedence: 1
    actions:
      - type: "action"
        criteria: "true"
        action_type: "create_incident"
        action_name: "Create_Incident"
        provider_name: "Servicenow"
        parameters:
          category: "inquiry"
          impact: "1"
          priority: "1"
          urgency: "1"
          description: "Incident is created on $alert.entity_name due to $alert.summary"
          short_description: "Causing entity id is $alert.entity_id"
This configuration ensures that actions are delayed or scheduled in a controlled manner, minimizing unnecessary notifications and ensuring effective alert management.