Linux Policies

Before reading about these default policies, note that both the Elevated User CPU and Elevated System CPU policies assume that the CPU Collector is configured to collect aggregate CPU metrics rather than per-core metrics.

Both policies also assume that the metrics are normalized. To satisfy these assumptions, set the percore setting to FALSE (it is TRUE by default) and the normalize setting to TRUE (it is FALSE by default) in your configuration file. After adjusting these settings, save the configuration file and restart the agent to apply the changes. See the Linux agent documentation for more information.
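
As a quick way to reason about the two settings, the sketch below compares a set of configured values against what the Elevated User CPU and Elevated System CPU policies assume. The setting names (percore, normalize) and their defaults come from the text above; the function name and the dictionary representation are hypothetical and are not part of the agent.

```python
# Defaults stated above: percore is TRUE, normalize is FALSE.
COLLECTOR_DEFAULTS = {"percore": True, "normalize": False}

# Values the Elevated User CPU and Elevated System CPU policies assume.
POLICY_ASSUMPTIONS = {"percore": False, "normalize": True}


def settings_to_change(configured):
    """Return the settings (and required values) that still differ
    from what the policies assume.

    `configured` holds only the settings explicitly set in the
    configuration file; anything missing falls back to the default.
    """
    effective = {**COLLECTOR_DEFAULTS, **configured}
    return {
        name: required
        for name, required in POLICY_ASSUMPTIONS.items()
        if effective[name] != required
    }


if __name__ == "__main__":
    # An untouched configuration needs both settings changed.
    print(settings_to_change({}))
    # A configuration that already matches needs nothing.
    print(settings_to_change({"percore": False, "normalize": True}))
```

An empty result means the collector settings already match the assumptions; remember that changes only take effect after the agent is restarted.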

| Policy name | Duration | Condition 1 | (and) Condition 2 | Category | Description |
| --- | --- | --- | --- | --- | --- |
| Linux – CPU Threshold Exceeded | 15 min | cpu.total.utilization.percent has a static threshold > 95% | | CRITICAL | CPU utilization on the SERVER instance has exceeded 95% for at least 15 minutes. |
| Linux – Elevated System CPU | 30 min | netuitive.linux.cpu.total.system.normalized has an upper baseline deviation + a static threshold ≥ 30% | | INFO | This policy generates an Informational event when CPU usage by system processes is higher than normal, but only if the actual value is also at least 30%. Customers typically do not want to be informed of deviations in CPU behavior when the actual values are low; you may want to tune the 30% threshold for your environment. |
| Linux – Elevated User CPU | 30 min | netuitive.linux.cpu.total.user.normalized has an upper baseline deviation + a static threshold ≥ 50% | | INFO | This policy generates an Informational event when CPU usage by user processes is higher than normal, but only if the actual value is also at least 50%. Customers typically do not want to be informed of deviations in CPU behavior when the actual values are low; you may want to tune the 50% threshold for your environment. |
| Linux – Heavy CPU Load | 15 min | netuitive.linux.cpu.total.user.normalized has an upper baseline deviation + an upper contextual deviation | netuitive.linux.loadavg.05.normalized has a static threshold > 2 | CRITICAL | This is a CRITICAL event indicating that the server's CPU is under heavy load, based on upper deviations in CPU utilization percent and the normalized loadavg.05 metric being greater than 2. As a rule of thumb, the run queue size (represented by the load average) should not exceed twice the number of CPUs. |
| Linux – Disk Utilization Threshold Exceeded | 15 min | netuitive.linux.diskspace.*.byte_percentused has a static threshold > 95% | | CRITICAL | The consumed disk space on the SERVER instance has exceeded 95% for at least 15 minutes. |
| Linux – Heavy Disk Load | 15 min | iostat.*.average_queue_length has an upper baseline deviation + an upper contextual deviation | | WARNING | This is a WARNING event indicating that the disk is experiencing heavy load, but performance has not yet been impacted. |
| Linux – Heavy Disk Load with Slow Performance | 15 min | iostat.*.await has an upper baseline deviation + an upper contextual deviation | iostat.*.average_queue_length has an upper baseline deviation + an upper contextual deviation | CRITICAL | This is a CRITICAL event indicating that the disk is not only experiencing heavy load, but that performance is also suffering. |
| Linux – Heartbeat Check Expired | N/A | Heartbeat check has not been received within the timeframe set by the TTL | | WARNING | A heartbeat has not been received from a Linux Agent for at least the past five minutes. Confirm that the agent is running and that no network issues are preventing the check from reaching Metricly in time. If the issue persists, consider increasing the heartbeat ttl setting to allow more time for the check to reach Metricly. |
| Linux – Memory Utilization Threshold Exceeded | 15 min | netuitive.linux.memory.utilization.percent has a static threshold > 95% | | CRITICAL | This is a CRITICAL event raised when memory utilization exceeds 95%. |
| Elevated Memory Usage | 30 min | netuitive.linux.memory.utilization.percent has an upper baseline deviation + a static threshold > 50% | | INFO | This policy generates an Informational event when memory usage is higher than normal, but only if the actual value is also above 50%. Customers typically do not want to be informed of deviations in memory usage when the actual values are low; you may want to tune the 50% threshold for your environment. |
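
The policies above share two condition shapes: a static threshold held for the whole duration, and an upper baseline deviation combined with a static floor. The sketch below illustrates that logic only; it is not how Metricly evaluates policies internally. The function names, the 5-minute sample interval, and the representation of the baseline as a list of upper-band values are all assumptions made for illustration.

```python
def samples_needed(duration_min, interval_min=5):
    """Number of consecutive samples that span the policy duration.

    The 5-minute interval is an assumption, not a documented value.
    """
    return max(1, duration_min // interval_min)


def static_threshold_met(values, threshold, duration_min, interval_min=5):
    """True if every sample in the trailing window is strictly above
    the threshold (e.g. CPU > 95% for 15 minutes)."""
    n = samples_needed(duration_min, interval_min)
    window = values[-n:]
    return len(window) == n and all(v > threshold for v in window)


def elevated_met(values, upper_baseline, floor, duration_min, interval_min=5):
    """True if every sample in the trailing window is above its upper
    baseline value AND at or above the static floor (e.g. system CPU
    deviating upward while also being >= 30% for 30 minutes)."""
    n = samples_needed(duration_min, interval_min)
    window = list(zip(values, upper_baseline))[-n:]
    return len(window) == n and all(v > hi and v >= floor for v, hi in window)


if __name__ == "__main__":
    # Three 5-minute samples cover a 15-minute duration.
    cpu = [96.0, 97.5, 99.1]
    print(static_threshold_met(cpu, 95, duration_min=15))

    # Deviation without reaching the 30% floor should not fire.
    system_cpu = [20, 22, 25, 21, 23, 24]
    baseline_upper = [10] * 6
    print(elevated_met(system_cpu, baseline_upper, floor=30, duration_min=30))
```

The second call shows why the static floor exists: the values sit well above their baseline, yet no event would be raised because they never reach 30%. Raising or lowering `floor` corresponds to tuning the threshold for your environment.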