Best practices for Monitoring policies

Security and navigation

PERMISSIONS Permission to manage the ComStore, view devices, and manage sites. Refer to Permissions.

NAVIGATION Account > Policies

NAVIGATION Sites > select a site > Policies

NAVIGATION Sites > select a site > Devices > select a device > Policies

Hardware: HP Server ILO (Array)

The goal of monitoring ILO via SNMP is to gather enough information to show that there is an issue without overwhelming the service desk with detail about it.

NOTE SNMP monitoring should be as simple as possible, only reporting on the status of the key hardware. It’s very easy to over-complicate SNMP monitoring and, in turn, cause an inordinate level of alerts to be generated when a single alert would be sufficient to indicate that there is a hardware problem.

The following criteria are monitored in this policy:

Offline alert

If the ILO interface is unplugged or otherwise stops responding, the monitoring you have set up will stop returning data. It’s important to know when your monitoring has been interrupted, especially at the hardware level.

Overall system status

This monitor covers the general health of the hardware and is triggered when any component enters a warning or error state.

Memory status

This monitor covers the health of the memory hardware. We are looking for memory errors.

NOTE It is not monitoring memory utilization. That would be an operating system monitor.

Thermal status

This monitor checks for thermal alarms based on the CPU and chassis temperature sensor readings.

Fan status

This monitor covers any failed fans in the chassis.

Fault tolerant PSU status

Rather than checking each power supply unit individually, we just need to know when the PSU is no longer redundant. Where only a single power supply exists, a failure would cause the system to go offline; therefore, monitoring by PSU instance is not required.

RAID controller status

This monitor reads the current RAID controller status. The RAID controller knows which disks are attached and which RAID arrays are configured, and it knows when there is an issue. All we need to do is read the controller status and alert when there is an issue. The engineers will connect to the server anyway to see what is happening, so reporting on the status of each disk is not required.
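The "one alert is enough" approach used throughout this policy can be sketched as follows. This is only an illustration, not the policy's implementation: the function name, the component names, and the numeric status scale (1 = other, 2 = ok, 3 = degraded, 4 = failed) are assumptions made for the example, so check the SNMP MIB for your hardware for the real values.

```python
from typing import Dict, Optional

# Assumed status scale for illustration only; confirm against your device's MIB.
STATUS_NAMES = {1: "other", 2: "ok", 3: "degraded", 4: "failed"}
OK = 2


def summarize_hardware_status(statuses: Dict[str, int]) -> Optional[str]:
    """Collapse per-component statuses into at most one alert message."""
    # Keep only components reporting a state worse than "ok".
    problems = {name: code for name, code in statuses.items() if code > OK}
    if not problems:
        return None
    worst = max(problems.values())
    details = ", ".join(
        f"{name}={STATUS_NAMES.get(code, 'unknown')}"
        for name, code in sorted(problems.items())
    )
    # One message, headed by the worst state seen.
    return f"Hardware {STATUS_NAMES.get(worst, 'unknown')}: {details}"
```

Called with `{"fans": 4, "raid_controller": 3, "memory": 2}`, the sketch returns a single message naming both faulty components rather than raising one alert per component.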

Windows: APC PowerChute Event Log Monitor

Not all APC UPS devices have a network card; therefore, they need to be monitored over USB. The APC PowerChute software needs to be installed and configured to communicate with the UPS device.

IMPORTANT Ensure that the software is also configured to write to the Windows Event Logs.

See below for information about the monitored event IDs.

Event IDs

Event IDs 2003, 2037, 2040, 2044, 3000, 3001, 3003, 3005, 3006, 3014, 3015, 3016, 3017, 3018, 3020, 3021, 3022, 3031, 3103, 3104, 3105, 3106, 3107, 3110, 3111, 3120, 3121

Type	Error or Warning
Log	Application
Source	APCPBEAgent

Windows: Dell Server Event Logs

Some Dell servers ship without an iDRAC interface. Monitoring the hardware of such servers needs to be achieved using the Dell software and the Windows Event Logs.

IMPORTANT You must install all the Dell server management software and drivers for the events to be written to the Windows Event Logs.

See below for information about the monitored event IDs.

Event IDs

All event IDs except 1000, 2131, 2132, 1012, 2189, 2242, 2335, 0

Type	Error or Warning
Log	System
Source	Server Administrator
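The two rule styles used so far (an explicit list of event IDs for APC PowerChute, and "all event IDs except" for Dell) can be modeled with one small structure. The `EventRule` class and its field names are hypothetical and exist only for this sketch; the log, source, type, and ID values are taken from the tables above. Matching here is an exact, case-sensitive string comparison, which is a simplification.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class EventRule:
    """Hypothetical representation of one Event Log Monitor rule."""
    log: str
    source: str
    types: FrozenSet[str]
    include_ids: Optional[FrozenSet[int]] = None  # None means "all event IDs"
    exclude_ids: FrozenSet[int] = frozenset()

    def matches(self, log: str, source: str, event_type: str, event_id: int) -> bool:
        if (log, source) != (self.log, self.source):
            return False
        if event_type not in self.types:
            return False
        if event_id in self.exclude_ids:
            return False
        return self.include_ids is None or event_id in self.include_ids


ERROR_OR_WARNING = frozenset({"Error", "Warning"})

# Explicit list of event IDs (APC PowerChute table).
APC_RULE = EventRule(
    log="Application",
    source="APCPBEAgent",
    types=ERROR_OR_WARNING,
    include_ids=frozenset({
        2003, 2037, 2040, 2044, 3000, 3001, 3003, 3005, 3006,
        3014, 3015, 3016, 3017, 3018, 3020, 3021, 3022, 3031,
        3103, 3104, 3105, 3106, 3107, 3110, 3111, 3120, 3121,
    }),
)

# "All event IDs except ..." (Dell table).
DELL_RULE = EventRule(
    log="System",
    source="Server Administrator",
    types=ERROR_OR_WARNING,
    exclude_ids=frozenset({1000, 2131, 2132, 1012, 2189, 2242, 2335, 0}),
)
```

The "All event IDs" rules later in this topic would then be the same structure with neither an include list nor an exclude list.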

Windows: HP Server Event Logs

Occasionally, HP servers do not have an ILO interface. In such situations, you must fall back to monitoring the server using the HP software and the Windows Event Logs.

IMPORTANT It is very important that you install all the HP management software and device drivers; otherwise, hardware events will not be posted to the Windows Event Logs.

See below for information about the monitored event IDs.

Event IDs

All event IDs

Type	Error or Warning
Log	System
Source	CPQTeamMP
Issue	CPQTeamMP is responsible for the hardware teaming of the network card. If there is an error with the team (for example, the link is down), these are the event logs that will report the issue.
Impact	Network communications will be either down or limited by a link or teaming issue.
Resolution	Check the event log and take appropriate action.

All event IDs

Type	Error or Warning
Log	System
Source	Foundation Agents
Issue	Foundation Agents monitor the health of the HP server hardware and report issues directly to the Windows Event Log. Check the event logs for the nature of the issue being reported.

All event IDs

Type	Error or Warning
Log	System
Source	HP System
Issue	HP System monitors the overall health of the server hardware.

All event IDs

Type	Error or Warning
Log	System
Source	HP System Management Homepage
Issue	HP System monitors the overall health of the server hardware.

All event IDs

Type	Error or Warning
Log	System
Source	HP Wbem Dump
Issue	HP System monitors the overall health of the server hardware.

All event IDs

Type	Error or Warning
Log	System
Source	HpCISSs2
Issue	HP System monitors the overall health of the server hardware.

All event IDs

Type	Error or Warning
Log	System
Source	hpqmgmt
Issue	HP System monitors the overall health of the server hardware.

Windows: Role - Exchange Server

This policy is used to monitor the general health of the Exchange server including the Information Store size. See below for information about the monitored event IDs.

NOTE The policy's Windows Performance Monitors are used for performance graphing. These monitors do not alert.

NOTE The Exchange services are monitored using the server operating system monitor; therefore, they are not required in this Monitoring policy. Refer to Windows: Server.

You can also monitor the message queue length using the policy's performance monitors. High numbers of messages in the queue would indicate either inbound or outbound spam. Inbound spam would be a sales driver to supply your customer with an anti-spam solution. Outbound spam would be an indicator that there is an internal machine infected with a spambot.

TIP Don’t set the monitor value too low; otherwise, you will trigger lots of alerts for normal queue levels.
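One way to think about avoiding alerts for normal queue levels is to require the queue to stay above the threshold for several samples in a row. The sketch below illustrates that idea only; the threshold of 250 messages, the three-sample window, and the function name are hypothetical values and are not settings taken from the policy.

```python
from typing import Iterable


def queue_alert(samples: Iterable[int], threshold: int = 250, sustained: int = 3) -> bool:
    """Return True if the queue exceeds `threshold` for `sustained` samples in a row."""
    run = 0
    for length in samples:
        # Extend the run while above the threshold, otherwise start over.
        run = run + 1 if length > threshold else 0
        if run >= sustained:
            return True
    return False
```

A single spike such as `[10, 400, 20, 30]` does not trigger, whereas a sustained backlog such as `[300, 310, 500, 40]` does.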

Event IDs

Event ID 8206

Type	Error
Log	Application
Source	MSExchangeFBPublish
Further Information	Event ID: 8206

Event ID 1003

Type	Error
Log	Application
Source	MSExchangeIS
Further Information	Event ID: 1003
Issue	The disk is full, Exchange is shutting down.
Resolution	Clear disk space and start Exchange Information Store again.

Event ID 1112

Type	Warning
Log	Application
Source	MSExchangeIS
Further Information	Event ID: 1112
Issue	The database (named in the error) has reached its maximum allowed size.
Impact	No new mail can be sent or received.
Resolution	Delete items and then shrink the database with an offline defragmentation using eseutil. Alternatively, create an archive database and move old items to the archive.

Event ID 1113

Type	Error
Log	Application
Source	MSExchangeIS
Further Information	Event ID: 1113
Issue	The disk that holds the Exchange log files is full.
Impact	No new mail can be sent or received.
Resolution	Clean up disk space on the log volume or, alternatively, move the log files to a disk with more free space.

Event ID 5000

Type	Error
Log	Application
Source	MSExchangeIS
Further Information	Event ID: 5000
Issue	The Information Store cannot start.
Impact	Exchange is not running; therefore, email is down.
Resolution	Other events will be logged that suggest why the Information Store cannot start. Check for events logged about the same time as the Information Store attempted to start.

Event ID 1159

Type	Error
Log	Application
Source	MSExchangeIS
Further Information	Event ID: 1159
Issue	An outside process has errors accessing the Exchange database. The event logged will detail the process name.
Impact	Depends on the process that has errors.
Resolution	Check the event log to see which process has errors accessing the database.

Event ID 9690

Type	Error
Log	Application
Source	MSExchangeIS
Further Information	Event ID: 9690
Issue	The database is above the maximum size limit.
Impact	Exchange is down.

Event ID 9688

Type	Warning
Log	Application
Source	MSExchangeIS
Further Information	Event ID: 9688

Event ID 1005

Type	Error
Log	Application
Source	MSExchangeSA
Further Information	Event ID: 1005
Issue	An error occurred but at this point we don't know what. Check the application event log for more information.
Impact	Depends on the actual error.

Event ID 1

Type	Error
Log	Application
Source	WSH
Issue	This is an IIS error that indicates a problem with a local hosted website (for example, Outlook Web Access).
Impact	The local hosted website may not be working.
Resolution	Restart IIS.

All event IDs

Type	Error or Warning
Log	Application
Source	CDOEXM
Issue	CDOEXM are Exchange Collaboration objects and an issue with one or more may mean that calendars, etc. are not functioning correctly.
Resolution	Depends on the exact issue reported in the event log.

All event IDs

Type	Error or Warning
Log	Application
Source	IISADMIN
Issue	An issue with IIS could mean all or part of Exchange is not working. Older versions of Exchange use IIS to present data to clients.
Impact	Outlook may not connect to Exchange or certain information may not be available.
Resolution	Depends on the exact issue reported in the event log.

All event IDs

Type	Error or Warning
Log	Application
Source	IMAP4SVC
Issue	IMAP is an older email protocol that is not often used anymore but, if it is, we need to check for IMAP errors on the Exchange server.
Impact	Clients using IMAP to receive and send email may not be functioning.
Resolution	Depends on the exact issue reported in the event log.

All event IDs

Type	Error or Warning
Log	Application
Source	MSExchangeDSAccess
Issue	This is an integral part of Exchange and any errors or warnings need to be investigated.
Resolution	Depends on the exact issue reported in the event log.

Event ID 8213

Type	Error
Log	Application
Source	MSExchangeFBPublish
Further Information	Event ID: 8213
Issue	The Exchange System Attendant service is having trouble creating a connection / session to a mailbox.
Impact	Maintenance routines may not be running against a mailbox / auto-replies may not be working.
Resolution	Depends on the exact issue reported in the event log.

All event IDs except 9327, 9320, 9386, 5008, 9040, 9325

Type	Error or Warning
Log	Application
Source	MSExchangeSA
Issue	The Exchange System Attendant takes care of Exchange automation, such as rule processing and auto-replies.
Impact	Various, depending on the exact issue reported in the event log.
Resolution	Depends on the exact issue reported in the event log.

All event IDs

Type	Error or Warning
Log	Application
Source	POP3SVC
Issue	POP3 is used by older email clients. If the POP3 service has errors, the email client may experience issues.
Resolution	Depends on the exact issue reported in the event log.

All event IDs

Type	Error or Warning
Log	Application
Source	SMTPSvc
Issue	SMTP is used by Exchange to send and receive emails. If there are issues with the SMTP service, the email flow will not be working.
Resolution	Depends on the exact issue reported in the event log.

Event ID 231

Type	Error
Log	Application
Source	ExchangeStoreDB
Further Information	Event ID: 231
Issue	The Exchange Store has exceeded its maximum size and will not mount.
Impact	Exchange is down.

Event ID 15002

Type	Warning
Log	Application
Source	MSExchangeTransport
Further Information	Event ID: 15002
Issue	Server is almost out of disk space. Email processing has stopped.
Impact	Exchange has stopped processing email. Email is not being sent or received.
Resolution	Free up disk space and restart the Transport service.
Note	Back Pressure is Microsoft’s term for load. While it can be measured as part of the Exchange policy, generally the only way to resolve back pressure alerts is either to upgrade to more capable hardware or reduce server load. An Exchange server that enters this state may refuse new connections, causing issues with email delivery.

Event ID 15004

Type	Error
Log	Application
Source	MSExchangeTransport
Further Information	Event ID: 15004
Issue	Server is out of resources.
Impact	Exchange has stopped processing email. Email is not being sent or received.
Resolution	Free up resources and restart the Transport service.
Note	Back Pressure is Microsoft’s term for load. While it can be measured as part of the Exchange policy, generally the only way to resolve back pressure alerts is either to upgrade to more capable hardware or reduce server load. An Exchange server that enters this state may refuse new connections, causing issues with email delivery.

Event ID 15006

Type	Error
Log	Application
Source	MSExchangeTransport
Further Information	Event ID: 15006
Issue	Server is almost out of disk space. Email processing has stopped.
Impact	Exchange has stopped processing email. Email is not being sent or received.
Resolution	Free up disk space and restart the Transport service.
Note	Back Pressure is Microsoft’s term for load. While it can be measured as part of the Exchange policy, generally the only way to resolve back pressure alerts is either to upgrade to more capable hardware or reduce server load. An Exchange server that enters this state may refuse new connections, causing issues with email delivery.

Event ID 24

Type	Error
Log	Application
Source	MSExchange Web Services
Further Information	Event ID: 24
Issue	Exchange certificate issue
Note	Monitoring Exchange SSL certificate expiry is essential, as an expired certificate may cause email issues for clients.

Event ID 25

Type	Error or Warning
Log	Application
Source	MSExchange Web Services
Issue	Exchange certificate has expired or will expire soon.
Note	Monitoring Exchange SSL certificate expiry is essential, as an expired certificate may cause email issues for clients.

Event ID 26

Type	Warning
Log	Application
Source	MSExchange Web Services
Further Information	Event ID: 26
Note	Monitoring Exchange SSL certificate expiry is essential, as an expired certificate may cause email issues for clients.

Event ID 12015

Type	Error
Log	Application
Source	MSExchangeTransport
Further Information	Event ID: 12015
Issue	A certificate has expired. Use the thumbprint to show which certificate has expired.
Resolution	If a certificate is not used anymore, it should be detached from services and not just abandoned as this will cause this error. People often assign new certificates to services but fail to remove the old ones. This is the most common cause for this event.
Note	Monitoring Exchange SSL certificate expiry is essential, as an expired certificate may cause email issues for clients.

Event ID 12018

Type	Warning
Log	Application
Source	MSExchangeTransport
Further Information	Event ID: 12018
Issue	A certificate will expire soon. This may log every hour or each time the Transport service starts. This warning cannot be disabled.
Resolution	Replace the certificate as soon as possible. You do not have to wait for it to expire before you replace it, as any remaining time can be added to the new certificate if the new certificate request has been generated from the current server and the certificate has been requested from the same vendor.
Note	Monitoring Exchange SSL certificate expiry is essential, as an expired certificate may cause email issues for clients.
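The certificate events above are raised by Exchange itself. As a complementary illustration, the sketch below shows how the remaining lifetime of a certificate could be computed from its notAfter timestamp. It is not part of the policy; the function name is hypothetical, and any renewal window you compare the result against is your own choice.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days from `now` until the certificate's notAfter time (negative if expired)."""
    # cert_time_to_seconds expects the "Jan  5 09:34:43 2018 GMT" format
    # and returns seconds since the epoch.
    expires = ssl.cert_time_to_seconds(not_after)
    return int((expires - now.timestamp()) // 86400)


# Example with a fixed timestamp so the result is reproducible.
remaining = days_until_expiry(
    "Jan  5 09:34:43 2018 GMT",
    now=datetime(2018, 1, 1, tzinfo=timezone.utc),
)
```

In Python, a notAfter string in this format is what `SSLSocket.getpeercert()` reports for a validated peer certificate, so the same function could be fed from a live connection.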

Windows: Role - SQL Server

Monitoring SQL servers can generate a lot of noise depending on how well the SQL instance has been configured. It may, for example, run the server out of RAM if SQL has not been capped, causing errors to be raised by this monitor. Therefore, before using this policy, ensure that the SQL server is properly configured.

See below for information about the monitored event IDs.

NOTE The policy's Windows Performance Monitors are used for performance graphing. These monitors do not alert.

Event IDs

All event IDs

Type	Error or Warning
Log	Application
Source	MSSQLSERVER

All event IDs

Type	Error or Warning
Log	Application
Source	SQL Server Report Service

All event IDs

Type	Error or Warning
Log	Application
Source	SQLBrowser

All event IDs

Type	Error or Warning
Log	Application
Source	SQLDUMPER

All event IDs

Type	Error or Warning
Log	Application
Source	SQLSERVERAGENT

All event IDs

Type	Error or Warning
Log	Application
Source	SQLVDI

All event IDs

Type	Error or Warning
Log	Application
Source	SQLWRITER

Windows: Server

This policy was set up to work alongside any other policy you might apply for the various roles the server is handling. The goal of this policy is to pay attention to the inner workings and health of the server itself, rather than to any of its duties.

NOTE The policy's Windows Performance Monitors are used for performance graphing. These monitors do not alert.

The following criteria are monitored in this policy:

Windows service monitoring

Of note in this policy is the Event Log Monitor which is configured to look for specific codes from the Windows Service Control Manager (SCM). These codes are raised when the SCM cannot start a particular service due to an error.

Event IDs

Event ID 7000

Type	Error
Log	System
Source	Service Control Manager
Further Information	Event ID: 7000
Issue	A service <named> failed to start.

Event ID 7013

Type	Error
Log	System
Source	Service Control Manager
Further Information	Event ID: 7013
Issue	The username or password configured for a service is incorrect. The service failed to start.

Event ID 7038

Type	Error
Log	System
Source	Service Control Manager
Further Information	Event ID: 7038
Issue	The username or password configured for a service is incorrect. The service failed to start.

The Event Log Monitor works best when used alongside the Stopped Auto-Start Services Monitor component, available for free from the ComStore. We recommend that all users applying this policy add this component to achieve full service monitoring. This additional Component Monitor does not need to be configured to raise tickets, as some services are intended to start and stop automatically (for example, Windows Update).

The Component Monitor can be configured as follows:

Reboot required

Servers may need a reboot for various reasons, such as patch installs, software changes, etc. Therefore, we would also recommend that you use the policy alongside the Reboot Required Monitor component. It's available for free from the ComStore and can be configured as follows:

The monitor checks all the areas that indicate a reboot is required but only every four hours. It’s not necessary to check more frequently.

TIP It is recommended to configure an alert email with the device’s hostname in the subject.

Bugcheck monitoring

A "bugcheck" is Microsoft’s official term for the Blue Screen of Death. The following monitors check whether a system has just rebooted as a result of a bugcheck screen. This is particularly useful in deployments where high server load is expected during the night when nobody is present to monitor the machine physically. For example, a backup may prompt a bugcheck. If the server has recovered, you would only see that the backup failed and investigate a backup issue rather than the real cause of the bugcheck.

Event IDs

Event ID 1000

Type	Error
Log	System
Source	Save Dump
Further Information	Event ID: 1000

Event ID 1001

Type	Information
Log	System
Source	Save Dump
Further Information	Event ID: 1001

Event ID 6008

Type	Error
Log	System
Source	EventLog
Further Information	Event ID: 6008

Disk health

The following monitors check if the disks installed in the system are functioning optimally and are not reporting any warnings. The checks are performed at the operating system level, not at the hardware level; that is, the monitors will read the standard event logs the operating system creates.

Event IDs

Event ID 7

Type	Error
Log	System
Source	Disk
Further Information	Event ID: 7

Event ID 33

Type	Warning
Log	System
Source	Disk
Further Information	Event ID: 33

Event ID 57

Type	Warning
Log	System
Source	Disk
Further Information	Event ID: 57

Event ID 55

Type	Error
Log	System
Source	NTFS
Further Information	Event ID: 55

Event ID 6

Type	Error
Log	System
Source	Ftdisk
Further Information	Event ID: 6

Event ID 57

Type	Warning
Log	System
Source	Ftdisk
Further Information	Event ID: 57

Two Disk Space Monitors are also included. One alerts at 90% as an advisory and the other at 98% as a critical alert; both thresholds are configurable. Other monitors in the policy can be adjusted or disabled in the same way; for example, users running an Exchange server may wish to disable the RAM alerts.
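The two-tier disk space thresholds can be expressed as a small classification function. This is a sketch under the assumption that the percentages refer to used space; the function and constant names are hypothetical.

```python
from typing import Optional

ADVISORY_PERCENT = 90.0  # advisory threshold from the text above
CRITICAL_PERCENT = 98.0  # critical threshold from the text above


def classify_disk_usage(used: int, total: int) -> Optional[str]:
    """Return "critical", "advisory", or None for a volume's used space."""
    percent = used * 100 / total
    # Check the stricter threshold first so it wins when both apply.
    if percent >= CRITICAL_PERCENT:
        return "critical"
    if percent >= ADVISORY_PERCENT:
        return "advisory"
    return None
```

The `used` and `total` values could come from `shutil.disk_usage(path)`, which returns both in bytes.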

Server offline

The Online Status Monitor is configured to alert after fifteen minutes of downtime rather than a shorter interval. This gives the device time to restart in instances of expected downtime (for example, a reboot to install an update) and avoids notifying off-duty technicians about short-lived events like brief power cuts. During the day, any unexpected server downtime can be expected to be caught by monitors interfacing with the hardware.
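The effect of the fifteen-minute window can be sketched as a simple grace-period check. The function name and the idea of a `last_seen` timestamp are assumptions made for the example, not a description of how the Online Status Monitor is implemented.

```python
from datetime import datetime, timedelta

OFFLINE_GRACE = timedelta(minutes=15)  # window described in the text above


def should_alert_offline(
    last_seen: datetime, now: datetime, grace: timedelta = OFFLINE_GRACE
) -> bool:
    """True once the device has been silent for at least the grace period."""
    return now - last_seen >= grace
```

A server that reboots for an update and checks in again after five minutes never reaches the threshold, so no alert is raised.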

IP conflicts

IP conflicts are a potentially large source of issues in any networking scenario. In the best case, an IP conflict is just an inconvenience (for example, if it occurs between a printer and another device), but in the worst case it can cause disarray. It's therefore important to check the event logs for event IDs that indicate the server has detected an IP conflict with another device.

Event IDs

Event ID 4319

Type	Error
Log	System
Source	NetBT
Further Information	Event ID: 4319
Issue	Another device on the network has the same name as this device.
Impact	Both devices may have trouble communicating on the network. File shares and printers may be inaccessible.
Resolution	Check the server event log to identify the additional device with the same name and then change the other device name. You may also have to change internal DNS records to reflect the new name of the other device. Do not change the server name. Make sure the other device is not referenced by any other software and follow change control. It’s also possible that the server originally had a NIC Team and this has been subsequently broken. The server may now have two IP addresses, and the same NetBIOS name is being presented on both.

Event ID 4198

Type	Error
Log	System
Source	Tcpip
Further Information	Event ID: 4198
Issue	Another device on the network has the same IP address as this device.
Impact	Both devices may have trouble communicating on the network. File shares and printers may be inaccessible. The server may have disabled the network card to prevent further conflicts.
Resolution	Check the server event log to identify the other device that has the same IP address as the server. Check that the remote device has not been given the IP address by the DHCP server. You may need to make an exclusion in DHCP. Change the IP address of the other device (do not change the server IP) and then check if you need to change any DNS entries for the device. There may be software configured to address the other device by IP, so make sure you also update any software including printer shares. Follow change control for this.

Event ID 4199

Type	Error
Log	System
Source	Tcpip
Further Information	Event ID: 4199
Issue	Another device on the network has the same IP address as this device.
Impact	Both devices may have trouble communicating on the network. File shares and printers may be inaccessible.
Resolution	Check the server event log to identify the other device that has the same IP address as the server. Check that the remote device has not been given the IP address by the DHCP server. You may need to make an exclusion in DHCP. Change the IP address of the other device (do not change the server IP) and then check if you need to change any DNS entries for the device. There may be software configured to address the other device by IP, so make sure you also update any software including printer shares. Follow change control for this.

Patching

We recommend using a standard Patch Monitor to trigger when a patch installation has failed. In some instances, this can get quite noisy if a server is particularly out of date. In such cases, it is advisable to remove the Patch Monitor until the device is manually updated and then re-implement it.

CPU and memory monitoring

The policy checks for 100% usage of RAM, as well as CPU. Such scenarios are rare in general usage, even on a high-load server. Therefore, both of these alerts should be treated as critical.

TIP As with the Reboot Required Monitor, it is advisable to place the wildcard for the device’s hostname within the subject line of any alert email that has been configured for sending.

Windows: Workstation

The goal with the Windows Workstation Monitoring Policy is to keep things as simple as possible and focus only on core Windows health metrics. The policy’s aim is simply to alert when the workstation is near (or past) the brink.

The following criteria are monitored in this policy:

Disk monitoring

We are monitoring for corrupt file systems and bad sectors as these would indicate that the HDD could be failing but would not be obvious to the end user. Catching these early and replacing the disk in a controlled manner is better than having a dead laptop / desktop that needs to be urgently rebuilt. See below for information about the monitored event IDs.

The Disk Space Monitor catches those workstations that have limited disk space on the system drive. If a device completely runs out of disk space, it may not boot, again leading to an urgent issue that could have been dealt with earlier in a controlled manner using a disk cleanup script as an auto-response.

Event IDs

Event ID 55

Type	Error
Log	System
Source	NTFS
Further Information	Event ID: 55

Event ID 7

Type	Error
Log	System
Source	Disk
Further Information	Event ID: 7

Event ID 33

Type	Warning
Log	System
Source	Disk
Further Information	Event ID: 33

Event ID 57

Type	Warning
Log	System
Source	Disk
Further Information	Event ID: 57

Event ID 6008

Type	Warning
Log	System
Source	EventLog
Further Information	Event ID: 6008

Service monitors

Monitoring key workstation services and auto-starting them if they stop keeps these workstations running and keeps the number of issues a service desk has to deal with to a minimum. Because the response to a stopped service is to start it again, these monitors are set to auto-resolve and are not generally allowed to create tickets.

The following services are monitored and auto-started:

Service	Description
EventLog	Allows events to be logged.
W32time	Syncs local workstation time with either a domain controller or online time server.
Spooler	Enables printing.
LanManWorkstation	Provides access to network resources.
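The auto-start response for the services in the table above can be sketched as follows. The `restart_stopped` function, the `"stopped"` and `"running"` state strings, and the `start_service` callback are hypothetical; in practice the callback would wrap whatever mechanism you use to start a Windows service.

```python
from typing import Callable, Dict, List

# Service names taken from the table above.
MONITORED_SERVICES = ("EventLog", "W32time", "Spooler", "LanManWorkstation")


def restart_stopped(
    states: Dict[str, str], start_service: Callable[[str], None]
) -> List[str]:
    """Start every monitored service reported as stopped; return the names started."""
    started = []
    for name in MONITORED_SERVICES:
        if states.get(name) == "stopped":
            start_service(name)
            started.append(name)
    return started
```

Services outside the monitored list are ignored, which mirrors the policy's intent of keeping only key services running without raising tickets.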
