Best practices for Monitoring policies
SECURITY Permission to manage the ComStore, view devices, and manage sites. Refer to Permissions.
NAVIGATION Account > Policies
NAVIGATION Sites > select a site > Policies
NAVIGATION Sites > select a site > Devices > select a device > Policies
Overview
NOTE Before reading this document, we recommend you get familiar with the following topics: Policies, ComStore, Creating a component, and Component Library.
In addition to setting up your own policies, a selection of Monitoring policies is freely available to download from the ComStore or the Policies page. These include best practices to monitor the most common platforms and applications such as Exchange and SQL.
These Monitoring policies aim to provide a best-practice solution for the most typically encountered usage scenarios involving Datto RMM. However, they only serve as guidelines and may require modification depending on device configuration. (For example, ensure that network Monitoring policies are querying the correct SNMP OIDs of your devices.)
We encourage you to try these policies on your own devices to provide a solid, baseline monitoring solution to which your own monitoring can be added. Once added to your account, a policy downloaded from the ComStore or the Policies page becomes a regular Monitoring policy, which can be configured and modified as required. Targets typically must be configured before use.
For more information, refer to Download a ComStore policy.
The following are just some of the Monitoring policies provided by Datto RMM, along with the criteria each one monitors and other relevant information.
The goal of monitoring ILO via SNMP is to gather enough information to show that there is an issue without overwhelming the service desk with detail about it.
NOTE SNMP monitoring should be as simple as possible, only reporting on the status of the key hardware. It’s very easy to over-complicate SNMP monitoring and, in turn, cause an inordinate level of alerts to be generated when a single alert would be sufficient to indicate that there is a hardware problem.
The following criteria are monitored in this policy:
Offline alert
If the ILO interface is unplugged or otherwise stops responding, the monitoring you have set up will stop returning data. It’s important to know when your monitoring has been interrupted, especially at the hardware level.
Overall system status
This monitor covers the general health of the hardware and is triggered when any component enters a warning or error state.
Memory status
This monitor covers the health of the memory hardware. We are looking for memory errors.
NOTE It is not monitoring memory utilization. That would be an operating system monitor.
Thermal status
This monitor checks for thermal alarms based on the CPU and chassis temperature sensor readings.
Fan status
This monitor covers any failed fans in the chassis.
Fault tolerant PSU status
Rather than checking each power supply unit individually, we just need to know when the PSU is no longer redundant. Where only a single power supply exists, a failure would cause the system to go offline; therefore, monitoring by PSU instance is not required.
RAID controller status
This monitor reads the current RAID controller status. The RAID controller knows which disks are attached and which RAID arrays are configured, and it knows when there is an issue. All we need to do is read the controller status and alert when there is an issue. The engineers will connect to the server anyway to see what is happening, so reporting on the status of each disk is not required.
Not all APC UPS devices have a network card; those without one need to be monitored over USB. The APC PowerChute software needs to be installed and configured to communicate with the UPS device.
IMPORTANT Ensure that the software is also configured to write to the Windows Event Logs.
See below for information about the monitored event IDs.
Event IDs 2003, 2037, 2040, 2044, 3000, 3001, 3003, 3005, 3006, 3014, 3015, 3016, 3017, 3018, 3020, 3021, 3022, 3031, 3103, 3104, 3105, 3106, 3107, 3110, 3111, 3120, 3121
Type | Error or Warning |
---|---|
Log | Application |
Source | APCPBEAgent |
Some Dell servers ship without an iDRAC interface. Monitoring the hardware of such servers needs to be achieved using the Dell software and the Windows Event Logs.
IMPORTANT You must install all the Dell server management software and drivers for the events to be written to the Windows Event Logs.
See below for information about the monitored event IDs.
All event IDs except 1000, 2131, 2132, 1012, 2189, 2242, 2335, 0
Type | Error or Warning |
---|---|
Log | System |
Source | Server Administrator |
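The APC and Dell filters above select event IDs in two different ways: the APC filter matches only a listed set of IDs, while the Dell filter matches every ID except a listed set. A minimal Python sketch of that matching logic follows. The `Event` and `EventFilter` types and their field names are hypothetical, introduced for illustration only; they are not part of Datto RMM.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Event:
    """A simplified Windows event record (hypothetical shape)."""
    log: str
    source: str
    level: str
    event_id: int


@dataclass(frozen=True)
class EventFilter:
    """Hypothetical filter: match on log, source, level, and event ID."""
    log: str
    source: str
    levels: FrozenSet[str]
    include_ids: Optional[FrozenSet[int]] = None  # None means "all event IDs"
    exclude_ids: FrozenSet[int] = frozenset()

    def matches(self, event: Event) -> bool:
        if event.log != self.log or event.source != self.source:
            return False
        if event.level not in self.levels:
            return False
        if event.event_id in self.exclude_ids:
            return False
        return self.include_ids is None or event.event_id in self.include_ids


ERROR_OR_WARNING = frozenset({"Error", "Warning"})

# Include-list filter: only the APC event IDs listed in the table above.
apc_filter = EventFilter(
    log="Application",
    source="APCPBEAgent",
    levels=ERROR_OR_WARNING,
    include_ids=frozenset({
        2003, 2037, 2040, 2044, 3000, 3001, 3003, 3005, 3006,
        3014, 3015, 3016, 3017, 3018, 3020, 3021, 3022, 3031,
        3103, 3104, 3105, 3106, 3107, 3110, 3111, 3120, 3121,
    }),
)

# Exclude-list filter: all Dell event IDs except the ones listed above.
dell_filter = EventFilter(
    log="System",
    source="Server Administrator",
    levels=ERROR_OR_WARNING,
    exclude_ids=frozenset({1000, 2131, 2132, 1012, 2189, 2242, 2335, 0}),
)
```

Modelling "all event IDs" as `include_ids=None` keeps the same structure usable for the sources later in this document that monitor every event ID.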
Occasionally, HP servers do not have an ILO interface. In such situations, you must fall back to monitoring the server using the HP software and the Windows Event Logs.
IMPORTANT It is very important that you install all the HP management software and device drivers; otherwise, hardware events will not be posted to the Windows Event Logs.
See below for information about the monitored event IDs.
All event IDs
Type | Error or Warning |
---|---|
Log | System |
Source | CPQTeamMP |
Issue | CPQTeamMP is responsible for the hardware teaming of the network card. If there is an error with the team (for example, the link is down), these are the event logs that will report the issue. |
Impact | Network communications will be either down or limited by a link or teaming issue. |
Resolution | Check the event log and take appropriate action. |
All event IDs
Type | Error or Warning |
---|---|
Log | System |
Source | Foundation Agents |
Issue | Foundation Agents monitor the health of the HP server hardware and report issues directly to the Windows Event Log. Check the event logs for the nature of the issue being reported. |
All event IDs
Type | Error or Warning |
---|---|
Log | System |
Source | HP System |
Issue | HP System monitors the overall health of the server hardware. |
All event IDs
Type | Error or Warning |
---|---|
Log | System |
Source | HP System Management Homepage |
Issue | HP System monitors the overall health of the server hardware. |
All event IDs
Type | Error or Warning |
---|---|
Log | System |
Source | HP Wbem Dump |
Issue | HP System monitors the overall health of the server hardware. |
All event IDs
Type | Error or Warning |
---|---|
Log | System |
Source | HpCISSs2 |
Issue | HP System monitors the overall health of the server hardware. |
All event IDs
Type | Error or Warning |
---|---|
Log | System |
Source | hpqmgmt |
Issue | HP System monitors the overall health of the server hardware. |
This policy is used to monitor the general health of the Exchange server including the Information Store size. See below for information about the monitored event IDs.
NOTE The policy's Windows Performance Monitors are used for performance graphing. These monitors do not alert.
NOTE The Exchange services are monitored using the server operating system monitor; therefore, they are not required in this Monitoring policy. Refer to Windows: Server.
You can also monitor the message queue length using the policy's performance monitors. High numbers of messages in the queue can indicate either inbound or outbound spam. Inbound spam may be an opportunity to supply your customer with an anti-spam solution. Outbound spam suggests that an internal machine is infected with a spambot.
NOTE Don’t set the monitor value too low; otherwise, you will trigger lots of alerts for normal queue levels.
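One way to avoid a threshold that fires on normal traffic is to derive it from queue lengths observed during normal operation. The Python sketch below is only an illustration: the function name, the sample values, and the rule of thumb (a multiple of the highest normal reading, with a floor) are assumptions, not a Datto RMM or Microsoft recommendation.

```python
from typing import Sequence


def suggest_queue_threshold(
    normal_samples: Sequence[int],
    margin: float = 2.0,
    minimum: int = 50,
) -> int:
    """Suggest an alert threshold above the busiest normal queue length.

    margin and minimum are illustrative defaults, not recommended values.
    """
    if not normal_samples:
        return minimum
    return max(minimum, int(max(normal_samples) * margin))


# Hypothetical queue lengths sampled during a normal working week.
samples = [3, 12, 7, 40, 18]
threshold = suggest_queue_threshold(samples)
```

Whatever rule is used, the point is that the threshold should sit clearly above the peaks seen in normal use for that particular server.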
Event ID 8206
Type | Error |
---|---|
Log | Application |
Source | MSExchangeFBPublish |
Further Information | Event ID: 8206 |
Event ID 1003
Type | Error |
---|---|
Log | Application |
Source | MSExchangeIS |
Further Information | Event ID: 1003 |
Issue | The disk is full and Exchange is shutting down. |
Resolution | Clear disk space and start Exchange Information Store again. |
Event ID 1112
Type | Warning |
---|---|
Log | Application |
Source | MSExchangeIS |
Further Information | Event ID: 1112 |
Issue | The database (named in the error) has reached its maximum allowed size. |
Impact | No new mail can be sent or received. |
Resolution | Delete items and then shrink the database by running an offline defragmentation with eseutil. Alternatively, create an archive database and move old items to the archive. |
Event ID 1113
Type | Error |
---|---|
Log | Application |
Source | MSExchangeIS |
Further Information | Event ID: 1113 |
Issue | The disk that holds the Exchange log files is full. |
Impact | No new mail can be sent or received. |
Resolution | Clean up disk space on the log volume or, alternatively, move the log files to a disk with more free space. |
Event ID 5000
Type | Error |
---|---|
Log | Application |
Source | MSExchangeIS |
Further Information | Event ID: 5000 |
Issue | The Information Store cannot start. |
Impact | Exchange is not running; therefore, email is down. |
Resolution | Other events will be logged that suggest why the Information Store cannot start. Check for events logged about the same time as the Information Store attempted to start. |
Event ID 1159
Type | Error |
---|---|
Log | Application |
Source | MSExchangeIS |
Further Information | Event ID: 1159 |
Issue | An outside process has errors accessing the Exchange database. The event logged will detail the process name. |
Impact | Depends on the process that has errors. |
Resolution | Check the event log to see which process has errors accessing the database. |
Event ID 9690
Type | Error |
---|---|
Log | Application |
Source | MSExchangeIS |
Further Information | Event ID: 9690 |
Issue | The database is above the maximum size limit. |
Impact | Exchange is down. |
Event ID 9688
Type | Warning |
---|---|
Log | Application |
Source | MSExchangeIS |
Further Information | Event ID: 9688 |
Event ID 1005
Type | Error |
---|---|
Log | Application |
Source | MSExchangeSA |
Further Information | Event ID: 1005 |
Issue | An error occurred, but the event alone does not identify the cause. Check the application event log for more information. |
Impact | Depends on the actual error. |
Event ID 1
Type | Error |
---|---|
Log | Application |
Source | WSH |
Issue | This is an IIS error that indicates a problem with a local hosted website (for example, Outlook Web Access). |
Impact | The local hosted website may not be working. |
Resolution | Restart IIS. |
All event IDs
Type | Error or Warning |
---|---|
Log | Application |
Source | CDOEXM |
Issue | CDOEXM are Exchange Collaboration objects and an issue with one or more may mean that calendars, etc. are not functioning correctly. |
Resolution | Depends on the exact issue reported in the event log. |
All event IDs
Type | Error or Warning |
---|---|
Log | Application |
Source | IISADMIN |
Issue | An issue with IIS could mean all or part of Exchange is not working. Older versions of Exchange use IIS to present data to clients. |
Impact | Outlook may not connect to Exchange or certain information may not be available. |
Resolution | Depends on the exact issue reported in the event log. |
All event IDs
Type | Error or Warning |
---|---|
Log | Application |
Source | IMAP4SVC |
Issue | IMAP is an older email protocol that is not often used anymore but, if it is, we need to check for IMAP errors on the Exchange server. |
Impact | Clients using IMAP to receive and send email may not be functioning. |
Resolution | Depends on the exact issue reported in the event log. |
All event IDs
Type | Error or Warning |
---|---|
Log | Application |
Source | MSExchangeDSAccess |
Issue | This is an integral part of Exchange and any errors or warnings need to be investigated. |
Resolution | Depends on the exact issue reported in the event log. |
Event ID 8213
Type | Error |
---|---|
Log | Application |
Source | MSExchangeFBPublish |
Further Information | Event ID: 8213 |
Issue | The Exchange System Attendant service is having trouble creating a connection / session to a mailbox. |
Impact | Maintenance routines may not be running against a mailbox / auto-replies may not be working. |
Resolution | Depends on the exact issue reported in the event log. |
All event IDs except 9327, 9320, 9386, 5008, 9040, 9325
Type | Error or Warning |
---|---|
Log | Application |
Source | MSExchangeSA |
Issue | The Exchange System Attendant takes care of Exchange automation, such as rule processing and auto-replies. |
Impact | Various, depending on the exact issue reported in the event log. |
Resolution | Depends on the exact issue reported in the event log. |
All event IDs
Type | Error or Warning |
---|---|
Log | Application |
Source | POP3SVC |
Issue | POP3 is used by older email clients. If the POP3 service has errors, the email client may experience issues. |
Resolution | Depends on the exact issue reported in the event log. |
All event IDs
Type | Error or Warning |
---|---|
Log | Application |
Source | SMTPSvc |
Issue | SMTP is used by Exchange to send and receive emails. If there are issues with the SMTP service, the email flow will not be working. |
Resolution | Depends on the exact issue reported in the event log. |
Event ID 231
Type | Error |
---|---|
Log | Application |
Source | ExchangeStoreDB |
Further Information | Event ID: 231 |
Issue | The Exchange Store has exceeded its maximum size and will not mount. |
Impact | Exchange is down. |
Event ID 15002
Type | Warning |
---|---|
Log | Application |
Source | MSExchangeTransport |
Further Information | Event ID: 15002 |
Issue | Server is almost out of disk space. Email processing has stopped. |
Impact | Exchange has stopped processing email. Email is not being sent or received. |
Resolution | Free up disk space and restart the Transport service. |
Note | Back Pressure is Microsoft’s term for load. While it can be measured as part of the Exchange policy, generally the only way to resolve back pressure alerts is either to upgrade to more capable hardware or reduce server load. An Exchange server that enters this state may refuse new connections, causing issues with email delivery. |
Event ID 15004
Type | Error |
---|---|
Log | Application |
Source | MSExchangeTransport |
Further Information | Event ID: 15004 |
Issue | Server is out of resources. |
Impact | Exchange has stopped processing email. Email is not being sent or received. |
Resolution | Free up resources and restart the Transport service. |
Note | Back Pressure is Microsoft’s term for load. While it can be measured as part of the Exchange policy, generally the only way to resolve back pressure alerts is either to upgrade to more capable hardware or reduce server load. An Exchange server that enters this state may refuse new connections, causing issues with email delivery. |
Event ID 15006
Type | Error |
---|---|
Log | Application |
Source | MSExchangeTransport |
Further Information | Event ID: 15006 |
Issue | Server is almost out of disk space. Email processing has stopped. |
Impact | Exchange has stopped processing email. Email is not being sent or received. |
Resolution | Free up disk space and restart the Transport service. |
Note | Back Pressure is Microsoft’s term for load. While it can be measured as part of the Exchange policy, generally the only way to resolve back pressure alerts is either to upgrade to more capable hardware or reduce server load. An Exchange server that enters this state may refuse new connections, causing issues with email delivery. |
Event ID 24
Type | Error |
---|---|
Log | Application |
Source | MSExchange Web Services |
Further Information | Event ID: 24 |
Issue | Exchange certificate issue |
Note | Monitoring Exchange SSL certificate expiry is essential as an expired certificate may cause email issues for clients. |
Event ID 25
Type | Error or Warning |
---|---|
Log | Application |
Source | MSExchange Web Services |
Issue | Exchange certificate has expired or will expire soon. |
Note | Monitoring Exchange SSL certificate expiry is essential as an expired certificate may cause email issues for clients. |
Event ID 26
Type | Warning |
---|---|
Log | Application |
Source | MSExchange Web Services |
Further Information | Event ID: 26 |
Note | Monitoring Exchange SSL certificate expiry is essential as an expired certificate may cause email issues for clients. |
Event ID 12015
Type | Error |
---|---|
Log | Application |
Source | MSExchangeTransport |
Further Information | Event ID: 12015 |
Issue | A certificate has expired. Use the thumbprint to show which certificate has expired. |
Resolution | If a certificate is no longer used, detach it from all services rather than simply abandoning it, because an abandoned certificate causes this error. People often assign new certificates to services but fail to remove the old ones. This is the most common cause of this event. |
Note | Monitoring Exchange SSL certificate expiry is essential as an expired certificate may cause email issues for clients. |
Event ID 12018
Type | Warning |
---|---|
Log | Application |
Source | MSExchangeTransport |
Further Information | Event ID: 12018 |
Issue | A certificate will expire soon. This may log every hour or each time the Transport service starts. This warning cannot be disabled. |
Resolution | Replace the certificate as soon as possible. You do not have to wait for it to expire before you replace it, as any remaining time can be added to the new certificate if the new certificate request has been generated from the current server and the certificate has been requested from the same vendor. |
Note | Monitoring Exchange SSL certificate expiry is essential as an expired certificate may cause email issues for clients. |
Monitoring SQL servers can generate a lot of noise depending on how well the SQL instance has been configured. For example, if the memory available to SQL has not been capped, SQL may run the server out of RAM, causing errors to be raised by this monitor. Therefore, before using this policy, ensure that the SQL server is properly configured.
See below for information about the monitored event IDs.
NOTE The policy's Windows Performance Monitors are used for performance graphing. These monitors do not alert.
All event IDs
Type | Error or Warning |
---|---|
Log | Application |
Source | MSSQLSERVER |
All event IDs
Type | Error or Warning |
---|---|
Log | Application |
Source | SQL Server Report Service |
All event IDs
Type | Error or Warning |
---|---|
Log | Application |
Source | SQLBrowser |
All event IDs
Type | Error or Warning |
---|---|
Log | Application |
Source | SQLDUMPER |
All event IDs
Type | Error or Warning |
---|---|
Log | Application |
Source | SQLSERVERAGENT |
All event IDs
Type | Error or Warning |
---|---|
Log | Application |
Source | SQLVDI |
All event IDs
Type | Error or Warning |
---|---|
Log | Application |
Source | SQLWRITER |
This policy was set up to work alongside any other policy you might apply for the various roles the server is handling. The goal of this policy is to pay attention to the inner workings and health of the server itself, rather than to any of its duties.
NOTE The policy's Windows Performance Monitors are used for performance graphing. These monitors do not alert.
The following criteria are monitored in this policy:
Windows service monitoring
Of note in this policy is the Event Log Monitor which is configured to look for specific codes from the Windows Service Control Manager (SCM). These codes are raised when the SCM cannot start a particular service due to an error.
Event ID 7000
Type | Error |
---|---|
Log | System |
Source | Service Control Manager |
Further Information | Event ID: 7000 |
Issue | A service <named> failed to start. |
Event ID 7013
Type | Error |
---|---|
Log | System |
Source | Service Control Manager |
Further Information | Event ID: 7013 |
Issue | The username or password configured for a service is incorrect. The service failed to start. |
Event ID 7038
Type | Error |
---|---|
Log | System |
Source | Service Control Manager |
Further Information | Event ID: 7038 |
Issue | The username or password configured for a service is incorrect. The service failed to start. |
The Event Log Monitor works best when used alongside the Stopped Auto-Start Services Monitor component available for free from the ComStore. We would recommend that all users applying this policy perform this second step to facilitate full service monitoring. This additional Component Monitor does not need to be configured to raise tickets as some services are intended to start and stop automatically (for example, Windows Update).
The Component Monitor can be configured as follows:
Reboot required
Servers may need a reboot for various reasons, such as patch installs, software changes, etc. Therefore, we would also recommend that you use the policy alongside the Reboot Required Monitor component. It's available for free from the ComStore and can be configured as follows:
The monitor checks all the areas that indicate a reboot is required but only every four hours. It’s not necessary to check more frequently.
NOTE It is recommended to configure an alert email with the device’s hostname in the subject.
Bugcheck monitoring
A "bugcheck" is Microsoft’s official term for the Blue Screen of Death. The following monitors check whether a system has just rebooted as a result of a bugcheck screen. This is particularly useful in deployments where high server load is expected during the night when nobody is present to monitor the machine physically. For example, a backup may prompt a bugcheck. If the server has recovered, you would only see that the backup failed and investigate a backup issue rather than the real cause of the bugcheck.
Event ID 1000
Type | Error |
---|---|
Log | System |
Source | Save Dump |
Further Information | Event ID: 1000 |
Event ID 1001
Type | Information |
---|---|
Log | System |
Source | Save Dump |
Further Information | Event ID: 1001 |
Event ID 6008
Type | Error |
---|---|
Log | System |
Source | EventLog |
Further Information | Event ID: 6008 |
Disk health
The following monitors check if the disks installed in the system are functioning optimally and are not reporting any warnings. The checks are performed at the operating system level, not at the hardware level; that is, the monitors will read the standard event logs the operating system creates.
Event ID 7
Type | Error |
---|---|
Log | System |
Source | Disk |
Further Information | Event ID: 7 |
Event ID 33
Type | Warning |
---|---|
Log | System |
Source | Disk |
Further Information | Event ID: 33 |
Event ID 57
Type | Warning |
---|---|
Log | System |
Source | Disk |
Further Information | Event ID: 57 |
Event ID 55
Type | Error |
---|---|
Log | System |
Source | NTFS |
Further Information | Event ID: 55 |
Event ID 6
Type | Error |
---|---|
Log | System |
Source | Ftdisk |
Further Information | Event ID: 6 |
Event ID 57
Type | Warning |
---|---|
Log | System |
Source | Ftdisk |
Further Information | Event ID: 57 |
Two Disk Space Monitors are also included. One will alert at 90% as an advisory and the other at 98% as a critical alert but, of course, this functionality is configurable. For example, users running an Exchange server may wish to disable the RAM alerts.
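The two-tier disk space alerting described above can be sketched in Python as follows. This assumes the percentages refer to used space on a volume; the function and constant names are hypothetical, and this is not how the Datto RMM monitor itself is implemented.

```python
import shutil

# Thresholds from the text above (assumed to mean percent of space used).
ADVISORY_PERCENT = 90.0
CRITICAL_PERCENT = 98.0


def used_percent(path: str) -> float:
    """Return the percentage of used space on the volume containing path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100.0


def classify_disk_usage(percent_used: float) -> str:
    """Map a used-space percentage to 'ok', 'advisory', or 'critical'."""
    if percent_used >= CRITICAL_PERCENT:
        return "critical"
    if percent_used >= ADVISORY_PERCENT:
        return "advisory"
    return "ok"
```

For example, `classify_disk_usage(used_percent("C:\\"))` would classify the system drive on a Windows server. Checking the critical threshold first ensures that a volume at 98% or more is never reported as merely advisory.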
Server offline
The Online Status Monitor is configured to alert after fifteen minutes of downtime rather than a shorter interval. This gives the device time to restart itself in instances of expected downtime (for example, a reboot to install an update) and avoids notifying off-duty technicians about short-lived events such as brief power cuts. During the day, any unexpected server downtime can be expected to be caught by monitors interfacing with the hardware.
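The effect of the fifteen-minute window can be illustrated with a small Python sketch. The function name and the idea of comparing against a last-seen timestamp are assumptions made for illustration; the internal logic of the Online Status Monitor is not described in this document.

```python
from datetime import datetime, timedelta

# Downtime allowed before an alert is raised, as described above.
OFFLINE_THRESHOLD = timedelta(minutes=15)


def should_alert_offline(
    last_seen: datetime,
    now: datetime,
    threshold: timedelta = OFFLINE_THRESHOLD,
) -> bool:
    """Return True once the device has been silent for at least the threshold."""
    return now - last_seen >= threshold
```

With this logic, a server that reboots for an update and checks in again after five minutes never raises an alert, while one that stays silent for fifteen minutes or longer does.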
IP conflicts
IP conflicts are a potentially large source of issues in any networking scenario. In the best case, an IP conflict is just an inconvenience (for example, if it occurs between a printer and another device), but in the worst case it can cause disarray. It's therefore important to check the event logs for event IDs that indicate the server has detected an IP conflict with another device.
Event ID 4319
Type | Error |
---|---|
Log | System |
Source | NetBT |
Further Information | Event ID: 4319 |
Issue | Another device on the network has the same name as this device. |
Impact | Both devices may have trouble communicating on the network. File shares and printers may be inaccessible. |
Resolution | Check the server event log to identify the additional device with the same name and then change the other device name. You may also have to change internal DNS records to reflect the new name of the other device. Do not change the server name. Make sure the other device is not referenced by any other software and follow change control. It’s also possible that the server originally had a NIC Team and this has been subsequently broken. The server may now have two IP addresses, and the same NetBIOS name is being presented on both. |
Event ID 4198
Type | Error |
---|---|
Log | System |
Source | Tcpip |
Further Information | Event ID: 4198 |
Issue | Another device on the network has the same IP address as this device. |
Impact | Both devices may have trouble communicating on the network. File shares and printers may be inaccessible. The server may have disabled the network card to prevent further conflicts. |
Resolution | Check the server event log to identify the other device that has the same IP address as the server. Check that the remote device has not been given the IP address by the DHCP server. You may need to make an exclusion in DHCP. Change the IP address of the other device (do not change the server IP) and then check if you need to change any DNS entries for the device. There may be software configured to address the other device by IP, so make sure you also update any software including printer shares. Follow change control for this. |
Event ID 4199
Type | Error |
---|---|
Log | System |
Source | Tcpip |
Further Information | Event ID: 4199 |
Issue | Another device on the network has the same IP address as this device. |
Impact | Both devices may have trouble communicating on the network. File shares and printers may be inaccessible. |
Resolution | Check the server event log to identify the other device that has the same IP address as the server. Check that the remote device has not been given the IP address by the DHCP server. You may need to make an exclusion in DHCP. Change the IP address of the other device (do not change the server IP) and then check if you need to change any DNS entries for the device. There may be software configured to address the other device by IP, so make sure you also update any software including printer shares. Follow change control for this. |
Patching
We recommend using a standard Patch Monitor to trigger when a patch installation has failed. In some instances, this can get quite noisy if a server is particularly out of date. In such cases, it is advisable to remove the Patch Monitor until the device is manually updated and then re-implement it.
CPU and memory monitoring
The policy checks for 100% usage of RAM, as well as CPU. Such scenarios are rare in general usage, even on a high-load server. Therefore, both of these alerts should be treated as critical.
NOTE As with the Reboot Required Monitor, it is advisable to place the wildcard for the device’s hostname within the subject line of any alert email that has been configured for sending.
The goal with the Windows Workstation Monitoring Policy is to keep things as simple as possible and focus only on core Windows health metrics. The policy’s aim is simply to alert when the workstation is near (or past) the brink.
The following criteria are monitored in this policy:
Disk monitoring
We are monitoring for corrupt file systems and bad sectors as these would indicate that the HDD could be failing but would not be obvious to the end user. Catching these early and replacing the disk in a controlled manner is better than having a dead laptop / desktop that needs to be urgently rebuilt. See below for information about the monitored event IDs.
The Disk Space Monitor catches those workstations that have limited disk space on the system drive. If a device completely runs out of disk space, it may not boot, again leading to an urgent issue that could have been dealt with earlier in a controlled manner using a disk cleanup script as an auto-response.
Event ID 55
Type | Error |
---|---|
Log | System |
Source | NTFS |
Further Information | Event ID: 55 |
Event ID 7
Type | Error |
---|---|
Log | System |
Source | Disk |
Further Information | Event ID: 7 |
Event ID 33
Type | Warning |
---|---|
Log | System |
Source | Disk |
Further Information | Event ID: 33 |
Event ID 57
Type | Warning |
---|---|
Log | System |
Source | Disk |
Further Information | Event ID: 57 |
Event ID 6008
Type | Warning |
---|---|
Log | System |
Source | EventLog |
Further Information | Event ID: 6008 |
Service monitors
Monitoring key workstation services and auto-starting them if they stop keeps these workstations running and keeps the number of issues a service desk has to deal with to a minimum. Because the response to a stopped service is to start it again, these monitors are set to auto-resolve and are not generally allowed to create tickets.
The following services are monitored and auto-started:
Service | Description |
---|---|
EventLog | Allows events to be logged. |
W32time | Syncs local workstation time with either a domain controller or online time server. |
Spooler | Enables printing. |
LanManWorkstation | Provides access to network resources. |
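The auto-start behavior described above amounts to a simple check-and-start loop. The Python sketch below shows that logic with the service check and start operations passed in as functions, so it makes no assumptions about how services are actually queried or started on Windows. The names `is_running`, `start`, and `restart_stopped_services` are hypothetical.

```python
from typing import Callable, List, Sequence

# Services listed in the table above.
MONITORED_SERVICES = ("EventLog", "W32time", "Spooler", "LanManWorkstation")


def restart_stopped_services(
    is_running: Callable[[str], bool],
    start: Callable[[str], None],
    services: Sequence[str] = MONITORED_SERVICES,
) -> List[str]:
    """Start every monitored service that is not running.

    Returns the names of the services that were started, in table order.
    """
    started = []
    for name in services:
        if not is_running(name):
            start(name)
            started.append(name)
    return started
```

Returning the list of started services makes it straightforward to record what was auto-resolved without raising a ticket for each one.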