System monitoring with Nagios and SNMP

The server systems of m-privacy GmbH have sensors for NRPE-based monitoring systems (e.g. Nagios) or for SNMP-based monitoring systems. This allows important operating states to be checked remotely so that countermeasures can be taken before critical limit values are exceeded. The following list provides an overview of the implemented checks.

Not every system has every possible sensor, so not all checkpoints are necessarily active. The threshold values listed are defaults and can be changed if necessary.

Note

Before TightGate-Pro can be monitored by a monitoring system, monitoring must be activated as administrator config under Services > Nagios NRPE Support or Services > Start SNMP Service. In addition, the IP address of the monitoring server must be entered under Services > Maintenance and Updates > Remote Administrator IP.

To read out an individual checkpoint via NRPE, enter the following commands as root in a console on the monitoring server:

 cd /usr/lib/nagios/plugins/
 ./check_nrpe -H [IP address of TightGate-Pro] -c check_[name of checkpoint]

Example for the checkpoint maint:

 ./check_nrpe -H 192.168.4.1 -c check_maint
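
A monitoring script can wrap the call shown above. The following is a minimal sketch in Python, not part of TightGate-Pro. It assumes that check_nrpe follows the usual Nagios plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN). The function names build_nrpe_command, state_from_exit_code and run_check are hypothetical, and the sketch prepends check_ to the checkpoint name as in the command template.

```python
import subprocess

# Usual Nagios plugin exit codes (assumed to apply to check_nrpe here).
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

# Plugin directory as used in the commands above.
PLUGIN_DIR = "/usr/lib/nagios/plugins"


def build_nrpe_command(host, checkpoint):
    """Return the argv list for querying one checkpoint, e.g. 'maint'."""
    return [f"{PLUGIN_DIR}/check_nrpe", "-H", host, "-c", f"check_{checkpoint}"]


def state_from_exit_code(code):
    """Map a plugin exit code to a state name; unexpected codes count as UNKNOWN."""
    return NAGIOS_STATES.get(code, "UNKNOWN")


def run_check(host, checkpoint, timeout=30):
    """Run check_nrpe and return (state, first line of the plugin output)."""
    result = subprocess.run(
        build_nrpe_command(host, checkpoint),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    lines = result.stdout.strip().splitlines()
    return state_from_exit_code(result.returncode), (lines[0] if lines else "")
```

Calling run_check("192.168.4.1", "maint") only works on a host where check_nrpe is installed and whose IP address has been authorised on TightGate-Pro.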

To read out an individual check via SNMP, enter the following command on the monitoring computer:

 snmpget -v3 -u snmp-user -A [PASSWORD] -a SHA -l authNoPriv [IP address of TightGate-Pro] [single MIB or OID]
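
When the SNMP query is issued from a script, building the argument list explicitly avoids shell quoting problems with the password. The following is a minimal sketch, assuming the Net-SNMP snmpget tool is installed on the monitoring computer. The function names are hypothetical, the security level uses the authNoPriv spelling from the Net-SNMP documentation, and the OID in the usage note is the standard sysUpTime object, not a TightGate-Pro specific value.

```python
import subprocess


def build_snmpget_command(host, oid, password, user="snmp-user"):
    """Return the argv list for an SNMPv3 GET with SHA authentication and no encryption."""
    return [
        "snmpget", "-v3",
        "-u", user,
        "-A", password,
        "-a", "SHA",
        "-l", "authNoPriv",
        host, oid,
    ]


def snmp_get(host, oid, password, user="snmp-user", timeout=30):
    """Run snmpget and return its standard output as text."""
    result = subprocess.run(
        build_snmpget_command(host, oid, password, user),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if snmpget reports a failure
    )
    return result.stdout.strip()
```

For example, snmp_get("192.168.4.1", "1.3.6.1.2.1.1.3.0", password) would request the system uptime, provided the SNMP service has been started on TightGate-Pro.
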
| Checkpoint | Description | State OK | State Warning | State Critical | Action on Warning / Critical |
| --- | --- | --- | --- | --- | --- |
| maint | Checks whether a node is available and not in maintenance mode. If applicable, displays the time of a scheduled maintenance. | Node available and not in maintenance mode | Node in maintenance mode | - | Log in as administrator maint and exit maintenance mode after maintenance is complete. |
| load | Returns the average system load over the last 1, 5 and 15 minutes. | The load is lower than the value set by administrator config in the system preferences | The load is higher than the value set by administrator config in the system preferences, but less than twice that value | The load is higher than twice the value set by administrator config in the system preferences | Log in as administrator root and open a console. The command atop displays the process overview, indicating the load per process. The list can be sorted by load value by entering p in the window. Processes that cause particularly high load can be terminated using kill. Restarting the system can also help. In any case, if the system load is excessive, inform the technical customer service of m-privacy GmbH. |
| softmode | Checks whether the node is in softmode, i.e. in a state not protected by RSBAC. | Softmode is not activated | - | Softmode is activated | Deactivate softmode as user Security. |
| users | Checks against the maximum number of VNC connections (TightGate viewer) set as config and outputs the current number of viewer and lock connections. | < Max VNC | Above Max VNC but below Max VNC + 10 | > Max VNC + 10 | If the limit values are exceeded, performance losses are to be expected. |
| disks | Checks the free storage space on the hard disks. | > 20% free | Between 20% and 10% free | < 10% free | Open the status page of the affected system and check the usage of the mass storage. If space is short, check the user directories in /home in particular; old backups, for example, may be deleted. Also check the log files in /var/log. Oversized log files can be deleted to free up space on the disk. |
| disk_load | Disk activity of all available disks in %. | < 70% | Between 70% and 90% | > 90% | Problematic with HDDs, unlikely with SSDs. |
| zombie_procs | Counts zombie processes, which can indicate errors. | No zombie processes | Under 10 zombie processes | Over 10 zombie processes | Zombie processes can occur occasionally and usually do not affect system operation. Frequent occurrence of zombie processes indicates errors in file handling. It is recommended to inform the technical customer service of m-privacy GmbH. |
| total_procs | Checks the number of running processes. | < 4,000 | Between 4,000 and 10,000 | > 10,000 | Restarting the system may reduce the number of running processes. Note: This checkpoint is of limited significance, as a warning is only issued at very high values. |
| swap | Checks the free swap space and returns the configured maximum value and the free amount. | > 50% of the maximum value free | Between 50% and 20% of the maximum value free | < 20% of the maximum value free | If the limit values are exceeded permanently, first take measures to reduce the load (e.g. use of browser add-ons such as "Flashblock", "AdBlock" and the like). Extending the RAM can also help. It is recommended to discuss the measures with the technical customer service of m-privacy GmbH. |
| ntp | Checks the reachability of NTP time servers and displays deviations from the local system time. | Time difference < 60 seconds | Time difference between 60 and 120 seconds | Not reachable or time difference > 120 seconds | Especially in cluster systems, all nodes must have the same system time. If the time difference to the reference of the configured NTP server is > 1 minute, action is required! Log in as administrator config and use the menu item Check network to verify the problem and adjust the time if necessary. If necessary, configure an alternative external NTP server to ensure proper system operation. |
| memavailable | Displays the available memory in kByte. | Over 1,000,000 (1 GB RAM) | Between 1,000,000 and 100,000 | Under 100,000 (100 MB RAM) | Increase the memory or reduce the number of users on the server. |
| memorypressurekilled | Number of user sessions that were automatically logged off due to acute memory shortage within the last 24 hours. | 0 | Value greater than 0 | - | Increase the memory or reduce the number of users on the server. |
| pressure_cpu | Checks whether requests are delayed due to a CPU bottleneck. | Delays < 20% of all requests | Delays between 20% and 50% of all requests | Delays > 50% of all requests | Reduce the number of authorised users on the node. |
| pressure_io | Checks whether requests are delayed due to a read/write bottleneck or network bottlenecks. | Delays < 20% of all requests | Delays between 20% and 50% of all requests | Delays > 50% of all requests | If SSDs are used, bottlenecks usually occur in connection with network bottlenecks. |
| pressure_memory | Checks whether requests are delayed due to a memory bottleneck. | Delays < 2% of all requests | Delays between 2% and 10% of all requests | Delays > 10% of all requests | Extend the available RAM or reduce the number of authorised users on the node. |
| ssh | Checks the reachability of a Secure Shell and returns the SSH version. | Reachable | - | Not reachable | If SSH is reported as unreachable, the administrator config should first execute an Apply. If SSH is still reported as unreachable, the system must be restarted in recover mode. In this case, it is recommended to contact the technical customer service of m-privacy GmbH. |
| dns | Checks the configured DNS server. Returns the IP address and the response time of the DNS server. | Resolution of the IP address possible | - | Resolution of the IP address not possible | Check the DNS server; enter an alternative DNS server if necessary. |
| bug | Searches the kern.log file for keywords that indicate kernel errors. | No errors found | - | Error found | Inform the technical customer service of m-privacy GmbH. |
| cron | Checks the number of running cron jobs. | 1 to 10 cron jobs | Between 11 and 20 cron jobs | > 20 or no cron jobs | Log in as administrator root and open a console. The command pstree -ah locates the blocked cron job. Check the services involved and take appropriate action, e.g. an Apply as administrator config or a restart of the system. |
| versions | Compares the installed software version with the currently available software version. Note: This check can be run directly at most twice a day. Each additional call returns the last result with the note "(cached)". To force a fresh run, call up "Available updates" once beforehand as administrator update (do not forget to log off again). The check is then run again. | No newer version available | Updates available | Updates available for more than 6 months | Log in as administrator update and perform an autoupdate. |
| vnc | Checks the reachability of the VNC server and returns its response time and the configured port. | Reachable | - | Not reachable | If VNC is activated in the configuration and is still reported as unreachable, the administrator config should first execute a Full Apply. If VNC is still shown as unreachable, the system must be restarted in recover mode. In this case, it is advisable to consult the technical customer service of m-privacy GmbH. |
| diskerror | Searches the kern.log file for keywords that indicate hard disk errors. | No errors found | - | Errors found | Warnings indicate faulty hard disks. This can lead to data inconsistencies or data loss. Please contact the technical customer service of m-privacy GmbH. |
| licence | Checks for a valid licence and returns the number of licences used and the expiry date. | Licence valid | - | Licence invalid | The licence must be renewed via the technical customer service of m-privacy GmbH. |
| apply | Checks whether an Apply as administrator config is necessary. | No Apply necessary | - | Apply necessary | If Nagios signals that an Apply is necessary, log in as administrator config and execute an Apply. |
| slabs | Checks the memory areas (slabs) in the kernel. | < 10 million | Between 10 and 100 million | > 100 million | Indicates memory leaks and kernel errors. |
| backup | Checks for an existing backup and any errors that may have occurred. Returns the date and time of the last backup created, if found. | Backup exists and is free of errors | Backup is faulty or no automatic backup has been configured | Backup does not exist or service is not available | Log in as administrator backuser and check the log for errors. It can be called up with the command Show last log. Check whether inappropriate settings have been selected as administrator backuser under Configuration > Frequency. Then check in the log, for example, whether a backup was created and look for errors if necessary. |
| smart_sd* | Checks the SMART status of the respective hard disk and returns the detected status. Replace the * character with the respective drive letter. | Hard disk OK + current temperature | Temperature > 45 °C | Temperature > 50 °C | If the temperature is too high, check the cooling of the system. If the hard disk is not OK, the errors of the S.M.A.R.T. check of the disk are also output. Possible measures are starting the system from the rescue system or running fsck. |
| check_definedusers | Checks the number of users created in TightGate-Pro and shows how many user IDs currently exist. | At least 5 new user IDs can still be created | At most 5 new user IDs can still be created | At most one new user ID can still be created, or the maximum number of user IDs has already been reached | Please purchase additional licences for TightGate-Pro. |
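
The three-level thresholds in the table above can be expressed as small classification functions. The following is a minimal sketch for the load, disks and swap checkpoints, not the implementation used by TightGate-Pro. The function names are hypothetical, and the handling of values that lie exactly on a limit is an assumption, since the table does not specify it.

```python
def classify_load(load, limit):
    """Classify a load average against the limit configured by administrator config.

    Below the limit is OK, below twice the limit is WARNING, otherwise CRITICAL.
    Treating a value exactly on a limit as the higher state is an assumption.
    """
    if load < limit:
        return "OK"
    if load < 2 * limit:
        return "WARNING"
    return "CRITICAL"


def classify_free_percent(free_percent, warn_at=20.0, crit_below=10.0):
    """Classify a free-space percentage (defaults match the disks checkpoint).

    More than warn_at is OK, less than crit_below is CRITICAL, anything in
    between is WARNING. For swap, pass warn_at=50 and crit_below=20.
    """
    if free_percent > warn_at:
        return "OK"
    if free_percent >= crit_below:
        return "WARNING"
    return "CRITICAL"
```

With a configured load limit of 4, a load of 5 is classified as WARNING and a load of 9 as CRITICAL; 35% free swap is classified as WARNING when called with warn_at=50 and crit_below=20.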

Depending on the system configuration, optional checkpoints can be used to monitor specific processes.

All Nagios checkpoints are provided separately for each Ceph server in use. The following table lists all checks for the first Ceph server. The checkpoints for the second and further Ceph servers are used in the same way, with the number in the checkpoint name incremented accordingly.
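
The naming rule for further Ceph servers can be sketched as follows. This is a minimal illustration that assumes the pattern check_ceph_hu_<number>_<suffix> and takes the suffixes from the checkpoints listed for the first Ceph server; the function name ceph_checkpoints and its drive_letters parameter are hypothetical.

```python
# Suffixes of the per-server Ceph checkpoints listed for the first Ceph server.
CEPH_SUFFIXES = ["disks", "disk_load", "zombie_procs", "ntp", "ssh", "cron", "raid", "ceph"]


def ceph_checkpoints(server_number, drive_letters=()):
    """Return the checkpoint names for the given Ceph server (1 = first server).

    drive_letters lists the disks to monitor via SMART, e.g. "ab" for sda and sdb.
    """
    if server_number < 1:
        raise ValueError("server_number must be 1 or greater")
    names = [f"check_ceph_hu_{server_number}_{suffix}" for suffix in CEPH_SUFFIXES]
    names += [f"check_ceph_hu_{server_number}_smart_sd{letter}" for letter in drive_letters]
    return names
```

For the second Ceph server, ceph_checkpoints(2) starts with check_ceph_hu_2_disks; which disks exist on a given server has to be determined on site.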

| Checkpoint | Description | State OK | State Warning | State Critical | Action on Warning / Critical |
| --- | --- | --- | --- | --- | --- |
| homeusermount | Checks whether /home/user is mounted in the directory tree. Returns the path of /home/user. | Mounted | - | Not mounted | Check the hard disk; if necessary, mount the user directories manually as a test. It could also be a file system error, so notifying the technical customer service of m-privacy GmbH is recommended. |
| backupmount | Checks whether /home/backuser/backup has been correctly mounted in the directory tree. | Mounted | - | Not mounted | Check the hard disk; if necessary, mount the user directories manually as a test. This could be a file system error, so notifying the technical customer service of m-privacy GmbH is recommended. |
| ceph | Checks the Ceph status of the internal Ceph data storage. | Ceph is running normally | There are errors | There are errors | Notify the technical customer service of m-privacy GmbH. |
| check_ceph_hu_1_disks | Checks the free storage space on the hard disks of the first Ceph server. | > 20% free | Between 20% and 10% free | < 10% free | If the storage is full, please contact the technical customer service of m-privacy GmbH. |
| check_ceph_hu_1_disk_load | Disk activity of all available disks in %. | < 70% | Between 70% and 90% | > 90% | Problematic with HDDs. |
| check_ceph_hu_1_zombie_procs | Counts zombie processes, which can indicate errors. | No zombie processes | Under 10 zombie processes | Over 10 zombie processes | Zombie processes may occur occasionally and usually do not affect system operation. Frequent occurrence of zombie processes indicates errors in file handling. It is recommended to inform the technical customer service of m-privacy GmbH. |
| check_ceph_hu_1_ntp | Checks the reachability of NTP time servers and displays deviations from the local system time. | Time difference < 60 seconds | Time difference between 60 and 120 seconds | Not reachable or time difference > 120 seconds | In the event of deviations, it is essential to restore synchronisation, otherwise cluster failures may occur. |
| check_ceph_hu_1_ssh | Checks the reachability of a Secure Shell and returns the SSH version. | Reachable | - | Not reachable | If SSH is reported as unreachable, the administrator config should first execute an Apply. If necessary, contact the technical customer service of m-privacy GmbH. |
| check_ceph_hu_1_cron | Checks the number of running cron jobs. | 1 to 10 cron jobs running | 11 to 20 cron jobs running | > 20 or no cron jobs running | - |
| check_ceph_hu_1_raid | Checks for the presence of a software RAID. | RAID is running without errors | RAID is being synchronised | Disks are missing in the RAID | If errors occur, check the RAID. Individual disks may be defective. |
| check_ceph_hu_1_ceph | Displays the HEALTH status of the entire external Ceph. | Ceph is OK | Ceph has a problem | Ceph is not intact | Depending on the problem, the Ceph error messages must be dealt with individually. If necessary, contact the technical customer service of m-privacy GmbH. |
| check_ceph_hu_1_smart_sd* | Checks the SMART status of the respective hard disk and returns the detected status. Replace the * character with the respective drive letter. | Hard disk OK + current temperature | Temperature > 45 °C | Temperature > 50 °C | If the hard disk is too hot, check the fan settings or the air flow in the server. |
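
The NTP rule from the table above combines a reachability test with two time thresholds. A minimal sketch, assuming the measured offset is available in seconds and that None stands for an unreachable time server; the function name classify_ntp_offset is hypothetical, and counting exactly 120 seconds as WARNING is an assumption.

```python
def classify_ntp_offset(offset_seconds):
    """Classify the deviation between NTP reference time and local system time.

    None means the NTP server could not be reached, which counts as CRITICAL.
    The sign of the offset is ignored; only its magnitude matters.
    """
    if offset_seconds is None:
        return "CRITICAL"
    difference = abs(offset_seconds)
    if difference < 60:
        return "OK"
    if difference <= 120:
        return "WARNING"
    return "CRITICAL"
```

A node whose clock runs 90 seconds behind the reference is therefore reported as WARNING, which in a cluster already calls for restoring synchronisation.
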
| Checkpoint | Description | State OK | State Warning | State Critical | Action on Warning / Critical |
| --- | --- | --- | --- | --- | --- |
| backup | Checks for an existing backup and any errors that may have occurred. Returns the date and time of the last backup created, if found. | Backup exists and is free of errors | Backup is faulty | Backup does not exist or service is not available | Log in as administrator backuser and check the log for errors. It can be called up with the command Display last log. Check whether inappropriate settings have been selected as administrator backuser under Configuration > Frequency. Then check in the log, for example, whether a backup was created and look for errors if necessary. |
| scanner | Checks whether the virus scanner's malware definitions are up to date and whether the virus scanner is running. | Definitions current (not older than 2 days) | Definitions older than 2 days but younger than 1 week | Virus scanner is not running, no definitions are available, or the definitions are older than 1 week | Update the virus definitions according to the administration manual. Configure the scanner correctly as administrator config according to the administration manual. |
| sensors | Checks the hard disk temperature. | Temperature below 110 °C | Temperature above 110 °C and below 120 °C | Temperature above 120 °C | There is a danger of overheating. Check whether the fans are working properly. If necessary, adjust the settings in the BIOS of the server. Also check that the air flow around the server is unobstructed. |
| raid | Checks for the presence of a software RAID. | RAID is running without errors | RAID is being synchronised | Disks are missing in the RAID | If errors occur, check the RAID. Individual disks may be defective. |
| squid | Checks the availability of the proxy server and displays the response time and the connection port. | All OK | Port not reachable | Not reachable | If the port cannot be reached, check whether the service is running. |
| http | Checks the availability of the HTTP protocol and outputs the response time. | All OK | - | Port not reachable | If the port cannot be reached, check whether the service is running. |
| temp | Checks the temperature of the mainboard (if a sensor is present) and outputs it. | < 50 °C | 50 °C to 60 °C | > 60 °C | If the temperature is exceeded, check the entire cooling system of the hardware (fan, heat sink, air ducts, etc.) and the air conditioning of the operating environment. |
| fan | Checks whether a fan is running (if a sensor is present). | Running | - | Not running | Check the hardware if a problem is reported. |
| timedupdate | Checks whether an automatic update is scheduled. The checkpoint only provides informative values for the planned update time. | - | - | - | - |
| identd | Checks the ident daemon for logging of proxy connections. | OK | No logging configured, but proxy is running | Logging is configured, but proxy is not running | Correct the settings or restart the service via an Apply as config. |
| adldap | Checks the reachability of the LDAP server / AD server used for user administration. Indicates errors when using Active Directory or LDAP servers. | - | - | - | Take measures according to the instructions output by the check. |
| nodesavail | Checks the availability of all nodes within a cluster of TightGate-Pro systems. | All nodes are available | Fewer nodes are available than defined, but the minimum number is still present | No nodes are reachable/available | Informative. |
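
The scanner checkpoint combines a process state with the age of the malware definitions. A minimal sketch of that rule, assuming the timestamp of the last definitions update is known; the function name classify_scanner is hypothetical, and treating an age of exactly one week as CRITICAL is an assumption.

```python
from datetime import datetime, timedelta


def classify_scanner(running, definitions_date, now):
    """Classify the virus scanner state from its run state and definition age.

    running: True if the scanner process is running.
    definitions_date: datetime of the last definitions update, or None if missing.
    now: reference datetime, passed in explicitly to keep the function testable.
    """
    if not running or definitions_date is None:
        return "CRITICAL"
    age = now - definitions_date
    if age <= timedelta(days=2):
        return "OK"
    if age < timedelta(weeks=1):
        return "WARNING"
    return "CRITICAL"
```

Definitions that are five days old with a running scanner yield WARNING; a stopped scanner yields CRITICAL regardless of the definition age.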