Monitoring quality of service

Quality of Service (QoS) monitors the general operation of a Domino® server to keep that server running reliably and available. If QoS detects that a server is hung or not responding, QoS probing can be configured to email an administrator about the problem, to automatically terminate and restart the server, or both. QoS log information can also be useful for analysis by IBM® Support.

About this task

CAUTION: QoS and fault recovery should not be enabled at the same time.
Important: If QoS starts or restarts a server whose server.id file is password-protected, the server does not start until an administrator connects to the console on that server and enters the password. Therefore, if you want QoS to be able to start or restart Domino without intervention on a specific server, for example at times when no administrator is available to enter the password manually, do not use a password on the server.id file on that server.

QoS requires that the Domino® server run under the Java controller, using the Java console.

The qosprobe add-in task can be configured with the following settings on the Domino® server in the server NOTES.INI file:
  • QOS_PROBE_INTERVAL=n

    The probe interval in minutes. This value can be set in the NOTES.INI file. The default is 1 minute.

  • QOS_PROBE_TIMEOUT=n

    The probe timeout in minutes. This value can be set in the dcontroller.ini file. The default is 5 minutes.

Tip: QOS_PROBE_TIMEOUT should be much greater than QOS_PROBE_INTERVAL. If the timeout expires before the probe is due to respond, the server is restarted repeatedly.
The server controller monitors a message queue to which the qosprobe add-in reports its probing results (SUCCESS, ERROR, or TIMEOUT). The messages are captured in the qoscntrlrtimestamp.out file in the server data directory. The following is an example of a SUCCESS message:
2010/01/07 07:42:56 QoS Probe: SUCCESS (88ms)
The following is an example of an error message:
2010/01/07 08:05:59 QoS Probe: ERROR: ProbeError=4803
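The two sample lines above can be read programmatically, for example to count probe errors over time. The following Python sketch is illustrative only: the pattern is inferred from these two samples, the function name parse_probe_line is hypothetical, and real log files may contain other line formats that this pattern does not cover.

```python
import re
from datetime import datetime

# Pattern inferred from the two sample lines in this topic (an assumption,
# not a documented log format).
PROBE_LINE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) QoS Probe: "
    r"(?:SUCCESS \((?P<ms>\d+)ms\)|ERROR: ProbeError=(?P<code>\d+))$"
)

def parse_probe_line(line):
    """Return a dict describing a probe result line, or None if it does not match."""
    m = PROBE_LINE.match(line.strip())
    if m is None:
        return None
    result = {"timestamp": datetime.strptime(m.group("ts"), "%Y/%m/%d %H:%M:%S")}
    if m.group("ms") is not None:
        # SUCCESS lines carry the probe's elapsed time in milliseconds.
        result.update(status="SUCCESS", elapsed_ms=int(m.group("ms")))
    else:
        # ERROR lines carry a numeric ProbeError code.
        result.update(status="ERROR", probe_error=int(m.group("code")))
    return result
```

Lines that do not match, such as controller messages, return None, so a caller can skip them.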
When QoS is enabled on the server, the controller responds to a TIMEOUT by performing a smart kill of the server and restarting it. A timeout can happen in either of the following cases:
  • The NSFDbOpen or NIFOpenCollection calls used by the probe return Domino's ERR_TIMEOUT error. This error is sent to the controller and a smart kill/restart is initiated.
  • The controller does not receive a message from qosprobe within the timeout period (QOS_PROBE_TIMEOUT). This can happen in one of the following ways:
      • qosprobe was told to quit ('tell qosprobe quit') or is not running.
      • qosprobe becomes hung while probing.

If the controller receives a probe timeout, it might not initiate a server kill/restart if long-running or load-intensive operations are in progress (and thus might have caused the probe to time out). These operations include BACKUP, COMPACT, DBCOPY, FIXUP, and DBPURGE. In these cases, messages like the following appear in the qoscntrlrtimestamp.out file:

2010/01/07 07:42:56 QoS Controller: The controller has received a probe timeout.
2010/01/07 07:42:56 QoS Controller: There are long running applications - probing will pause until they have completed.

If this condition is detected, the controller allows the lengthy ("long-running") operation more time to complete. If any lengthy operation fails to complete within that additional time, the controller proceeds with the smart kill/restart. A message like the following appears in the qoscntrlrtimestamp.out file:

2010/01/07 07:42:56 QoS Controller: Applications are not making progress.
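The timeout handling described above can be summarized as a small decision function. This Python sketch is a simplified model, not the controller's actual logic: the function name, the Action values, and the grace_expired flag are all hypothetical, and only the list of operations comes from this topic.

```python
from enum import Enum

class Action(Enum):
    CONTINUE = "continue probing"
    PAUSE = "pause probing"
    KILL_RESTART = "smart kill and restart"

# Operations this topic lists as long running or load intensive.
LONG_RUNNING = {"BACKUP", "COMPACT", "DBCOPY", "FIXUP", "DBPURGE"}

def on_probe_result(timed_out, running_ops, grace_expired):
    """Hypothetical model of the controller's reaction to a probe result."""
    if not timed_out:
        return Action.CONTINUE
    # A timeout during a long-running operation pauses probing first.
    busy = LONG_RUNNING & {op.upper() for op in running_ops}
    if busy and not grace_expired:
        return Action.PAUSE
    # No long-running operation, or the extra time ran out.
    return Action.KILL_RESTART
```

For example, on_probe_result(True, ["compact"], False) models the "probing will pause" message, and the same call with grace_expired=True models "Applications are not making progress."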
Important: For each of the following six NOTES.INI settings, the default value applies if you do not configure the setting or if you set it lower than the default. You can only change a value to be greater than its default.
  • QOS_PROBE_INTERVAL
  • QOS_PROBE_TIMEOUT
  • QOS_RESTART_LIMIT_PERIOD
  • QOS_SHUTDOWN_TIMEOUT
  • QOS_RESTART_TIMEOUT
  • QOS_APPS_TIMEOUT
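The rule for these settings amounts to taking the larger of the configured value and the default. The following Python sketch illustrates that rule under stated assumptions: the function name effective_value is hypothetical, and only the two defaults given in this topic (1 and 5 minutes) are included because the defaults for the other four settings are not stated here.

```python
# Defaults stated in this topic, in minutes. Defaults for the other four
# settings are not given here, so they are deliberately left out.
KNOWN_DEFAULTS = {"QOS_PROBE_INTERVAL": 1, "QOS_PROBE_TIMEOUT": 5}

def effective_value(name, configured=None):
    """Return the value that applies for a setting (hypothetical helper)."""
    default = KNOWN_DEFAULTS[name]
    # An unset value, or one below the default, falls back to the default.
    if configured is None or configured < default:
        return default
    return configured
```

Under this model, setting QOS_PROBE_TIMEOUT=3 has no effect because 3 is below the default of 5, while QOS_PROBE_TIMEOUT=15 takes effect.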

Procedure

Perform the following tasks: