
Originally posted on the VMware vSphere Blog.
If you’ve done any research into the high-availability options available for vCenter Server 6.0, hopefully you have had a chance to read the VMware vCenter Server 6.0 Availability Guide written in collaboration with Technical Marketing and Global Support Services as well as KB 1024051. And you might have noticed particular sections that refer to the vCenter Server Watchdog. But what exactly is the vCenter Server Watchdog?
Enabled “out of the box” in 6.0, the vCenter Server Watchdog provides better availability by periodically verifying the status of vCenter Server. It does this in two ways:
If any services fail, the Watchdog attempts to restart them. If it cannot restart the service because of a host failure, vSphere HA restarts the virtual machine running the service on a new host.
That’s sounds slick, right? Well, let’s dive in and take a look at each of these watchdogs in detail.
First up, is the PID Watchdog. A watchdog initializes alongside each vCenter Server service at runtime. The PID Watchdog only monitors services that are actively running. Meaning, that once a service is gracefuly stopped, the watchdog will no longer monitor or restart the service. The PID Watchdog detects only that a process with the correct executable is in the process table but it does not determine if the process is ready to service requests (e.g. vSphere Web Client.) – that is left to the API Watchdog (more on that later).
There are four PID Watchdogs found vCenter Server 6.0 depending on the service it protects and the platform form factor:
Each vCenter Server process has a separate PID Watchdog process associated with it. In this post, we will take a look at those in the vCenter Server Appliance.
This watchdog is a shell script (/usr/bin/watchdog) found on the VCSA that is used to detect a service failure in non-Java (C) based services on the appliance form factor. A service start automatically starts the Watchdog along with the service itself. Let’s search for the running processes that match for “vmware-watchdog.”
mgmt01vc01.sddc.local:~ # ps -ef | grep vmware-watchdog
root 5767 1 0 16:20 ? 00:00:00 /bin/sh /usr/bin/vmware-watchdog -s rhttpproxy -u 30 -q 5 /usr/sbin/rhttpproxy -r /etc/vmware-rhttpproxy/config.xml -d /etc/vmware-rhttpproxy/endpoints.conf.d -f /etc/vmware-rhttpproxy/endpo
root 7930 1 0 16:21 ? 00:00:00 /bin/sh /usr/bin/vmware-watchdog -s vws -u 30 -q 5 /usr/lib/vmware-vws/bin/vws.sh
root 8109 1 0 16:21 ? 00:00:11 /bin/sh /usr/bin/vmware-watchdog -s syslog -u 30 -q 5 -b /var/run/rsyslogd.pid /sbin/rsyslogd -c 5 -f /etc/vmware-rsyslog.conf
root 8266 1 0 16:21 ? 00:00:11 /bin/sh /usr/bin/vmware-watchdog -b /storage/db/vpostgres/postmaster.pid -u 300 -q 2 -s vmware-vpostgres su -s /bin/bash vpostgres -c 'LD_LIBRARY_PATH=/opt/vmware/vpostgres/curre
root 9422 1 0 16:21 ? 00:00:00 /bin/sh /usr/bin/vmware-watchdog -a -s vpxd -u 3600 -q 2 /usr/sbin/vpxd
root 13113 1 0 16:22 ? 00:00:00 /bin/sh /usr/bin/vmware-watchdog -s vsan-health -u 600 -q 10 su vsan-health -c '/opt/vmware/bin/vmware-vsan-health /usr/lib/vmware-vpx/vsan-health/VsanHealthServer.py -p 8006'
root 28775 19850 0 20:42 pts/0 00:00:00 grep vmware-watchdog
Let’s break that down a bit into something more readable:
Service | Process Name | Virtual Machine Restart? | Minimal Uptime | Number of Restarts |
---|---|---|---|---|
Reverse Proxy | rhttpproxy | No | 30 seconds | 5 |
vCenter Management Web Service | vws | No | 30 seconds | 5 |
Syslog Collector | Syslog | No | 30 seconds | 5 |
vPostgres Database | vmware-vpostgres | No | 5 minutes | 2 |
vCenter Server | vpxd | Yes | 1 hour | 2 |
VSAN Health Check | vsan-health | No | 10 minutes | 10 |
As an example, here we can see that vmware-watchdog is running with a couple of parameters, which differ for each service process. Let’s dig into the VPXD process since it’s the most important service. It shows the following parameters:
-a
-s vpxd
-u 3600
-q 2
What the above process parameters result in is the following: the service, named vpxd (-s vpxd) is started, is monitored for failures and will be restarted twice (-q 2) at most. If it fails for a third time within a minimal uptime of 3600 seconds/one hour (-u 3600) the virtual machine will be restarted/rebooted (-a).
A full list of the parameters that may be used by the vmware-watchdog is provided below:
-d = DAEMONIZE
-n = QUIET
-b = BG_PROC-s = START
-k = STOP
-r = QUERY
-a = REBOOT_FLAG
-u = MIN_UPTIME
-q = MAX_QUICK_FAILURES
-t = MAX_TOTAL_FAILURES
-i = IMMORTAL
-c = CLEANUP_CMD
-f = EXIT_CLEANUP_CMD
The Java Service Wrapper is a watchdog used to detect service failures and restart Java based services. It is based off the Tanuki Java Service Wrapper, a 3rd party service wrapper that enables a Java Application to be run as a Windows Service or UNIX Daemon and allows for the health monitoring an application and JVM. Let’s search for the running processes that match for “tanuki.”
mgmt01vc01.sddc.local:~ # ps -ef | grep tanuki
cm 6331 6324 0 16:20 ? 00:00:37 /usr/java/jre-vmware/bin/vmware-cm -Dorg.tanukisoftware.wrapper.WrapperSimpleApp.waitForStartMain=FALSE -XX:+ForceTimeHighResolution -Dlog4j.configuration=cm-log4j.properties -Dxml.config=ht
root 6876 6869 0 16:20 ? 00:00:50 /usr/java/jre-vmware/bin/vmware-cis-license -Dorg.tanukisoftware.wrapper.WrapperSimpleApp.waitForStartMain=FALSE -Dorg.apache.catalina.startup.EXIT_ON_INIT_FAILURE=TRUE -XX:+ForceTimeHighRes
root 7267 7258 1 16:21 ? 00:03:45 /usr/java/jre-vmware/bin/vmware-sca -Dorg.tanukisoftware.wrapper.WrapperSimpleApp.waitForStartMain=FALSE -XX:+ForceTimeHighResolution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/lo
root 7634 7627 0 16:21 ? 00:00:29 /usr/java/jre-vmware/bin/vmware-syslog-health -Dorg.tanukisoftware.wrapper.WrapperSimpleApp.waitForStartMain=FALSE -XX:+ForceTimeHighResolution -DlogDir=/var/log/vmware/syslog -Dlog4j.config
1002 7843 7836 0 16:21 ? 00:01:08 /usr/java/jre-vmware/bin/vmware-vapi-endpoint -Dorg.tanukisoftware.wrapper.WrapperSimpleApp.waitForStartMain=FALSE -XX:+ForceTimeHighResolution -Dlog4j.configuration=file:/etc/vmware-vapi//e
root 8453 8433 1 16:21 ? 00:02:55 /usr/java/jre-vmware/bin/vmware-invsvc -Dorg.tanukisoftware.wrapper.WrapperSimpleApp.waitForStartMain=FALSE -Dvim.logdir=/var/log/vmware/invsvc -XX:+ForceTimeHighResolution -XX:+ParallelRefP
root 10410 10388 0 16:22 ? 00:00:54 /usr/java/jre-vmware/bin/vmware-sps -Dorg.tanukisoftware.wrapper.WrapperSimpleApp.waitForStartMain=FALSE -Dxml.config=../conf/sps-spring-config.xml -Dpbm.config=../conf/pbm-spring-config.xml
1007 11099 11088 0 16:22 ? 00:01:27 /usr/java/jre-vmware/bin/vmware-vpx-workflow -Dorg.tanukisoftware.wrapper.WrapperSimpleApp.waitForStartMain=FALSE -XX:+ForceTimeHighResolution -ea -Dlog4j.configuration=conf/workflow.log4j.p
root 23423 19850 0 20:41 pts/0 00:00:00 grep tanuki
When a Java-based services for vCenter Server starts, it automatically starts a wrapper process to monitor the service process and its JVM. The wrapper process restarts the JVM if it crashed and if the wrapper process crashes.
Now, let’s dig into the Likewise Service Manager watchdog. The a 3rd party Likewise Open stack from BeyondTrust includes the Likewise Service Manager and is instantiated by the lwsmd daemon. Aside from the services that come with the Likewise stack (such as lsass, netlogon, lwio, …), the Likewise Service anager is responsible for the VMware Directory Service (vmdir), VMware Authentication Framework (vmafd, which contains VECS), and VMware Certificate Authority (vmca).
Likewise Service Manager monitors the processes it starts and will restart them if they terminate unexpectedly. This means if a service like vmdir crashes, exits due to an error, or is told to terminate it by a process outside of the Likewise Service Manager, it will be restarted.
The command to list, start, stop and restart services managed by the Likewise Service Manager is ‘/opt/likewise/bin/lwsm’.
Let’s list all the processes managed by Likewise Service Manager that match for “vm” and their status:
mgmt01vc01.sddc.local:~ # /opt/likewise/bin/lwsm list | grep vm
vmafd running (standalone: 5505)
vmca running (standalone: 5560)
vmdir running (standalone: 5600)
Here we see that VMware Authentication Framework (+VECS), VMware Certificate Authority and VMware Directory Services are up and running.
Additional commands for Likewise Service Manager daemon include:
list List all known services and their status
autostart Start all services configured for autostart
start-only Start a service
start service Start a service and all dependencies
stop-only service Stop a service
stop service Stop a service and all running dependents
restart service Restart a service and all running dependents
refresh service Refresh service configuration
proxy service Act as a proxy process for a service
info service Get information about a service
status service Get the status of a service
gdb service Attach gdb to the specified running service
Now let’s take a look at the info for the VMware Directory Service using the info command:
mgmt01vc01.sddc.local:~ # /opt/likewise/bin/lwsm info vmdir
Service: vmdir
Description: VMware Directory Service
Type: executable
Autostart: yes
Path: /usr/lib/vmware-vmdir/sbin/vmdird
Arguments: '/usr/lib/vmware-vmdir/sbin/vmdird' '-s' '-l' '0' '-f' '/usr/lib/vmware-vmdir/share/config/vmdirschema.ldif'
Environment:
Dependencies: lsass dcerpc vmafd
Notice that Likewise Service Manager is also aware of any dependencies and will stop / start those as needed.
Next up, is the API Watchdog. This watchdog checks the status of APIs for the VPXD service. If the APIs are not running, the API Watchdog will attempt to restart the service two times. If the service restarts do not solve the issue, the API Watchdog will call for the restart of the virtual machine.
During an initial deployment of the VCSA – new or upgrade- the API Watchdog is in a maintenance mode and will only be active after all of the ‘firstboots’ are completed and all services have come online. Firstboots are the scripts where vCenter services are injected into the VCSA during the appliance deployment. After which, every 5 minutes the watchdog is invoked to verify that the API for VPXD is accessible. Essentially this watchdog is using the VIM API to authenticate to VPXD, request the Tag type associated to the rootFolder property on the Folder managed object. If the check fails the API Watchdog will call the PID Watchdog to restart VPXD. (Note: For a vCenter Server deployment on Windows you must reboot . )
Thus, when the PID Watchdog takes over and performs the actions follow in conjunction with the API Watchdog:
Before a service restart and also before a virtual machine reboot, the API Watchdog generates support bundles for further investigation. These support bundles are stored in /storage/core/*.tgz on VCSA and in C:\ProgramData\VMware\vCenterServer\data\core*.tgz on vCenter Server for Windows form factor.
The API Watchdog is also referred to as “IIAD” (Interservice Interrogation and Activation Daemon.) The configuration settings for the API Watchdog are stored in a JSON file named “iiad.json” and can be found in the /etc/vmware/ on the VCSA or C:\ProgramData\VMware\vCenterServer\cfg\iiad.json on the Windows form factor.
Let’s take a look at the contents of the iiad.json configuration file:
mgmt01vc01.sddc.local:/ # cat /etc/vmware/iiad.json
{
"requestTimeout": 20,
"hysteresisCount": 4,
"remediatedHysteresisCount": 6,
"rebootShellCmd": null,
"restartShellCmd": null,
"maxTotalFailures": 50,
"needShellOnWin": true,
"watchdogDisabled": false,
"vpxd.watchdogDisabled": false,
"createSupportBundle": true,
"automaticServiceRestart": true,
"automaticSystemReboot": false,
"maxSingleRestarts": 3,
"maxSingleFailures": 2
}
So what exactly do these parameters mean? Let’s take a look at each of these below:
Note: The details provided above are for education purposes only. Do not make changes to any of the above parameters unless instructed to do so by VMware Global Support Services.
In addition to the support bundles before a service restart or a virtual machine reboot, it also logs its activities to /var/log/vmware/iiad/* on the VCSA and to %VMWARE_LOG_DIR%/iiad/* on the Windows form factor.
And there you have it, a little bit of a deep dive on the “out of the box” watchdog functionality for vCenter Server 6.0. By managing and periodically verifying the status of vCenter Server processes with the PID Watchdog and the vCenter Server API with the API Watchdog, we provide better availability for the services and a faster RTO.