After upgrading from ESXi 4.1/ESXi 5.0 to ESXi 5.5 u2, I noticed increased latency events on hosts. More troubling, the affected hosts were frequently dropping all presented datastores, though they would reconnect within a few seconds. The events may appear in your event log as below:
While there are many possible causes to explore for these sorts of connectivity issues, one that is often overlooked is how ESXi heartbeats to the datastores. Starting with ESXi 5.5 u2, the default datastore heartbeating method switched to Atomic Test and Set (ATS). Previous releases of ESXi used a SCSI reservation method to confirm the datastores were still present.
With ATS, the host sends a heartbeat packet to the datastore every 8 seconds to make sure the datastore is still available. If the storage array is VAAI aware and compatible with ATS heartbeating, this works efficiently. If the array can't identify this heartbeat packet and respond to it very quickly, the host may treat the datastore as lost and you may encounter an "atomic" failure.
Storage arrays that don't use VAAI plugins to identify ATS heartbeats treat those packets like any other storage request. During periods of high I/O, the array puts the ATS heartbeat into the queue and responds to it eventually. ATS heartbeating is very sensitive to latency, so the host starts to panic and sends another packet. Eventually the host just drops the connection and reconnects.
The datastore exists and is available; it just doesn't respond in time to the heartbeat packet storm. It reconnects quickly, but the momentary drop will pause the running VMs. Many applications aren't written to withstand a disconnect of even a few seconds, so a paused VM can lead to app crashes and cranky end users.
# How Can We Fix It?
In a perfect world you would check with the storage vendor to see if there's an update or a plugin that makes the array support ATS heartbeating. This isn't always possible, so luckily for us there is an easy, scriptable way to change this setting. It uses PowerCLI and can change the setting on an entire datacenter in one line.
To start, let's check the setting on one host:
Get-VMHost "hostname" | Get-AdvancedSetting -Name VMFS3.UseATSForHBOnVMFS5
This will return a line that tells you whether the feature is enabled on that host. In 5.5 u2 and beyond it is enabled by default (a value of 1).
Now we can disable the setting on that host by setting its value to 0:
Get-VMHost "hostname" | Get-AdvancedSetting -Name VMFS3.UseATSForHBOnVMFS5 | Set-AdvancedSetting -Value 0 -Confirm:$false
The -Confirm:$false parameter suppresses the confirmation prompt, so PowerCLI does not stop to ask about each host. That matters when you run the command against the complete datacenter:
Get-Datacenter "datacenter" | Get-VMHost | Get-AdvancedSetting -Name VMFS3.UseATSForHBOnVMFS5 | Set-AdvancedSetting -Value 0 -Confirm:$false
This will find the named datacenter in your vCenter server, get all the hosts in that datacenter, and disable ATS heartbeating on VMFS5 datastores for each of them.
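After a bulk change like this, it may be worth confirming that every host actually picked up the new value. The following is a minimal sketch, assuming you are already connected with Connect-VIServer and that "datacenter" is a placeholder for your own datacenter name; it lists each host next to the current value of the setting.

```powershell
# Sketch only: assumes an active Connect-VIServer session.
# "datacenter" is a placeholder; substitute your own datacenter name.
Get-Datacenter "datacenter" | Get-VMHost |
    Get-AdvancedSetting -Name VMFS3.UseATSForHBOnVMFS5 |
    # Entity is the host that the setting belongs to
    Select-Object @{Name = 'Host'; Expression = { $_.Entity.Name }}, Name, Value |
    Sort-Object Host
```

A Value of 0 means ATS heartbeating is disabled on that host and 1 means it is enabled. If the array later gains ATS heartbeat support, the same pipeline with Set-AdvancedSetting -Value 1 should turn it back on.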
You can find more information on this and other VAAI issues in VMware KB 1033665.