Sunday, October 4, 2015

ESXi 5.x Console Scripts - Part 1 - Storage events

This post is the first in a series on using the ESXi console to parse the ESXi log files - primarily the vmkernel logs.
It deals strictly with storage-related events that appear in the vmkernel and vobd logs. The parsing commands focus on
events that indicate some sort of problem - either a performance issue or an outright failure.

There are a number of tools out there that can analyze the various ESXi logs. However, I've found that dynamic situations
frequently arise which require a quick analysis of the logs for events that may indicate a problem. These parsing
scripts (one-line commands, really) are intended to be a sort of "swiss army knife" you can use to slice and dice the logs for
particular types of errors or warnings.

What I find most useful about them is that some of the commands parse out the number of occurrences of a particular error
or warning. If you have, say, three storage-related warnings in your logs over the last month, it's probably nothing to worry about.
If you have 100 events, it may require further investigation. If you have 100 events for the same LUN, it probably requires
immediate investigation.
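The per-LUN counting idea above can be sketched as a small filter that flags any LUN whose warning count exceeds a threshold. The function name, the threshold, and the fabricated log lines below are my own illustrative assumptions, not one of the original one-liners; on a real host you would feed it something like `grep "latency increased" /var/log/vmkernel.log` instead of the here-document.

```shell
# Sketch: count latency warnings per LUN, print only LUNs over a threshold.
flag_busy_luns() {
  # $1 = threshold; log lines arrive on stdin
  awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^naa\./) print $i }' \
    | sort | uniq -c \
    | awk -v n="$1" '$1 > n { printf "investigate: %s (%s events)\n", $2, $1 }'
}

# Demo with fabricated lines; a real run would pipe in vmkernel.log content.
flag_busy_luns 2 <<'EOF'
WARNING: ScsiDeviceIO: 1223: Device naa.6006aaa performance has deteriorated.
WARNING: ScsiDeviceIO: 1223: Device naa.6006aaa performance has deteriorated.
WARNING: ScsiDeviceIO: 1223: Device naa.6006aaa performance has deteriorated.
WARNING: ScsiDeviceIO: 1223: Device naa.6006bbb performance has deteriorated.
EOF
# -> investigate: naa.6006aaa (3 events)
```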

I hope that you find some of these useful. They've served me extremely well over the last couple of years as I added to them
each time I encountered a new type of significant error or warning. Many of these commands may be valid for ESXi 6.0, but I have
not had time to test them all, so please feel free to post your observations with ESXi 6.0 and above in the comments.

Latency in Logs

The following commands report on the LUNs that are logging latency events.

Note: The formatting for this posting may be adjusted. Please be patient while it is worked out. The content is all valid nevertheless.

High Latency Events in Current vmkernel log 

less /var/log/vmkernel.log | grep "I/O latency"

Sample Output
2015-10-04T06:50:46.103Z cpu48:33785)WARNING: ScsiDeviceIO: 1223: Device naa.600602600eb02200b24c5fb1a8b6e411 performance has deteriorated. I/O latency increased from average value of 3399 microseconds to 68621 microseconds.
2015-10-04T06:59:11.887Z cpu8:33783)WARNING: ScsiDeviceIO: 1223: Device naa.600602600eb02200b24c5fb1a8b6e411 performance has deteriorated. I/O latency increased from average value of 3400 microseconds to 141513 microseconds.
2015-10-04T07:25:47.971Z cpu42:33785)ScsiDeviceIO: 1203: Device naa.600602600eb02200b24c5fb1a8b6e411 performance has improved. I/O latency reduced from 141513 microseconds to 28281 microseconds.
2015-10-04T07:51:24.120Z cpu34:33783)ScsiDeviceIO: 1203: Device naa.600602600eb02200b24c5fb1a8b6e411 performance has improved. I/O latency reduced from 28281 microseconds to 6663 microseconds.
  
High Latency events in all logs including gzipped

for i in /var/run/log/vmkernel*.gz; do zcat "$i" | grep "I/O latency"; done | less

Sample Output
2015-10-03T12:05:09.793Z cpu32:33783)ScsiDeviceIO: 1203: Device naa.600009800003322100713533035413846 performance has improved. I/O latency reduced from 98388 microseconds to 19519 microseconds.
2015-10-03T12:06:01.464Z cpu49:33785)ScsiDeviceIO: 1203: Device naa.600009800003322100713533035413846 performance has improved. I/O latency reduced from 19519 microseconds to 9704 microseconds.
2015-10-02T08:10:31.272Z cpu50:33783)WARNING: ScsiDeviceIO: 1223: Device naa.600009800003322100713533033303533 performance has deteriorated. I/O latency increased from average value of 4331 microseconds to 86771 microseconds.
2015-10-02T08:14:57.857Z cpu51:33783)ScsiDeviceIO: 1203: Device naa.600009800003322100713533033303533 performance has improved. I/O latency reduced from 86771 microseconds to 17009 microseconds.
2015-10-02T08:16:00.673Z cpu47:33783)ScsiDeviceIO: 1203: Device naa.600009800003322100713533033303533 performance has improved. I/O latency reduced from 17009 microseconds to 8677 microseconds.


Count of latency events in all logs
for i in /var/run/log/vmkernel*.gz; do zcat "$i" | grep "I/O latency"; done | wc -l

Sample Output
99

Count of SCSI error events per LUN ID in current log

cat /var/log/vmkernel.log | grep -i scsideviceio | grep naa | grep -v "performance has" | cut -d ' ' -f 13 | sort | uniq -c
      3 "naa.60000880000289202113533031463734"
      1 "naa.600009800003322100713533034434241"
      1 "naa.600009800003322100713533035314533"
      1 "naa.600009800003322100713533035413845"

Note: The grep -v "performance has" excludes both the latency-increase and latency-improvement messages, so what remains - and what is counted here - are the ScsiDeviceIO command error lines, in which field 13 is the quoted device name.

Count of SCSI error events per LUN ID in all vmkernel logs

for i in /var/run/log/vmkernel*.gz; do zcat "$i" | grep -i scsideviceio | grep naa.60000 | grep -v "performance has" | cut -d ' ' -f 13; done | sort | uniq -c
      2 "naa.60000880000289202113533030334436"
      5 "naa.60000880000289202113533030373445"
      2 "naa.60000880000289202113533030383341"
      69 "naa.60000880000289202113533030393644"
      2 "naa.60000880000289202113533031414345"

Note: In the following sections the commands to parse the current vmkernel log and all available vmkernel logs will be listed together.


SCSI Errors in Logs

The following commands report on the properties of the LUNs reporting SCSI command errors

Count of SCSI errors per SCSI Opcode

cat /var/log/vmkernel.log | grep -i scsideviceio | grep naa | grep -v "performance has" | cut -d ' ' -f 5 | sort | uniq -c 
for i in /var/run/log/vmkernel*.gz; do zcat "$i" | grep -i scsideviceio | grep naa | grep -v "performance has" | cut -d ' ' -f 5; done | sort | uniq -c

Sample Output

     38 0x28,
     26 0x2a,
    104 0x85,
     68 0xfe,

Where the first number is the count and the second is the SCSI Opcode.

Note: The SCSI Opcodes are based on the T10 standards. A few of the more common ones are listed here:

SCSI Opcode    Command Details
0x28           Read - READ(10)
0x2a           Write - WRITE(10)
0x85           ATA Pass-through
0x89           SCSI Compare and Write - ATS
0x41           WRITE SAME(10)
0x93           WRITE SAME(16)
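To save flipping back to the table, the mapping above can be folded into the opcode-count pipeline so the counts come out annotated. The decode_opcodes helper below is my own sketch, not one of the original one-liners; only the opcodes listed above are mapped, and anything else (such as 0xfe) is labeled unknown.

```shell
# Annotate the "count opcode," lines from the uniq -c output above with
# the command names from the table; unmapped opcodes become "(unknown)".
decode_opcodes() {
  awk 'BEGIN {
         name["0x28"] = "READ(10)";         name["0x2a"] = "WRITE(10)"
         name["0x85"] = "ATA PASS-THROUGH"; name["0x89"] = "COMPARE AND WRITE (ATS)"
         name["0x41"] = "WRITE SAME(10)";   name["0x93"] = "WRITE SAME(16)"
       }
       {
         op = $2; sub(/,$/, "", op)           # strip the trailing comma
         label = (op in name) ? name[op] : "(unknown)"
         printf "%7d  %-6s %s\n", $1, op, label
       }'
}

# Usage (appended to the count-per-opcode pipeline):
#   cat /var/log/vmkernel.log | grep -i scsideviceio | grep naa \
#     | grep -v "performance has" | cut -d ' ' -f 5 | sort | uniq -c | decode_opcodes
```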

As for the 0xfe opcode, which I have often observed in the logs that I have analyzed, I have not been able to pin it down yet. It falls within the T10 vendor-specific opcode range, so my suspicion is that it is some sort of VAAI-specific opcode for devices manufactured by EMC.

Exclude VAAI command errors

The following filters are based on the assumption that 0xfe is a VAAI-related command opcode.

for i in /var/run/log/vmkernel*.gz; do zcat "$i" | grep -i scsideviceio | grep naa | grep -v "performance has" | grep -v "0xfe" | cut -d ' ' -f 13; done | sort | uniq -c

Sample Output

      2 "naa.60000880000289202113533030334436"
      2 "naa.60000880000289202113533030373445"
      2 "naa.60000880000289202113533030383341"
      2 "naa.60000880000289202113533030443130"
     32 "naa.60000880000289202113533030444537"
  
Include only VAAI command errors - view entries

for i in /var/run/log/vmkernel*.gz; do zcat "$i" | grep -i scsideviceio | grep naa | grep "0xfe"; done

Sample Output

2015-10-03T06:33:57.661Z cpu27:33785)ScsiDeviceIO: 2325: Cmd(0x412ec369ed40) 0xfe, CmdSN 0x44b41a from world 32837 to dev "naa.60000880000289202113533031414145" failed H:0x0 D:0x2 P:0x5 Possible sense data: 0x2 0x3a 0x2.
2015-10-03T09:15:51.662Z cpu54:33785)ScsiDeviceIO: 2325: Cmd(0x413ac33a0080) 0xfe, CmdSN 0x44cb42 from world 32837 to dev "naa.60000880000289202113533031414145" failed H:0x0 D:0x2 P:0x5 Possible sense data: 0x5 0x20 0x0.
2015-10-03T10:00:57.663Z cpu49:33783)ScsiDeviceIO: 2325: Cmd(0x413ac373ea80) 0xfe, CmdSN 0x44d174 from world 32837 to dev "naa.60000880000289202113533031414145" failed H:0x0 D:0x2 P:0x5


Count only VAAI command errors per LUN

for i in /var/run/log/vmkernel*.gz; do zcat "$i" | grep -i scsideviceio | grep naa | grep -v "performance has" | grep "0xfe" | cut -d ' ' -f 13; done | sort | uniq -c

Sample Output

     37 "naa.60000880000289202113533030444537"
      4 "naa.60000880000289202113533031353639"
     27 "naa.60000880000289202113533031414145"

Count Errors Host/Device/Sense Code

for i in /var/run/log/vmkernel*.gz; do zcat "$i" | grep "Valid sense" | cut -d ' ' -f 15-23; done | sort | uniq -c
for i in /var/run/log/vmkernel*.gz; do zcat "$i" | grep "Possible sense" | cut -d ' ' -f 15-23; done | sort | uniq -c

Sample Output

   5109 H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
   8252 H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
     18 H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x3f 0xe.
      2 H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x44 0x0.
     37 H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0.
      4 permanently unavailable. H:0x0 D:0x2 P:0x0 Valid sense data: 0x5


Note: This one is particularly useful when combined with the SCSI sense code decoder such as the one that was posted last week on this site or the one by Florian Grehl at Virten.net.
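For quick triage right on the console, the first byte of the sense-data triplet (the SCSI sense key) can be translated inline. The helper below is my own sketch using the standard T10 sense key names; the second and third bytes (ASC/ASCQ) require a much larger table, which is what the full decoders provide.

```shell
# Map a SCSI sense key (first byte of the "Valid sense data" triplet) to
# its standard T10 name. E.g. "0x5 0x24 0x0" -> ILLEGAL REQUEST, with
# ASC/ASCQ 0x24/0x0 (invalid field in CDB).
sense_key_name() {
  case "$1" in
    0x0) echo "NO SENSE" ;;
    0x1) echo "RECOVERED ERROR" ;;
    0x2) echo "NOT READY" ;;
    0x3) echo "MEDIUM ERROR" ;;
    0x4) echo "HARDWARE ERROR" ;;
    0x5) echo "ILLEGAL REQUEST" ;;
    0x6) echo "UNIT ATTENTION" ;;
    0x7) echo "DATA PROTECT" ;;
    0xb) echo "ABORTED COMMAND" ;;
    0xe) echo "MISCOMPARE" ;;
    *)   echo "UNKNOWN ($1)" ;;
  esac
}

sense_key_name 0x5   # -> ILLEGAL REQUEST
```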

List which SCSI command Serial Numbers failed

for i in /var/run/log/vmkernel*.gz; do zcat "$i" | grep "Valid sense" | cut -d ' ' -f 4,15-23; done | sort

Sample Output

Cmd(0x413cc81857c0) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Cmd(0x413cc8185cc0) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Cmd(0x413cc8185cc0) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Cmd(0x413cc8185cc0) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Cmd(0x413cc8186940) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Cmd(0x413cc8186940) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Cmd(0x413cc8186940) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Cmd(0x413cc8186bc0) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Path permanently unavailable. H:0x0 D:0x2 P:0x0 Valid sense data: 0x5
Path permanently unavailable. H:0x0 D:0x2 P:0x0 Valid sense data: 0x5

Note: You can see from this that there are multiple occurrences of the same SCSI command serial number in the logs. I'm still working out the significance of this. Feel free to post in the comments if you have something to add.


ATS Miscompare

for i in /var/log/vmkernel*.*; do cat "$i" | grep "ATS Miscompare"; done | less
for i in /var/run/log/vmkernel*.gz; do zcat "$i" | grep "ATS Miscompare"; done | less

Sample output

2015-10-02T14:53:48.632Z cpu19:33783)HBX: 1930: ATS Miscompare detected between test and set HB images at offset 3825664 on vol '51c2a4c2-2ab2da33-3289-14feb5c67712'.
2015-10-01T08:01:12.557Z cpu48:33783)HBX: 1930: ATS Miscompare detected between test and set HB images at offset 3825664 on vol '51c2a4c2-2ab2da33-3289-14feb5c67712'.
2015-10-01T13:49:42.585Z cpu48:33783)HBX: 1930: ATS Miscompare detected between test and set HB images at offset 3825664 on vol '51c2a4c2-2ab2da33-3289-14feb5c67712'.

Note: This is helpful when parsing logs for the ATS heartbeat issues with ESXi 5.5 Update 2 that have been covered in recent VMware KB articles and blog entries.



"Reclaimed heartbeat" events

for i in /var/log/vmkernel*.*; do cat "$i" | grep "Reclaimed heartbeat"; done | less
for i in /var/run/log/vmkernel*.gz; do zcat "$i" | grep "Reclaimed heartbeat"; done | less

Sample Output
2015-10-03T10:00:57.868Z cpu56:32921)HBX: 270: Reclaimed heartbeat for volume 51c2a4c2-2ab2da33-3289-14feb5c67712 (VMFS-ESXT2-LUN13b): [Timeout] Offset 3825664
2015-10-02T06:16:06.813Z cpu48:32921)HBX: 270: Reclaimed heartbeat for volume 51c2a4c2-2ab2da33-3289-14feb5c67712 (VMFS-ESXT2-LUN13b): [Timeout] Offset 3825664
2015-10-02T08:52:33.618Z cpu32:32919)HBX: 270: Reclaimed heartbeat for volume 51c2a4c2-2ab2da33-3289-14feb5c67712 (VMFS-ESXT2-LUN13b): [Timeout] Offset 3825664

Note: These events indicate that heartbeats to a datastore timed out and had to be reclaimed, which is not normal behavior.

Generic filter for Heartbeat related events

for i in /var/run/log/vmkernel*.gz; do zcat "$i" | grep "HB"; done | less

Sample Output

2015-10-03T06:33:57.661Z cpu27:33785)HBX: 1930: ATS Miscompare detected between test and set HB images at offset 3825664 on vol '51c2a4c2-2ab2da33-3289-14feb5c67712'.
2015-10-03T06:33:57.661Z cpu27:33785)HBX: 1931: Test version: [HB state abcdef02 offset 3825664 gen 3125 stampUS 8070728625033 uuid 559450dc-b4b36329-0bbe-d4ae52e9ddda jrnl <FB 497889> drv 14.60]
2015-10-03T06:33:57.661Z cpu27:33785)HBX: 1932: Set version: [HB state abcdef02 offset 3825664 gen 3125 stampUS 8070731625055 uuid 559450dc-b4b36329-0bbe-d4ae52e9ddda jrnl <FB 497889> drv 14.60]
2015-10-03T06:33:57.762Z cpu43:32919)HBX: 270: Reclaimed heartbeat for volume 51c2a4c2-2ab2da33-3289-14feb5c67712 (VMFS-ESXT2-LUN13b): [Timeout] Offset 3825664
2015-10-03T06:33:57.762Z cpu43:32919)[HB state abcdef02 offset 3825664 gen 3125 stampUS 8070731734593 uuid 559450dc-b4b36329-0bbe-d4ae52e9ddda jrnl <FB 497889> drv 14.60]
2015-10-03T09:15:51.662Z cpu54:33785)HBX: 1930: ATS Miscompare detected between test and set HB images at offset 3825664 on vol '51c2a4c2-2ab2da33-3289-14feb5c67712'.

Note: I'm still researching what some of these messages mean. The VMware KBs and web searches in general turn up very limited information on exactly what these messages indicate.

FC Blocked

for i in /var/run/log/vmkernel*.gz; do zcat "$i" | grep 'blocked FC'; done

Note: These errors can occur when there is a problem with multiple FC paths being down or if the HBA and LUN queues fill up. An example where this can occur is an FC array level failure or multiple LUN trespasses on an ALUA based array.


Bus reset issued at ESX host level

less /var/log/vmkernel.log | grep -i "failed H:0x8"
for i in /var/run/log/vmkernel*.gz; do zcat "$i" | grep "H:0x8"; done

Note: This is an interesting status code to watch for. H:0x8 means that the host has determined that there is a serious problem with the LUN in question and has issued a SCSI bus reset to it in an attempt to bring it back to normal operation.
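A follow-up question is which LUNs are taking these bus resets. The count_bus_resets helper below is my own addition, and it assumes the dev "naa...." layout shown in the earlier sample output.

```shell
# Count H:0x8 (bus reset) completions per device. Assumes the device
# appears in the log line as: ... to dev "naa.xxxx" failed H:0x8 ...
count_bus_resets() {
  grep 'H:0x8' | sed -n 's/.*dev "\(naa\.[^"]*\)".*/\1/p' | sort | uniq -c
}

# Real run (all rotated logs):
#   for i in /var/run/log/vmkernel*.gz; do zcat "$i"; done | count_bus_resets
```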

VSCSI reset request

for i in /var/run/log/vmkernel*.gz; do zcat "$i" | grep 'Reset request '; done

Sample Output

2015-09-27T16:05:24.476Z cpu34:7822793)VSCSI: 2447: handle 8502(vscsi0:1):Reset request on FSS handle 1413303666 (0 outstanding commands) from (vmm0:ESXVM2010.somedomain.com)

Note: This indicates that a SCSI reset request was submitted by the VM itself to one of its LUNs - either an RDM or a datastore. The conditions under which the guest issues a LUN reset are OS dependent, but Windows 2003 and above will issue LUN reset commands if too many SCSI commands queue up within the OS without being acknowledged within a set amount of time. In Windows 2008 and above this takes several minutes to trigger, so it clearly indicates a serious problem.


VOBD Log errors

Generic "lost" errors

cat /var/log/vobd.log | grep -i "lost"
for i in /var/run/log/vobd*.gz; do zcat "$i" | grep -i "lost"; done

Sample Output

2015-04-18T01:36:28.187Z: [vmfsCorrelator] 6479770535964us: [vob.vmfs.ats.support.lost] ATS-Only VMFS volume 'VMFS-ESXT2-LUN02' is not mounted. This host does not support ATS or ATS initialization has failed.
2015-02-09T15:05:25.285Z: [scsiCorrelator] 653098960126us: [esx.problem.storage.redundancy.lost] Lost path redundancy to storage device naa.60000480000834600113533036414143. Path vmhba2:C0:T0:L56 is down. Affected datastores: Unknown.
2015-02-09T15:05:25.290Z: [scsiCorrelator] 653098965819us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.60000480000234600112533036414143. Path vmhba4:C0:T0:L56 is down. Affected datastores: Unknown.

Note: The first, ATS-related error ties in with the ESXi 5.5 Update 2 ATS heartbeat issue mentioned earlier in this posting.

Specific "lost" errors

for i in /var/run/log/vobd*.gz; do zcat "$i" | grep -i "lost connectivity"; done
for i in /var/run/log/vobd*.gz; do zcat "$i" | grep -i "lost path redundancy"; done
for i in /var/run/log/vobd*.gz; do zcat "$i" | grep -i "ATS initialization has failed"; done

Find LUNs that have been "unmapped"

cat /var/log/vmkernel.log | grep "unmapped" | cut -d ' ' -f 6 | sort | uniq -c
for i in /var/run/log/vmkernel*.gz; do zcat "$i" | grep "unmapped" | cut -d ' ' -f 6; done | sort | uniq -c


Find LUNs marked permanently inaccessible

cat /var/log/vmkernel.log | grep "permanently inaccessible" | cut -d ' ' -f 6 | sort | uniq -c
for i in /var/run/log/vmkernel*.gz; do zcat "$i" | grep "permanently inaccessible" | cut -d ' ' -f 6; done | sort | uniq -c

Sample Output

2015-09-29T09:26:45.442Z cpu18:21545480)WARNING: ScsiDevice: 1503: Device naa.600604500eb03200559c81366367e511 has been plugged back in after being marked permanently inaccessible. No data consistency guarantees


Dropped frames on the HBA (assumes Qlogic)

for i in /proc/scsi/qla2xxx/*; do cat "$i" | grep retried; done

/usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -a | grep retried

Sample Output

Commands retried with dropped frame(s) = 0
Commands retried with dropped frame(s) = 200
Commands retried with dropped frame(s) = 0
Commands retried with dropped frame(s) = 1

Note: This particular example only works for Qlogic-based HBAs; I'm working on an equivalent for Emulex. Basically, though, these counters should always be zero.
Any nonzero count indicates a problem at the hardware level, since the frame drops occur at the physical layer. Possible causes are bad cables, failing GBICs, or other hardware issues.
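Rather than eyeballing the counter lines, a small filter can print only the nonzero ones. The flag_dropped_frames helper below is my own sketch and assumes the "Commands retried with dropped frame(s) = N" format shown above.

```shell
# Print only nonzero "dropped frame" counters from the Qlogic stats.
flag_dropped_frames() {
  awk -F ' = ' '/retried with dropped frame/ && $2 + 0 > 0 {
    print "dropped frames detected:", $2
  }'
}

# Real run:
#   for i in /proc/scsi/qla2xxx/*; do cat "$i"; done | flag_dropped_frames
```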
