This post is the first in a series on parsing the ESXi log files and using the ESXi console - primarily parsing vmkernel logs. This post deals strictly with storage-related events that appear in the vmkernel and vobd logs. The parsing commands focus on events that indicate some sort of problem - either a performance issue or some sort of failure.
There are a number of tools out there that can analyze the various ESXi logs; however, I've found that dynamic situations frequently arise which require a quick analysis of the logs for events that may indicate a problem. These parsing scripts (one-line commands, really) are intended to be a sort of "swiss army knife" you can use to slice and dice the logs for particular types of errors or warnings.
What I find most useful about them is that some of the commands count the occurrences of a particular error or warning. If you have, say, three storage-related warnings in your logs over the last month, it's probably nothing to worry about. If you have 100 events, it may require further investigation. If you have 100 events for the same LUN, then it probably requires immediate investigation.
I hope that you find some of these useful. They've been extremely useful for me over the last couple of years as I added to them each time I encountered a new type of significant error or warning. Many of these commands may be valid for ESXi 6.0, but I have not had the time to test them all out, so please feel free to post your observations with ESXi 6.0 and above in the comments.
Latency in Logs
The following commands report on the LUNs that are logging latency events.
High Latency Events in Current vmkernel log
less /var/log/vmkernel.log | grep "I/O latency"
Sample Output
2015-10-04T06:50:46.103Z cpu48:33785)WARNING: ScsiDeviceIO: 1223: Device naa.600602600eb02200b24c5fb1a8b6e411 performance has deteriorated. I/O latency increased from average value of 3399 microseconds to 68621 microseconds.
2015-10-04T06:59:11.887Z cpu8:33783)WARNING: ScsiDeviceIO: 1223: Device naa.600602600eb02200b24c5fb1a8b6e411 performance has deteriorated. I/O latency increased from average value of 3400 microseconds to 141513 microseconds.
2015-10-04T07:25:47.971Z cpu42:33785)ScsiDeviceIO: 1203: Device naa.600602600eb02200b24c5fb1a8b6e411 performance has improved. I/O latency reduced from 141513 microseconds to 28281 microseconds.
2015-10-04T07:51:24.120Z cpu34:33783)ScsiDeviceIO: 1203: Device naa.600602600eb02200b24c5fb1a8b6e411 performance has improved. I/O latency reduced from 28281 microseconds to 6663 microseconds.
2015-10-04T11:56:07.092Z cpu44:3
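To rank the worst latency spikes, an awk variant should work as well. This is a sketch based on the field positions visible in the sample lines above (the device name is the sixth space-separated field and the new latency value is the twentieth), so verify those positions against your own logs first:
less /var/log/vmkernel.log | grep "latency increased" | awk '{print $20, $6}' | sort -rn | head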
High Latency events in all logs including gzipped
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "I/O latency"; done | less
Sample Output
2015-10-03T12:05:09.793Z cpu32:33783)ScsiDeviceIO: 1203: Device naa.600009800003322100713533035413846 performance has improved. I/O latency reduced from 98388 microseconds to 19519 microseconds.
2015-10-03T12:06:01.464Z cpu49:33785)ScsiDeviceIO: 1203: Device naa.600009800003322100713533035413846 performance has improved. I/O latency reduced from 19519 microseconds to 9704 microseconds.
2015-10-02T08:10:31.272Z cpu50:33783)WARNING: ScsiDeviceIO: 1223: Device naa.600009800003322100713533033303533 performance has deteriorated. I/O latency increased from average value of 4331 microseconds to 86771 microseconds.
2015-10-02T08:14:57.857Z cpu51:33783)ScsiDeviceIO: 1203: Device naa.600009800003322100713533033303533 performance has improved. I/O latency reduced from 86771 microseconds to 17009 microseconds.
2015-10-02T08:16:00.673Z cpu47:33783)ScsiDeviceIO: 1203: Device naa.600009800003322100713533033303533 performance has improved. I/O latency reduced from 17009 microseconds to 8677 microseconds.
Count of latency events in all logs
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "I/O latency"; done | wc -l
Sample Output
99
Count of failed SCSI commands per LUN ID in current log
cat /var/log/vmkernel.log | grep -i scsideviceio | grep naa | grep -v "performance has" | cut -d ' ' -f 13 | sort | uniq -c
3 "naa.60000880000289202113533031463734"
1 "naa.600009800003322100713533034434241"
1 "naa.600009800003322100713533035314533"
1 "naa.600009800003322100713533035413845"
Note: The grep -v "performance has" excludes both the latency-increase and latency-improvement messages, so what gets counted here are the failed SCSI command events logged per LUN rather than the latency notices themselves.
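If you do want a count of the latency-deterioration warnings themselves per LUN, a variant like the following should work. This is a sketch assuming the device name is the sixth space-separated field in the warning lines, as in the samples further up:
cat /var/log/vmkernel.log | grep "latency increased" | cut -d ' ' -f 6 | sort | uniq -c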
Count of failed SCSI commands per LUN ID in all vmkernel logs
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep -i scsideviceio | grep naa.60000 | grep -v "performance has" | cut -d ' ' -f 13 | sort | uniq -c; done
2 "naa.60000880000289202113533030334436"
5 "naa.60000880000289202113533030373445"
2 "naa.60000880000289202113533030383341"
69 "naa.60000880000289202113533030393644"
2 "naa.60000880000289202113533031414345"
Note: In the following sections, the commands to parse the current vmkernel log and all available vmkernel logs are listed together.
SCSI Errors in Logs
The following commands report on the LUNs that are logging SCSI command errors.
Count of SCSI errors per SCSI Opcode
cat /var/log/vmkernel.log | grep -i scsideviceio | grep naa | grep -v "performance has" | cut -d ' ' -f 5 | sort | uniq -c
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep -i scsideviceio | grep naa | grep -v "performance has" | cut -d ' ' -f 5 | sort | uniq -c; done
Sample Output
38 0x28,
26 0x2a,
104 0x85,
68 0xfe,
Where the first number is the count and the second is the SCSI Opcode.
Note: The SCSI Opcodes are based on the T10 standards. A few of the more common ones are listed here:
SCSI Opcode | Command Details
0x28        | Read
0x2a        | Write
0x85        | ATA Pass-through
0x89        | SCSI Compare and Write - ATS
0x41        | WRITE SAME(10)
0x93        | WRITE SAME(16)
As for the 0xfe opcode, which I have often observed in the logs I have analyzed, I have not been able to pin it down yet, but my suspicion is that it is some sort of VAAI-specific Opcode for devices manufactured by EMC.
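To see which opcodes are failing against which LUNs in a single pass, an awk variant of the commands above should work. This is a sketch assuming the same field positions as in the failed-command samples further down (the opcode is field 5 and the quoted device name is field 13):
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep -i scsideviceio | grep naa | grep -v "performance has" | awk '{print $5, $13}' | sort | uniq -c; done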
Exclude VAAI command errors
The following two filters are based on the assumption that 0xfe is a valid VAAI command Opcode.
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep -i scsideviceio | grep naa | grep -v "performance has" | grep -v "0xfe" | cut -d ' ' -f 13 | sort | uniq -c; done
Sample Output
2 "naa.60000880000289202113533030334436"
2 "naa.60000880000289202113533030373445"
2 "naa.60000880000289202113533030383341"
2 "naa.60000880000289202113533030443130"
32 "naa.60000880000289202113533030444537"
Include only VAAI command errors - view entries
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep -i scsideviceio | grep naa | grep "0xfe"; done
Sample Output
2015-10-03T06:33:57.661Z cpu27:33785)ScsiDeviceIO: 2325: Cmd(0x412ec369ed40) 0xfe, CmdSN 0x44b41a from world 32837 to dev "naa.60000880000289202113533031414145" failed H:0x0 D:0x2 P:0x5 Possible sense data: 0x2 0x3a 0x2.
2015-10-03T09:15:51.662Z cpu54:33785)ScsiDeviceIO: 2325: Cmd(0x413ac33a0080) 0xfe, CmdSN 0x44cb42 from world 32837 to dev "naa.60000880000289202113533031414145" failed H:0x0 D:0x2 P:0x5 Possible sense data: 0x5 0x20 0x0.
2015-10-03T10:00:57.663Z cpu49:33783)ScsiDeviceIO: 2325: Cmd(0x413ac373ea80) 0xfe, CmdSN 0x44d174 from world 32837 to dev "naa.60000880000289202113533031414145" failed H:0x0 D:0x2 P:0x5
Count only VAAI command errors per LUN
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep -i scsideviceio | grep naa | grep -v "performance has" | grep "0xfe" | cut -d ' ' -f 13 | sort | uniq -c; done
Sample Output
37 "naa.60000880000289202113533030444537"
4 "naa.60000880000289202113533031353639"
27 "naa.60000880000289202113533031414145"
Count Errors by Host/Device/Sense Code
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "Valid sense" | cut -d ' ' -f 15-23 | sort | uniq -c; done
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "Possible sense" | cut -d ' ' -f 15-23 | sort | uniq -c; done
Sample Output
5109 H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
8252 H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
18 H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x3f 0xe.
2 H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x44 0x0.
37 H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0.
4 permanently unavailable. H:0x0 D:0x2 P:0x0 Valid sense data: 0x5
Note: This one is particularly useful when combined with a SCSI sense code decoder, such as the one that was posted last week on this site or the one by Florian Grehl at Virten.net.
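To tie the sense codes back to specific LUNs, you can add the device field to the cut. This is a sketch assuming the quoted device name is field 13, as in the failed-command lines above; note that field positions can shift for some message variants (the "permanently unavailable" lines in the sample output show this), so treat stray rows with suspicion:
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "Valid sense" | cut -d ' ' -f 13,15-23 | sort | uniq -c; done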
List which SCSI command Serial Numbers failed
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "Valid sense" | cut -d ' ' -f 4,15-23 | sort; done
Sample Output
Cmd(0x413cc81857c0) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Cmd(0x413cc8185cc0) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Cmd(0x413cc8185cc0) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Cmd(0x413cc8185cc0) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Cmd(0x413cc8186940) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Cmd(0x413cc8186940) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Cmd(0x413cc8186940) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Cmd(0x413cc8186bc0) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Path permanently unavailable. H:0x0 D:0x2 P:0x0 Valid sense data: 0x5
Path permanently unavailable. H:0x0 D:0x2 P:0x0 Valid sense data: 0x5
Note: You can see from this that there are multiple occurrences of the same SCSI command serial number in the logs. I'm still working out the significance of this; feel free to post in the comments if you have something to add.
ATS Miscompare
for i in '/var/log/vmkernel*.*'; do cat $i | grep "ATS Miscompare"; done | less
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "ATS Miscompare"; done | less
Sample Output
2015-10-02T14:53:48.632Z cpu19:33783)HBX: 1930: ATS Miscompare detected between test and set HB images at offset 3825664 on vol '51c2a4c2-2ab2da33-3289-14feb5c67712'.
2015-10-01T08:01:12.557Z cpu48:33783)HBX: 1930: ATS Miscompare detected between test and set HB images at offset 3825664 on vol '51c2a4c2-2ab2da33-3289-14feb5c67712'.
2015-10-01T13:49:42.585Z cpu48:33783)HBX: 1930: ATS Miscompare detected between test and set HB images at offset 3825664 on vol '51c2a4c2-2ab2da33-3289-14feb5c67712'.
Note: This is helpful in parsing logs related to the recent KB articles posted about issues when using ESXi 5.5 Update 2, as noted in the article below and many recent blog entries.
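To count the miscompares per VMFS volume, a sketch like the following should work, assuming the volume UUID is the eighteenth space-separated field as in the sample lines above:
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "ATS Miscompare" | cut -d ' ' -f 18 | sort | uniq -c; done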
Count of "Reclaimed heartbeat" events
for i in '/var/log/vmkernel*.*'; do cat $i | grep "Reclaimed heartbeat"; done | less
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "Reclaimed heartbeat"; done | less
Sample Output
2015-10-03T10:00:57.868Z cpu56:32921)HBX: 270: Reclaimed heartbeat for volume 51c2a4c2-2ab2da33-3289-14feb5c67712 (VMFS-ESXT2-LUN13b): [Timeout] Offset 3825664
2015-10-02T06:16:06.813Z cpu48:32921)HBX: 270: Reclaimed heartbeat for volume 51c2a4c2-2ab2da33-3289-14feb5c67712 (VMFS-ESXT2-LUN13b): [Timeout] Offset 3825664
2015-10-02T08:52:33.618Z cpu32:32919)HBX: 270: Reclaimed heartbeat for volume 51c2a4c2-2ab2da33-3289-14feb5c67712 (VMFS-ESXT2-LUN13b): [Timeout] Offset 3825664
Note: These events indicate that heartbeats to a datastore timed out and had to be reclaimed, which is not normal behavior.
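To get an actual count of these events per volume, a sketch like the following should work, assuming the volume UUID and label are the eighth and ninth space-separated fields as in the sample lines above:
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "Reclaimed heartbeat" | cut -d ' ' -f 8,9 | sort | uniq -c; done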
Generic filter for Heartbeat related events
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "HB"; done | less
Sample Output
2015-10-03T06:33:57.661Z cpu27:33785)HBX: 1930: ATS Miscompare detected between test and set HB images at offset 3825664 on vol '51c2a4c2-2ab2da33-3289-14feb5c67712'.
2015-10-03T06:33:57.661Z cpu27:33785)HBX: 1931: Test version: [HB state abcdef02 offset 3825664 gen 3125 stampUS 8070728625033 uuid 559450dc-b4b36329-0bbe-d4ae52e9ddda jrnl <FB 497889> drv 14.60]
2015-10-03T06:33:57.661Z cpu27:33785)HBX: 1932: Set version: [HB state abcdef02 offset 3825664 gen 3125 stampUS 8070731625055 uuid 559450dc-b4b36329-0bbe-d4ae52e9ddda jrnl <FB 497889> drv 14.60]
2015-10-03T06:33:57.762Z cpu43:32919)HBX: 270: Reclaimed heartbeat for volume 51c2a4c2-2ab2da33-3289-14feb5c67712 (VMFS-ESXT2-LUN13b): [Timeout] Offset 3825664
2015-10-03T06:33:57.762Z cpu43:32919)[HB state abcdef02 offset 3825664 gen 3125 stampUS 8070731734593 uuid 559450dc-b4b36329-0bbe-d4ae52e9ddda jrnl <FB 497889> drv 14.60]
2015-10-03T09:15:51.662Z cpu54:33785)HBX: 1930: ATS Miscompare detected between test and set HB images at offset 3825664 on vol '51c2a4c2-2ab2da33-3289-14feb5c67712'.
Note: I'm still researching what some of these messages mean. The VMware KBs and web searches in general turn up very limited information on exactly what these messages indicate.
FC Blocked
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep 'blocked FC'; done
Note: These errors can occur when multiple FC paths are down or when the HBA and LUN queues fill up. Examples where this can occur are an FC array-level failure or multiple LUN trespasses on an ALUA-based array.
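A quick way to gauge how often this has happened is to turn the same filter into a count, in the same style as the latency count earlier:
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep 'blocked FC'; done | wc -l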
Bus reset issued at ESX host level
less /var/log/vmkernel.log | grep -i "failed H:0x8"
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "H:0x8"; done
Note: This is an interesting status code to watch for. It means that the host has determined that there is a serious problem with the LUN in question and has issued a SCSI bus reset command to it in an attempt to bring it back to normal operation.
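To count these per device, a sketch like the following should work, assuming the same failed-command format shown earlier, where the quoted device name is the thirteenth space-separated field:
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "failed H:0x8" | cut -d ' ' -f 13 | sort | uniq -c; done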
VSCSI reset request
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep 'Reset request '; done
Sample Output
2015-09-27T16:05:24.476Z cpu34:7822793)VSCSI: 2447: handle 8502(vscsi0:1):Reset request on FSS handle 1413303666 (0 outstanding commands) from (vmm0:ESXVM2010.somedomain.com)
Note: This indicates that a SCSI reset request was submitted by the VM itself to one of its LUNs, on either an RDM or a datastore. The conditions under which the LUN reset is issued are OS-dependent, but Windows 2003 and above will issue LUN reset commands if too many SCSI commands queue up within the OS and are not acknowledged within a set amount of time. In the case of Windows 2008 and above, this takes several minutes to trigger, so it clearly indicates a serious problem.
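To count reset requests per VM, a sketch like the following should work, assuming the issuing VM name is the last space-separated field as in the sample line above:
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep 'Reset request ' | awk '{print $NF}' | sort | uniq -c; done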
VOBD Log errors
Generic "lost" errors
cat /var/log/vobd.log | grep -i "lost"
for i in '/var/run/log/vobd*.gz'; do zcat $i | grep -i "lost"; done
Sample Output
2015-04-18T01:36:28.187Z: [vmfsCorrelator] 6479770535964us: [vob.vmfs.ats.support.lost] ATS-Only VMFS volume 'VMFS-ESXT2-LUN02' is not mounted. This host does not support ATS or ATS initialization has failed.
2015-02-09T15:05:25.285Z: [scsiCorrelator] 653098960126us: [esx.problem.storage.redundancy.lost] Lost path redundancy to storage device naa.60000480000834600113533036414143. Path vmhba2:C0:T0:L56 is down. Affected datastores: Unknown.
2015-02-09T15:05:25.290Z: [scsiCorrelator] 653098965819us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.60000480000234600112533036414143. Path vmhba4:C0:T0:L56 is down. Affected datastores: Unknown.
Note: The top ATS-related error is mentioned in the article further up in this posting about the issue with ATS heartbeats on ESXi 5.5 Update 2.
Specific "lost" errors
for i in '/var/run/log/vobd*.gz'; do zcat $i | grep -i "lost connectivity"; done
for i in '/var/run/log/vobd*.gz'; do zcat $i | grep -i "lost path redundancy"; done
for i in '/var/run/log/vobd*.gz'; do zcat $i | grep -i "ATS initialization has failed"; done
Find LUNs that have been "unmapped"
cat /var/log/vmkernel.log | grep "unmapped" | cut
-d ' ' -f 6 | sort | uniq -c
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "unmapped" | cut -d ' ' -f 6 | sort | uniq -c; done
Find LUNs marked permanently inaccessible
cat /var/log/vmkernel.log | grep "permanently
inaccessible" | cut -d ' ' -f 6 | sort | uniq -c
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "permanently inaccessible" | cut -d ' ' -f 6 | sort | uniq -c; done
Sample Output
2015-09-29T09:26:45.442Z cpu18:21545480)WARNING: ScsiDevice: 1503: Device naa.600604500eb03200559c81366367e511 has been plugged back in after being marked permanently inaccessible. No data consistency guarantees
Dropped frames on the HBA (assumes QLogic)
for i in '/proc/scsi/qla2xxx/*'; do cat $i | grep retried; done
for i in '/usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -a'; do $i | grep retried; done
Sample Output
Commands retried
with dropped frame(s) = 0
Commands retried
with dropped frame(s) = 200
Commands retried
with dropped frame(s) = 0
Commands retried
with dropped frame(s) = 1
Note: This particular example only works for QLogic-based HBAs; I'm working on an equivalent for Emulex. These counters should always be zero. Anything other than zero indicates a problem at the hardware level, since the frame drops occur at the physical layer. Possible causes are bad cables, failing GBICs, or other hardware issues.
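Since these are cumulative counters, what matters in a live situation is whether they are still growing. A simple sampling loop makes that visible; this is a sketch using the same vmkmgmt_keyval command as above, with the interval adjustable to taste:
while true; do date; /usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -a | grep retried; sleep 60; done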