This post is the first in a series on parsing the ESXi log files and using the ESXi console - primarily parsing vmkernel logs. This post deals strictly with storage-related events that appear in the vmkernel and vobd logs. The parsing commands focus on events that indicate some sort of problem - either a performance issue or some sort of failure.
There are a number of tools out there that can analyze the various ESXi logs; however, I've found that dynamic situations frequently arise which require a quick analysis of the logs for events that may indicate a problem. These parsing scripts (one-line commands, really) are intended to be a sort of "swiss army knife" you can use to slice and dice the logs for particular types of errors or warnings.
What I find most useful about them is that some of the commands count the occurrences of a particular error or warning. If you have, say, three storage-related warnings in your logs over the last month, it's probably nothing to worry about. If you have 100 events, it may require further investigation. If you have 100 events for the same LUN, then it probably requires immediate investigation.
I hope that you find some of these useful. They've been extremely useful for me over the last couple of years as I added to them each time I encountered a new type of significant error or warning. Many of these commands may be valid for ESXi 6.0, but I have not had the time to test them all out, so please feel free to post your observations with ESXi 6.0 and above in the comments.
Latency in Logs
The following commands report on the LUNs that are logging latency events.
High Latency Events in Current vmkernel log
less /var/log/vmkernel.log | grep "I/O latency"
Sample Output
2015-10-04T06:50:46.103Z cpu48:33785)WARNING: ScsiDeviceIO: 1223: Device naa.600602600eb02200b24c5fb1a8b6e411 performance has deteriorated. I/O latency increased from average value of 3399 microseconds to 68621 microseconds.
2015-10-04T06:59:11.887Z cpu8:33783)WARNING: ScsiDeviceIO: 1223: Device naa.600602600eb02200b24c5fb1a8b6e411 performance has deteriorated. I/O latency increased from average value of 3400 microseconds to 141513 microseconds.
2015-10-04T07:25:47.971Z cpu42:33785)ScsiDeviceIO: 1203: Device naa.600602600eb02200b24c5fb1a8b6e411 performance has improved. I/O latency reduced from 141513 microseconds to 28281 microseconds.
2015-10-04T07:51:24.120Z cpu34:33783)ScsiDeviceIO: 1203: Device naa.600602600eb02200b24c5fb1a8b6e411 performance has improved. I/O latency reduced from 28281 microseconds to 6663 microseconds.
2015-10-04T11:56:07.092Z cpu44:3
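To rank the worst latency spikes, an awk variant should work as well. This is a sketch based on the field positions visible in the sample lines above (the device name is the sixth space-separated field and the new latency value is the twentieth), so verify those positions against your own logs first:
less /var/log/vmkernel.log | grep "latency increased" | awk '{print $20, $6}' | sort -rn | head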
High Latency events in all logs including gzipped
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "I/O latency"; done | less
Sample Output
2015-10-03T12:05:09.793Z cpu32:33783)ScsiDeviceIO: 1203: Device naa.600009800003322100713533035413846 performance has improved. I/O latency reduced from 98388 microseconds to 19519 microseconds.
2015-10-03T12:06:01.464Z cpu49:33785)ScsiDeviceIO: 1203: Device naa.600009800003322100713533035413846 performance has improved. I/O latency reduced from 19519 microseconds to 9704 microseconds.
2015-10-02T08:10:31.272Z cpu50:33783)WARNING: ScsiDeviceIO: 1223: Device naa.600009800003322100713533033303533 performance has deteriorated. I/O latency increased from average value of 4331 microseconds to 86771 microseconds.
2015-10-02T08:14:57.857Z cpu51:33783)ScsiDeviceIO: 1203: Device naa.600009800003322100713533033303533 performance has improved. I/O latency reduced from 86771 microseconds to 17009 microseconds.
2015-10-02T08:16:00.673Z cpu47:33783)ScsiDeviceIO: 1203: Device naa.600009800003322100713533033303533 performance has improved. I/O latency reduced from 17009 microseconds to 8677 microseconds.
Count of latency events in all logs
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "I/O latency"; done | wc -l
Sample Output
99
Count of failed SCSI commands per LUN ID in current log
cat /var/log/vmkernel.log | grep -i scsideviceio | grep naa | grep -v "performance has" | cut -d ' ' -f 13 | sort | uniq -c
3 "naa.60000880000289202113533031463734"
1 "naa.600009800003322100713533034434241"
1 "naa.600009800003322100713533035314533"
1 "naa.600009800003322100713533035413845"
Note: The grep -v "performance has" excludes both the latency-increase and latency-improvement messages, so what gets counted here are the failed SCSI command events logged per LUN rather than the latency notices themselves.
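If you do want a count of the latency-deterioration warnings themselves per LUN, a variant like the following should work. This is a sketch assuming the device name is the sixth space-separated field in the warning lines, as in the samples further up:
cat /var/log/vmkernel.log | grep "latency increased" | cut -d ' ' -f 6 | sort | uniq -c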
Count of failed SCSI commands per LUN ID in all vmkernel logs
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep -i scsideviceio | grep naa.60000 | grep -v "performance has" | cut -d ' ' -f 13 | sort | uniq -c; done
2 "naa.60000880000289202113533030334436"
5 "naa.60000880000289202113533030373445"
2 "naa.60000880000289202113533030383341"
69 "naa.60000880000289202113533030393644"
2 "naa.60000880000289202113533031414345"
Note: In the following sections, the commands to parse the current vmkernel log and all available vmkernel logs are listed together.
SCSI Errors in Logs
The following commands report on the LUNs that are logging SCSI command errors.
Count of SCSI errors per SCSI Opcode
cat /var/log/vmkernel.log | grep -i scsideviceio | grep naa | grep -v "performance has" | cut -d ' ' -f 5 | sort | uniq -c
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep -i scsideviceio | grep naa | grep -v "performance has" | cut -d ' ' -f 5 | sort | uniq -c; done
Sample Output
38 0x28,
26 0x2a,
104 0x85,
68 0xfe,
Where the first number is the count and the second is the SCSI Opcode.
Note: The SCSI Opcodes are based on the T10 standards. A few of the more common ones are listed here:
SCSI Opcode | Command Details
0x28        | Read
0x2a        | Write
0x85        | ATA Pass-through
0x89        | SCSI Compare and Write - ATS
0x41        | WRITE SAME(10)
0x93        | WRITE SAME(16)
As for the 0xfe opcode, which I have often observed in the logs I have analyzed, I have not been able to pin it down yet, but my suspicion is that it is some sort of VAAI-specific Opcode for devices manufactured by EMC.
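To see which opcodes are failing against which LUNs in a single pass, an awk variant of the commands above should work. This is a sketch assuming the same field positions as in the failed-command samples further down (the opcode is field 5 and the quoted device name is field 13):
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep -i scsideviceio | grep naa | grep -v "performance has" | awk '{print $5, $13}' | sort | uniq -c; done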
Exclude VAAI command errors
The following two filters are based on the assumption that 0xfe is a valid VAAI command Opcode.
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep -i scsideviceio | grep naa | grep -v "performance has" | grep -v "0xfe" | cut -d ' ' -f 13 | sort | uniq -c; done
Sample Output
2 "naa.60000880000289202113533030334436"
2 "naa.60000880000289202113533030373445"
2 "naa.60000880000289202113533030383341"
2 "naa.60000880000289202113533030443130"
32 "naa.60000880000289202113533030444537"
Include only VAAI command errors - view entries
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep -i scsideviceio | grep naa | grep "0xfe"; done
Sample Output
2015-10-03T06:33:57.661Z cpu27:33785)ScsiDeviceIO: 2325: Cmd(0x412ec369ed40) 0xfe, CmdSN 0x44b41a from world 32837 to dev "naa.60000880000289202113533031414145" failed H:0x0 D:0x2 P:0x5 Possible sense data: 0x2 0x3a 0x2.
2015-10-03T09:15:51.662Z cpu54:33785)ScsiDeviceIO: 2325: Cmd(0x413ac33a0080) 0xfe, CmdSN 0x44cb42 from world 32837 to dev "naa.60000880000289202113533031414145" failed H:0x0 D:0x2 P:0x5 Possible sense data: 0x5 0x20 0x0.
2015-10-03T10:00:57.663Z cpu49:33783)ScsiDeviceIO: 2325: Cmd(0x413ac373ea80) 0xfe, CmdSN 0x44d174 from world 32837 to dev "naa.60000880000289202113533031414145" failed H:0x0 D:0x2 P:0x5
Count only VAAI command errors per LUN
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep -i scsideviceio | grep naa | grep -v "performance has" | grep "0xfe" | cut -d ' ' -f 13 | sort | uniq -c; done
Sample Output
37 "naa.60000880000289202113533030444537"
4 "naa.60000880000289202113533031353639"
27 "naa.60000880000289202113533031414145"
Count Errors by Host/Device/Sense Code
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "Valid sense" | cut -d ' ' -f 15-23 | sort | uniq -c; done
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "Possible sense" | cut -d ' ' -f 15-23 | sort | uniq -c; done
Sample Output
5109 H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
8252 H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
18 H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x3f 0xe.
2 H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x44 0x0.
37 H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0.
4 permanently unavailable. H:0x0 D:0x2 P:0x0 Valid sense data: 0x5
Note: This one is particularly useful when combined with a SCSI sense code decoder, such as the one that was posted last week on this site or the one by Florian Grehl at Virten.net.
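To tie the sense codes back to specific LUNs, you can add the device field to the cut. This is a sketch assuming the quoted device name is field 13, as in the failed-command lines above; note that field positions can shift for some message variants (the "permanently unavailable" lines in the sample output show this), so treat stray rows with suspicion:
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "Valid sense" | cut -d ' ' -f 13,15-23 | sort | uniq -c; done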
List which SCSI command Serial Numbers failed
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "Valid sense" | cut -d ' ' -f 4,15-23 | sort; done
Sample Output
Cmd(0x413cc81857c0) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Cmd(0x413cc8185cc0) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Cmd(0x413cc8185cc0) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Cmd(0x413cc8185cc0) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Cmd(0x413cc8186940) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Cmd(0x413cc8186940) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Cmd(0x413cc8186940) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Cmd(0x413cc8186bc0) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Path permanently unavailable. H:0x0 D:0x2 P:0x0 Valid sense data: 0x5
Path permanently unavailable. H:0x0 D:0x2 P:0x0 Valid sense data: 0x5
Note: You can see from this that there are multiple occurrences of the same SCSI command serial number in the logs. I'm still working out the significance of this; feel free to post in the comments if you have something to add.
ATS Miscompare
for i in '/var/log/vmkernel*.*'; do cat $i | grep "ATS Miscompare"; done | less
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "ATS Miscompare"; done | less
Sample Output
2015-10-02T14:53:48.632Z cpu19:33783)HBX: 1930: ATS Miscompare detected between test and set HB images at offset 3825664 on vol '51c2a4c2-2ab2da33-3289-14feb5c67712'.
2015-10-01T08:01:12.557Z cpu48:33783)HBX: 1930: ATS Miscompare detected between test and set HB images at offset 3825664 on vol '51c2a4c2-2ab2da33-3289-14feb5c67712'.
2015-10-01T13:49:42.585Z cpu48:33783)HBX: 1930: ATS Miscompare detected between test and set HB images at offset 3825664 on vol '51c2a4c2-2ab2da33-3289-14feb5c67712'.
Note: This is helpful in parsing logs related to the recent KB articles posted about issues when using ESXi 5.5 Update 2, as noted in the article below and many recent blog entries.
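To count the miscompares per VMFS volume, a sketch like the following should work, assuming the volume UUID is the eighteenth space-separated field as in the sample lines above:
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "ATS Miscompare" | cut -d ' ' -f 18 | sort | uniq -c; done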
Count of "Reclaimed heartbeat" events
for i in '/var/log/vmkernel*.*'; do cat $i | grep "Reclaimed heartbeat"; done | less
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "Reclaimed heartbeat"; done | less
Sample Output
2015-10-03T10:00:57.868Z cpu56:32921)HBX: 270: Reclaimed heartbeat for volume 51c2a4c2-2ab2da33-3289-14feb5c67712 (VMFS-ESXT2-LUN13b): [Timeout] Offset 3825664
2015-10-02T06:16:06.813Z cpu48:32921)HBX: 270: Reclaimed heartbeat for volume 51c2a4c2-2ab2da33-3289-14feb5c67712 (VMFS-ESXT2-LUN13b): [Timeout] Offset 3825664
2015-10-02T08:52:33.618Z cpu32:32919)HBX: 270: Reclaimed heartbeat for volume 51c2a4c2-2ab2da33-3289-14feb5c67712 (VMFS-ESXT2-LUN13b): [Timeout] Offset 3825664
Note: These events indicate that heartbeats to a datastore timed out and had to be reclaimed, which is not normal behavior.
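To get an actual count of these events per volume, a sketch like the following should work, assuming the volume UUID and label are the eighth and ninth space-separated fields as in the sample lines above:
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "Reclaimed heartbeat" | cut -d ' ' -f 8,9 | sort | uniq -c; done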
Generic filter for Heartbeat related events
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "HB"; done | less
Sample Output
2015-10-03T06:33:57.661Z cpu27:33785)HBX: 1930: ATS Miscompare detected between test and set HB images at offset 3825664 on vol '51c2a4c2-2ab2da33-3289-14feb5c67712'.
2015-10-03T06:33:57.661Z cpu27:33785)HBX: 1931: Test version: [HB state abcdef02 offset 3825664 gen 3125 stampUS 8070728625033 uuid 559450dc-b4b36329-0bbe-d4ae52e9ddda jrnl <FB 497889> drv 14.60]
2015-10-03T06:33:57.661Z cpu27:33785)HBX: 1932: Set version: [HB state abcdef02 offset 3825664 gen 3125 stampUS 8070731625055 uuid 559450dc-b4b36329-0bbe-d4ae52e9ddda jrnl <FB 497889> drv 14.60]
2015-10-03T06:33:57.762Z cpu43:32919)HBX: 270: Reclaimed heartbeat for volume 51c2a4c2-2ab2da33-3289-14feb5c67712 (VMFS-ESXT2-LUN13b): [Timeout] Offset 3825664
2015-10-03T06:33:57.762Z cpu43:32919)[HB state abcdef02 offset 3825664 gen 3125 stampUS 8070731734593 uuid 559450dc-b4b36329-0bbe-d4ae52e9ddda jrnl <FB 497889> drv 14.60]
2015-10-03T09:15:51.662Z cpu54:33785)HBX: 1930: ATS Miscompare detected between test and set HB images at offset 3825664 on vol '51c2a4c2-2ab2da33-3289-14feb5c67712'.
Note: I'm still researching what some of these messages mean. The VMware KBs and web searches in general turn up very limited information on exactly what these messages indicate.
FC Blocked
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep 'blocked FC'; done
Note: These errors can occur when multiple FC paths are down or when the HBA and LUN queues fill up. Examples where this can occur are an FC array-level failure or multiple LUN trespasses on an ALUA-based array.
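A quick way to gauge how often this has happened is to turn the same filter into a count, in the same style as the latency count earlier:
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep 'blocked FC'; done | wc -l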
Bus reset issued at ESX host level
less /var/log/vmkernel.log | grep -i "failed H:0x8"
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "H:0x8"; done
Note: This is an interesting status code to watch for. It means that the host has determined that there is a serious problem with the LUN in question and has issued a SCSI bus reset command to it in an attempt to bring it back to normal operation.
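To count these per device, a sketch like the following should work, assuming the same failed-command format shown earlier, where the quoted device name is the thirteenth space-separated field:
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "failed H:0x8" | cut -d ' ' -f 13 | sort | uniq -c; done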
VSCSI reset request
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep 'Reset request '; done
Sample Output
2015-09-27T16:05:24.476Z cpu34:7822793)VSCSI: 2447: handle 8502(vscsi0:1):Reset request on FSS handle 1413303666 (0 outstanding commands) from (vmm0:ESXVM2010.somedomain.com)
Note: This indicates that a SCSI reset request was submitted by the VM itself to one of its LUNs, on either an RDM or a datastore. The conditions under which the LUN reset is issued are OS-dependent, but Windows 2003 and above will issue LUN reset commands if too many SCSI commands queue up within the OS and are not acknowledged within a set amount of time. In the case of Windows 2008 and above, this takes several minutes to trigger, so it clearly indicates a serious problem.
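To count reset requests per VM, a sketch like the following should work, assuming the issuing VM name is the last space-separated field as in the sample line above:
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep 'Reset request ' | awk '{print $NF}' | sort | uniq -c; done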
VOBD Log errors
Generic "lost" errors
cat /var/log/vobd.log | grep -i "lost"
for i in '/var/run/log/vobd*.gz'; do zcat $i | grep -i "lost"; done
Sample Output
2015-04-18T01:36:28.187Z: [vmfsCorrelator] 6479770535964us: [vob.vmfs.ats.support.lost] ATS-Only VMFS volume 'VMFS-ESXT2-LUN02' is not mounted. This host does not support ATS or ATS initialization has failed.
2015-02-09T15:05:25.285Z: [scsiCorrelator] 653098960126us: [esx.problem.storage.redundancy.lost] Lost path redundancy to storage device naa.60000480000834600113533036414143. Path vmhba2:C0:T0:L56 is down. Affected datastores: Unknown.
2015-02-09T15:05:25.290Z: [scsiCorrelator] 653098965819us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.60000480000234600112533036414143. Path vmhba4:C0:T0:L56 is down. Affected datastores: Unknown.
Note: The top ATS-related error is mentioned in the article further up in this posting about the issue with ATS heartbeats on ESXi 5.5 Update 2.
Specific "lost" errors
for i in '/var/run/log/vobd*.gz'; do zcat $i | grep -i "lost connectivity"; done
for i in '/var/run/log/vobd*.gz'; do zcat $i | grep -i "lost path redundancy"; done
for i in '/var/run/log/vobd*.gz'; do zcat $i | grep -i "ATS initialization has failed"; done
Find LUNs that have been "unmapped"
cat /var/log/vmkernel.log | grep "unmapped" | cut
-d ' ' -f 6 | sort | uniq -c
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "unmapped" | cut -d ' ' -f 6 | sort | uniq -c; done
Find LUNs marked permanently inaccessible
cat /var/log/vmkernel.log | grep "permanently
inaccessible" | cut -d ' ' -f 6 | sort | uniq -c
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "permanently inaccessible" | cut -d ' ' -f 6 | sort | uniq -c; done
Sample Output
2015-09-29T09:26:45.442Z cpu18:21545480)WARNING: ScsiDevice: 1503: Device naa.600604500eb03200559c81366367e511 has been plugged back in after being marked permanently inaccessible. No data consistency guarantees
Dropped frames on the HBA (assumes QLogic)
for i in '/proc/scsi/qla2xxx/*'; do cat $i | grep retried; done
for i in '/usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -a'; do $i | grep retried; done
Sample Output
Commands retried
with dropped frame(s) = 0
Commands retried
with dropped frame(s) = 200
Commands retried
with dropped frame(s) = 0
Commands retried
with dropped frame(s) = 1
Note: This particular example only works for QLogic-based HBAs; I'm working on an equivalent for Emulex. These counters should always be zero. Anything other than zero indicates a problem at the hardware level, since the frame drops occur at the physical layer. Possible causes are bad cables, failing GBICs, or other hardware issues.
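Since these are cumulative counters, what matters in a live situation is whether they are still growing. A simple sampling loop makes that visible; this is a sketch using the same vmkmgmt_keyval command as above, with the interval adjustable to taste:
while true; do date; /usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -a | grep retried; sleep 60; done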