Wednesday, October 7, 2015

ESXi 5.x Console Scripts - Part 2 - Storage Related Queries


LUN Related Queries 

The following commands retrieve information about LUN details and LUN path configuration on the ESX host. The commands were selected for day-to-day usefulness: for example, getting a count of the various types of LUNs, the number of paths per HBA, the total number of paths, and so on.
 


List unique paths
esxcli storage nmp path list | grep fc | uniq

Count unique paths
esxcli storage nmp path list | grep fc | uniq | wc -l

Note: This number is important. With ESXi 5.x there is a limit of 1024 paths. All types of paths count against this value, whether they are due to additional HBA ports, FA ports, or otherwise. In large environments, it is quite easy to reach this limit.
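A quick back-of-the-envelope check shows how fast that limit is reached. This is only a sketch - the counts below are hypothetical placeholders, so substitute your own environment's numbers:

```shell
# Rough path-count estimate, assuming every LUN is presented on every
# HBA port through every zoned array (FA) port. All counts hypothetical.
luns=256        # LUNs presented to the host
hba_ports=2     # initiator (HBA) ports
fa_ports=2      # array (FA) ports zoned per HBA port
paths=$((luns * hba_ports * fa_ports))
echo "$paths"
```

With 256 LUNs over 2 HBA ports and 2 FA ports each, you are already at the 1024-path limit.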


List path status for each LUN
esxcfg-mpath -b |grep -B1 "fc Adapter"| grep -v -e "--" |sed 's/Adapter.*//'

Sample:

naa.6007898044789300d0e32c98a509f314 : DGC Fibre Channel Disk (naa.6007898044789300d0e32c98a509f314)
   vmhba2:C0:T9:L1 LUN:1 state:active fc
   vmhba4:C0:T8:L1 LUN:1 state:active fc
   vmhba4:C0:T9:L1 LUN:1 state:active fc
   vmhba2:C0:T8:L1 LUN:1 state:active fc
naa.60080831873001e007b9f33f83689df11 : DGC Fibre Channel Disk (naa.60080831873001e007b9f33f83689df11)
   vmhba5:C0:T1:L23 LUN:23 state:active fc

Path details for a LUN

esxcli storage nmp path list --device <device naa ID> |grep fc

Example:

esxcli storage nmp path list --device naa.6007898044789300d0e32c98a509f314 |grep fc

fc.30003441bd69284:21000043bd69284-fc.60893310bad02cdf:500819443ce02c6f-naa.6007898044789300d0e32c98a509f314
fc.30003441bd691ac:21000043bd691ac-fc.60893310bad02cdf:5008194e3ce02c6f-naa.6007898044789300d0e32c98a509f314
fc.30003441bd691ac:21000043bd691ac-fc.60893310bad02cdf:500819453ce02c6f-naa.6007898044789300d0e32c98a509f314
fc.30003441bd69284:21000043bd69284-fc.60893310bad02cdf:5008194c3ce02c6f-naa.6007898044789300d0e32c98a509f314


Path status for a LUN

esxcli storage nmp path list --device <device naa ID>

Example:

esxcli storage nmp path list --device naa.6007898044789300d0e32c98a509f314

fc.20000033bd69584:21000043bd69284-fc.50030160bad02cdf:500819443ce02c6f-naa.6007898044789300d0e32c98a509f314
   Runtime Name: vmhba2:C0:T8:L1
   Device: naa.6007898044789300d0e32c98a509f314
   Device Display Name: DGC Fibre Channel Disk (naa.6007898044789300d0e32c98a509f314)
   Group State: active
   Array Priority: 1
   Storage Array Type Path Config: {TPG_id=1,TPG_state=AO,RTP_id=5,RTP_health=UP}
   Path Selection Policy Path Config: PSP VMW_PSP_RR does not support path configuration.

fc.20000033bd6891ac:21000043be691ac-fc.50030118bad02cdf:5008194c3ce02c6f-naa.6007898044789300d0e32c98a509f314
   Runtime Name: vmhba4:C0:T9:L1
   Device: naa.6007898044789300d0e32c98a509f314
   Device Display Name: DGC Fibre Channel Disk (naa.6007898044789300d0e32c98a509f314)
   Group State: active unoptimized
   Array Priority: 0
   Storage Array Type Path Config: {TPG_id=2,TPG_state=ANO,RTP_id=14,RTP_health=UP}
   Path Selection Policy Path Config: PSP VMW_PSP_RR does not support path configuration.


Set Perennial Reservation

Example:

esxcli storage core device setconfig -d naa.6000b79c000317a10a4b8206d2825a83 --perennially-reserved=true

Find LUNs with and without Perennial Reservations

esxcli storage core device list | grep "Is Perennially Reserved: true" -B18 | grep naa
esxcli storage core device list | grep "Is Perennially Reserved: false" -B18 | grep naa
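The -B18 offset above assumes every device block in the `esxcli storage core device list` output has the same number of lines. A less position-sensitive variant (a sketch, run here against a trimmed, made-up sample of that output) keys each field line off the last device-name line seen instead:

```shell
# Remember the most recent unindented device-name line; print it when the
# "Is Perennially Reserved: true" field appears within that device's block.
awk '/^naa\./ {dev=$1}
     /Is Perennially Reserved: true/ {print dev}' <<'EOF'
naa.6000b79c000317a10a4b8206d2825a83
   Is Perennially Reserved: true
naa.6000b79c000317a10a4b8206d2825a84
   Is Perennially Reserved: false
EOF
```

On a host, pipe `esxcli storage core device list` into the same awk instead of the here-document.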

Find "offline" LUNs

esxcli storage core device list | grep "Is Offline: true" -B17 | grep naa

Find permanently detached devices (find permanently inaccessible LUNs)

esxcli storage core device detached list

Find "Status: unknown"

esxcli storage core device list | grep -v "Thin" | grep  "Status: unknown" -B12 | grep naa

Total LUN count

esxcli storage core device list | grep Vendor | wc -l

LUN Count by vendor name

Examples:
 
esxcli storage core device list | grep Vendor | grep EMC | wc -l
esxcli storage core device list | grep Vendor | grep DGC | wc -l

List Vendors

esxcli storage core device list | grep Vendor | sort | uniq

Sample Output:

   Vendor: DELL
   Vendor: DGC
   Vendor: DP
   Vendor: EMC
   Vendor: TSSTcorp
   Vendor: iDRAC
   Vendor: IBM

Count of LUNs per Vendor
 
esxcli storage core device list | grep Vendor | sort | uniq -c

Sample Output:

      1    Vendor: DELL
     54    Vendor: DGC
      1    Vendor: DP
    110    Vendor: EMC
     87    Vendor: IBM
      1    Vendor: TSSTcorp
      2    Vendor: iDRAC

Count paths per unique LUN

esxcli storage nmp path list | grep "fc\." | cut -d '-' -f3 | sort| uniq -c
  
Sample Output:

      3 naa.60080831873001e00e26d1945094fe311
      3 naa.60080831873001e00f2655ba63689df11
      3 naa.60080831873001e00f3655ba63689df11
      3 naa.60080831873001e00f4655ba63689df11
      4 naa.6006898025782100483e7b4bdbedf314
      4 naa.60068980257821005ceed1735c11f314

Count Total Number of paths

esxcli storage nmp path list | grep "fc\." | wc -l

Count Number of LUNs per HBA

esxcli storage nmp path list | grep "fc\." | cut -d '-' -f1 | sort| uniq -c

    174 fc.20000043bd691ac:21000033bd691ac
    160 fc.20000043bd691ad:21000033bd691ad
    130 fc.20000043bd69284:21000033bd69284

Count number of LUNs visible per FA (counts LUNs more than once)

esxcli storage nmp path list | grep "fc\." | cut -d '-' -f2 | sort| uniq -c

    110 fc.500009321498a000:500009321498a1dc
      1 fc.5000143047324698:50001442a0461500
      2 fc.5000143047324698:50001443a0461501
      1 fc.500014404770465c:50001442b0465c00
      2 fc.500014404770465c:50001442b0465c01
     22 fc.500143800200096a:500143810200096a
 
Note: Breakdown of this output. Each path name is made up of three fields separated by '-':

fc.20000043bd691ac:21000043bd691ac-fc.50081940bce02c69:500819463ce02c69-naa.60080831873001e00f2655ba63689df11

Field 1 - HBA (initiator port)
Field 2 - FA (array target port)
Field 3 - LUN (device naa ID)
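The three fields can be pulled apart with the same cut calls used throughout this post, shown here on one of the sample path names from above:

```shell
# Split a sample NMP path name into its HBA, FA, and LUN fields.
p='fc.20000043bd691ac:21000043bd691ac-fc.50081940bce02c69:500819463ce02c69-naa.60080831873001e00f2655ba63689df11'
echo "$p" | cut -d '-' -f1   # HBA (initiator)
echo "$p" | cut -d '-' -f2   # FA (array target port)
echo "$p" | cut -d '-' -f3   # LUN (device naa ID)
```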

Count of LUNs with a single path

Note: Does not detect multiple paths on a single HBA.

esxcli storage nmp path list | grep "fc\." | cut -d '-' -f3 | sort| uniq -c | grep "1 naa" | wc -l

List of LUNs with paths only on a single HBA

esxcli storage nmp path list | grep "fc\." | cut -d '-' -f1,3 | sort| uniq | cut -d '-' -f2 | sort | uniq -c | grep "1 naa"
esxcli storage nmp path list | grep "fc\." | cut -d '-' -f1,3 | sort| uniq | cut -d '-' -f2 | sort | uniq -c | grep "1 naa" | cut -d ' ' -f8

Count of LUNs with paths only on a single HBA

esxcli storage nmp path list | grep "fc\." | cut -d '-' -f1,3 | sort| uniq | cut -d '-' -f2 | sort | uniq -c | grep "1 naa" | wc -l
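To see what the single-HBA filter is doing, here is the same pipeline run on four made-up path names: naa.111 has paths on two HBAs, naa.222 only on one, so only naa.222 survives the final grep:

```shell
# Dedupe to unique (HBA, LUN) pairs, count HBAs per LUN, keep count of 1.
printf '%s\n' \
  'fc.hba1-fc.fa1-naa.111' \
  'fc.hba2-fc.fa1-naa.111' \
  'fc.hba1-fc.fa1-naa.222' \
  'fc.hba1-fc.fa2-naa.222' \
  | cut -d '-' -f1,3 | sort | uniq | cut -d '-' -f2 | sort | uniq -c | grep "1 naa"
```

Note how the two naa.222 paths collapse to one line after the first uniq, because both sit behind the same HBA.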
  

List HBAs

esxcli storage nmp path list | grep "fc\." |  cut -d '-' -f1 | sort | uniq

Sunday, October 4, 2015

ESXi 5.x Console Scripts - Part 1 - Storage events

This post is the first in a series on parsing the ESXi log files and using the ESX console - primarily parsing vmkernel logs.
It deals strictly with storage-related events that appear in the vmkernel and vobd logs. The parsing commands focus on
events that indicate some sort of problem - either a performance issue or a failure.

There are a number of tools out there that can analyze the various ESXi logs. However, I've found that dynamic situations
frequently arise that require a quick analysis of the logs for events that may indicate a problem. These parsing
scripts (one-line commands, really) are intended as a sort of "swiss army knife" you can use to slice and dice the logs for
particular types of errors or warnings.

What I find most useful is that some of the commands count the occurrences of a particular error or warning. If you have,
say, three storage-related warnings in your logs over the last month, it's probably nothing to worry about.
If you have 100 events, it may require further investigation. If you have 100 events for the same LUN, it probably requires
immediate investigation.

I hope you find some of these useful. They've been extremely useful for me over the last couple of years as I added to them
each time I encountered a new type of significant error or warning. Many of these commands may be valid for ESXi 6.0, but I
have not had the time to test them all, so please feel free to post your observations with ESXi 6.0 and above in the comments.

Latency in Logs

The following commands report on the properties of the LUNs reporting latency


High Latency Events in Current vmkernel log 

less /var/log/vmkernel.log | grep "I/O latency"

Sample Output
2015-10-04T06:50:46.103Z cpu48:33785)WARNING: ScsiDeviceIO: 1223: Device naa.600602600eb02200b24c5fb1a8b6e411 performance has deteriorated. I/O latency increased from average value of 3399 microseconds to 68621 microseconds.
2015-10-04T06:59:11.887Z cpu8:33783)WARNING: ScsiDeviceIO: 1223: Device naa.600602600eb02200b24c5fb1a8b6e411 performance has deteriorated. I/O latency increased from average value of 3400 microseconds to 141513 microseconds.
2015-10-04T07:25:47.971Z cpu42:33785)ScsiDeviceIO: 1203: Device naa.600602600eb02200b24c5fb1a8b6e411 performance has improved. I/O latency reduced from 141513 microseconds to 28281 microseconds.
2015-10-04T07:51:24.120Z cpu34:33783)ScsiDeviceIO: 1203: Device naa.600602600eb02200b24c5fb1a8b6e411 performance has improved. I/O latency reduced from 28281 microseconds to 6663 microseconds.
  
High Latency events in all logs including gzipped

for i in '/var/run/log/vmkernel*.gz'; do zcat $i  | grep "I/O latency";done | less

Sample Output
2015-10-03T12:05:09.793Z cpu32:33783)ScsiDeviceIO: 1203: Device naa.600009800003322100713533035413846 performance has improved. I/O latency reduced from 98388 microseconds to 19519 microseconds.
2015-10-03T12:06:01.464Z cpu49:33785)ScsiDeviceIO: 1203: Device naa.600009800003322100713533035413846 performance has improved. I/O latency reduced from 19519 microseconds to 9704 microseconds.
2015-10-02T08:10:31.272Z cpu50:33783)WARNING: ScsiDeviceIO: 1223: Device naa.600009800003322100713533033303533 performance has deteriorated. I/O latency increased from average value of 4331 microseconds to 86771 microseconds.
2015-10-02T08:14:57.857Z cpu51:33783)ScsiDeviceIO: 1203: Device naa.600009800003322100713533033303533 performance has improved. I/O latency reduced from 86771 microseconds to 17009 microseconds.
2015-10-02T08:16:00.673Z cpu47:33783)ScsiDeviceIO: 1203: Device naa.600009800003322100713533033303533 performance has improved. I/O latency reduced from 17009 microseconds to 8677 microseconds.


Count of latency events in all logs
for i in '/var/run/log/vmkernel*.gz'; do zcat $i  | grep "I/O latency";done | wc -l

Sample Output
99

Count of latency increase events per LUN ID in current log

cat /var/log/vmkernel.log | grep -i scsideviceio | grep naa | grep -v "performance has" | cut -d ' ' -f 13 | sort | uniq -c
      3 "naa.60000880000289202113533031463734"
      1 "naa.600009800003322100713533034434241"
      1 "naa.600009800003322100713533035314533"
      1 "naa.600009800003322100713533035413845"

Note: The grep -v filters out the "performance has improved" messages, leaving only the latency-increase warnings.
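`cut -d ' ' -f 13` depends on the exact field layout of the message. A less position-sensitive sketch is to match the naa token itself, shown here on two trimmed sample lines rather than a live vmkernel.log:

```shell
# Extract and count the naa device IDs directly, regardless of field position.
grep -o 'naa\.[0-9a-f]*' <<'EOF' | sort | uniq -c
2015-10-04T06:50:46.103Z cpu48:33785)WARNING: ScsiDeviceIO: 1223: Device naa.600602600eb02200b24c5fb1a8b6e411 performance has deteriorated.
2015-10-04T06:59:11.887Z cpu8:33783)WARNING: ScsiDeviceIO: 1223: Device naa.600602600eb02200b24c5fb1a8b6e411 performance has deteriorated.
EOF
```

On a host, feed it the output of the latency grep instead of the here-document.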

Count of latency increase events per LUN ID in all vmkernel logs

for i in '/var/run/log/vmkernel*.gz'; do zcat $i  | grep -i scsideviceio | grep naa.60000 | grep -v "performance has" | cut -d ' ' -f 13 | sort | uniq -c ;done
      2 "naa.60000880000289202113533030334436"
      5 "naa.60000880000289202113533030373445"
      2 "naa.60000880000289202113533030383341"
      69 "naa.60000880000289202113533030393644"
      2 "naa.60000880000289202113533031414345"

Note: In the following sections the commands to parse the current vmkernel log and all available vmkernel logs will be listed together.


SCSI Errors in Logs

The following commands report on the properties of the LUNs reporting SCSI command errors

Count of SCSI error per SCSI Opcode

cat /var/log/vmkernel.log | grep -i scsideviceio | grep naa | grep -v "performance has" | cut -d ' ' -f 5 | sort | uniq -c 
for i in '/var/run/log/vmkernel*.gz'; do zcat $i  | grep -i scsideviceio | grep naa | grep -v "performance has" | cut -d ' ' -f 5 | sort | uniq -c ;done

Sample Output

     38 0x28,
     26 0x2a,
    104 0x85,
     68 0xfe,

Where the first number is the count and the second is the SCSI Opcode.

Note: The SCSI Opcodes are based on the T10 standards. A few of the more common ones are listed here:

SCSI Opcode   Command Details
0x28          Read
0x2a          Write
0x85          ATA Pass-through
0x89          SCSI Compare and Write - ATS
0x41          WRITE SAME(10)
0x93          WRITE SAME(16)
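The table above can be turned into a small decoder function (a sketch, covering only the opcodes listed in this post) so the counted opcodes can be labeled in one pass:

```shell
# Map a SCSI opcode from the uniq -c output to its name, per the table above.
decode_opcode() {
  case "$1" in
    0x28) echo "Read" ;;
    0x2a) echo "Write" ;;
    0x85) echo "ATA Pass-through" ;;
    0x89) echo "SCSI Compare and Write - ATS" ;;
    0x41) echo "WRITE SAME(10)" ;;
    0x93) echo "WRITE SAME(16)" ;;
    *)    echo "unknown ($1)" ;;
  esac
}
decode_opcode 0x85
```

Note that the opcodes in the uniq -c output above carry a trailing comma, so strip it first (for example with tr -d ',') before passing them to the function.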

As for the 0xfe opcode, which I have observed often in the logs that I have analyzed, I have not been able to pin it down yet, but my suspicion is that it is some sort of VAAI specific Opcode for devices manufactured by EMC.

Exclude VAAI command errors

The following two filters are based on the assumption that 0xfe is a valid VAAI command Opcode

for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep -i scsideviceio | grep naa | grep -v "performance has" | grep -v "0xfe" |  cut -d ' ' -f 13 | sort | uniq -c;done

Sample Output

      2 "naa.60000880000289202113533030334436"
      2 "naa.60000880000289202113533030373445"
      2 "naa.60000880000289202113533030383341"
      2 "naa.60000880000289202113533030443130"
     32 "naa.60000880000289202113533030444537"
  
Include only VAAI command errors - view entries

for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep -i scsideviceio | grep naa | grep "0xfe" ;done

Sample Output

2015-10-03T06:33:57.661Z cpu27:33785)ScsiDeviceIO: 2325: Cmd(0x412ec369ed40) 0xfe, CmdSN 0x44b41a from world 32837 to dev "naa.60000880000289202113533031414145" failed H:0x0 D:0x2 P:0x5 Possible sense data: 0x2 0x3a 0x2.
2015-10-03T09:15:51.662Z cpu54:33785)ScsiDeviceIO: 2325: Cmd(0x413ac33a0080) 0xfe, CmdSN 0x44cb42 from world 32837 to dev "naa.60000880000289202113533031414145" failed H:0x0 D:0x2 P:0x5 Possible sense data: 0x5 0x20 0x0.
2015-10-03T10:00:57.663Z cpu49:33783)ScsiDeviceIO: 2325: Cmd(0x413ac373ea80) 0xfe, CmdSN 0x44d174 from world 32837 to dev "naa.60000880000289202113533031414145" failed H:0x0 D:0x2 P:0x5


Count only VAAI command errors per LUN

for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep -i scsideviceio | grep naa | grep -v "performance has" | grep "0xfe" |  cut -d ' ' -f 13 | sort | uniq -c;done

Sample Output

     37 "naa.60000880000289202113533030444537"
      4 "naa.60000880000289202113533031353639"
     27 "naa.60000880000289202113533031414145"

Count Errors Host/Device/Sense Code

for i in '/var/run/log/vmkernel*.gz'; do zcat $i  | grep "Valid sense" | cut -d ' ' -f 15-23 | sort | uniq -c ;done
for i in '/var/run/log/vmkernel*.gz'; do zcat $i  | grep "Possible sense" | cut -d ' ' -f 15-23 | sort | uniq -c ;done

Sample Output

   5109 H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
   8252 H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
     18 H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x3f 0xe.
      2 H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x44 0x0.
     37 H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0.
      4 permanently unavailable. H:0x0 D:0x2 P:0x0 Valid sense data: 0x5


Note: This one is particularly useful when combined with a SCSI sense code decoder, such as the one posted last week on this site or the one by Florian Grehl at Virten.net.
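To feed such a decoder, the host status byte and the sense triplet can be pulled out with grep -o. This is a sketch run against one of the counted lines from the sample output above:

```shell
# Extract the host status and the sense-data triplet from a counted line.
line='5109 H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.'
echo "$line" | grep -o 'H:0x[0-9a-f]*'
echo "$line" | grep -o 'sense data: 0x[0-9a-f]* 0x[0-9a-f]* 0x[0-9a-f]*'
```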

List which SCSI command Serial Numbers failed

for i in '/var/run/log/vmkernel*.gz'; do zcat $i  | grep "Valid sense" | cut -d ' ' -f 4,15-23 | sort ;done

Sample Output

Cmd(0x413cc81857c0) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Cmd(0x413cc8185cc0) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Cmd(0x413cc8185cc0) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Cmd(0x413cc8185cc0) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Cmd(0x413cc8186940) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Cmd(0x413cc8186940) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Cmd(0x413cc8186940) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Cmd(0x413cc8186bc0) H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Path permanently unavailable. H:0x0 D:0x2 P:0x0 Valid sense data: 0x5
Path permanently unavailable. H:0x0 D:0x2 P:0x0 Valid sense data: 0x5

Note: You can see from this that there are multiple occurrences of the same SCSI command serial number in the logs. I'm still working out the significance of this. Feel free to post in the comments if you have something to add.


ATS Miscompare

for i in '/var/log/vmkernel*.*'; do cat $i  | grep "ATS Miscompare";done | less
for i in '/var/run/log/vmkernel*.gz'; do zcat $i  | grep "ATS Miscompare";done | less

Sample output

2015-10-02T14:53:48.632Z cpu19:33783)HBX: 1930: ATS Miscompare detected between test and set HB images at offset 3825664 on vol '51c2a4c2-2ab2da33-3289-14feb5c67712'.
2015-10-01T08:01:12.557Z cpu48:33783)HBX: 1930: ATS Miscompare detected between test and set HB images at offset 3825664 on vol '51c2a4c2-2ab2da33-3289-14feb5c67712'.
2015-10-01T13:49:42.585Z cpu48:33783)HBX: 1930: ATS Miscompare detected between test and set HB images at offset 3825664 on vol '51c2a4c2-2ab2da33-3289-14feb5c67712'.

Note: This is helpful in parsing logs related to the recent KB articles about issues when using ESXi 5.5 Update 2, as noted in the article below and in many recent blog entries.



Count of "Reclaimed heartbeat" events

for i in '/var/log/vmkernel*.*'; do cat $i  | grep "Reclaimed heartbeat";done | less
for i in '/var/run/log/vmkernel*.gz'; do zcat $i  | grep "Reclaimed heartbeat";done | less

Sample Output
2015-10-03T10:00:57.868Z cpu56:32921)HBX: 270: Reclaimed heartbeat for volume 51c2a4c2-2ab2da33-3289-14feb5c67712 (VMFS-ESXT2-LUN13b): [Timeout] Offset 3825664
2015-10-02T06:16:06.813Z cpu48:32921)HBX: 270: Reclaimed heartbeat for volume 51c2a4c2-2ab2da33-3289-14feb5c67712 (VMFS-ESXT2-LUN13b): [Timeout] Offset 3825664
2015-10-02T08:52:33.618Z cpu32:32919)HBX: 270: Reclaimed heartbeat for volume 51c2a4c2-2ab2da33-3289-14feb5c67712 (VMFS-ESXT2-LUN13b): [Timeout] Offset 3825664

Note: These events indicate that heartbeats to a datastore are being retried, which is not normal behavior.

Generic filter for Heartbeat related events

for i in '/var/run/log/vmkernel*.gz'; do zcat $i  | grep "HB";done | less

Sample Output

2015-10-03T06:33:57.661Z cpu27:33785)HBX: 1930: ATS Miscompare detected between test and set HB images at offset 3825664 on vol '51c2a4c2-2ab2da33-3289-14feb5c67712'.
2015-10-03T06:33:57.661Z cpu27:33785)HBX: 1931: Test version: [HB state abcdef02 offset 3825664 gen 3125 stampUS 8070728625033 uuid 559450dc-b4b36329-0bbe-d4ae52e9ddda jrnl <FB 497889> drv 14.60]
2015-10-03T06:33:57.661Z cpu27:33785)HBX: 1932: Set version: [HB state abcdef02 offset 3825664 gen 3125 stampUS 8070731625055 uuid 559450dc-b4b36329-0bbe-d4ae52e9ddda jrnl <FB 497889> drv 14.60]
2015-10-03T06:33:57.762Z cpu43:32919)HBX: 270: Reclaimed heartbeat for volume 51c2a4c2-2ab2da33-3289-14feb5c67712 (VMFS-ESXT2-LUN13b): [Timeout] Offset 3825664
2015-10-03T06:33:57.762Z cpu43:32919)[HB state abcdef02 offset 3825664 gen 3125 stampUS 8070731734593 uuid 559450dc-b4b36329-0bbe-d4ae52e9ddda jrnl <FB 497889> drv 14.60]
2015-10-03T09:15:51.662Z cpu54:33785)HBX: 1930: ATS Miscompare detected between test and set HB images at offset 3825664 on vol '51c2a4c2-2ab2da33-3289-14feb5c67712'.

Note: I'm still researching what some of these messages mean. The VMWare KBs and web searches in general turn up very limited information on exactly what they indicate.

FC Blocked

for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep 'blocked FC';done

Note: These errors can occur when there is a problem with multiple FC paths being down or if the HBA and LUN queues fill up. An example where this can occur is an FC array level failure or multiple LUN trespasses on an ALUA based array.


Bus reset issued at ESX host level

less /var/log/vmkernel.log | grep -i "failed H:0x8"
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "H:0x8";done

Note: This is an interesting SCSI sense code to watch for. It means that the host has determined there is a serious problem with the LUN in question and has issued a SCSI bus reset command to it in an attempt to bring it back to normal operation.

VSCSI reset request

for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep 'Reset request ';done

Sample Output

2015-09-27T16:05:24.476Z cpu34:7822793)VSCSI: 2447: handle 8502(vscsi0:1):Reset request on FSS handle 1413303666 (0 outstanding commands) from (vmm0:ESXVM2010.somedomain.com)

Note: This indicates that a SCSI reset request was submitted by the VM itself to one of its LUNs, on either an RDM or a datastore. The conditions under which the LUN reset is issued are OS dependent, but Windows 2003 and above will issue LUN reset commands if too many SCSI commands queue up within the OS and are not acknowledged within a set amount of time. In the case of Windows 2008 and above, this takes several minutes to trigger, so it clearly indicates a serious problem.


VOBD Log errors

Generic "lost" errors

cat /var/log/vobd.log |  grep -i "lost"
for i in '/var/run/log/vobd*.gz'; do zcat $i  | grep -i "lost";done

Sample Output

2015-04-18T01:36:28.187Z: [vmfsCorrelator] 6479770535964us: [vob.vmfs.ats.support.lost] ATS-Only VMFS volume 'VMFS-ESXT2-LUN02' is not mounted. This host does not support ATS or ATS initialization has failed.
2015-02-09T15:05:25.285Z: [scsiCorrelator] 653098960126us: [esx.problem.storage.redundancy.lost] Lost path redundancy to storage device naa.60000480000834600113533036414143. Path vmhba2:C0:T0:L56 is down. Affected datastores: Unknown.
2015-02-09T15:05:25.290Z: [scsiCorrelator] 653098965819us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.60000480000234600112533036414143. Path vmhba4:C0:T0:L56 is down. Affected datastores: Unknown.

Note: The top ATS related error is mentioned in the article further up in this posting about the issue with ATS heartbeats on ESXi 5.5 Update 2.

Specific "lost" errors

for i in '/var/run/log/vobd*.gz'; do zcat $i  | grep -i "lost connectivity";done
for i in '/var/run/log/vobd*.gz'; do zcat $i  | grep -i "lost path redundancy";done
for i in '/var/run/log/vobd*.gz'; do zcat $i  | grep -i "ATS initialization has failed";done

Find LUNs that have been "unmapped"

cat /var/log/vmkernel.log | grep "unmapped" | cut -d ' ' -f 6 | sort | uniq -c
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "unmapped" | cut -d ' ' -f 6 | sort | uniq -c;done


Find LUNs marked permanently inaccessible

cat /var/log/vmkernel.log | grep "permanently inaccessible" | cut -d ' ' -f 6 | sort | uniq -c
for i in '/var/run/log/vmkernel*.gz'; do zcat $i | grep "permanently inaccessible" | cut -d ' ' -f 6 | sort | uniq -c;done

Sample Output

2015-09-29T09:26:45.442Z cpu18:21545480)WARNING: ScsiDevice: 1503: Device naa.600604500eb03200559c81366367e511 has been plugged back in after being marked permanently inaccessible. No data consistency guarantees


Dropped frames on the HBA (assumes Qlogic)

for i in '/proc/scsi/qla2xxx/*'; do cat $i|grep retried;done

/usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -a | grep retried

Sample Output

Commands retried with dropped frame(s) = 0
Commands retried with dropped frame(s) = 200
Commands retried with dropped frame(s) = 0
Commands retried with dropped frame(s) = 1

Note: This particular example only works for Qlogic based HBAs. I'm working on an equivalent for Emulex. Basically, though, these counters should always be zero.
Anything other than zero indicates a problem at the hardware level, since the frame drops occur at the physical layer. Possible causes are bad cables, failing GBICs, or other hardware issues.
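If there are many ports, the per-port retry counters can be summed in one pass. This is a sketch run against the sample output above rather than the live /proc nodes:

```shell
# Sum the "Commands retried with dropped frame(s)" counters across ports.
awk -F'= ' '/retried with dropped frame/ {sum += $2} END {print sum}' <<'EOF'
Commands retried with dropped frame(s) = 0
Commands retried with dropped frame(s) = 200
Commands retried with dropped frame(s) = 0
Commands retried with dropped frame(s) = 1
EOF
```

A non-zero total means at least one port is dropping frames and warrants a closer per-port look.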

Thursday, October 1, 2015

ESXTOP Capture scripts with various sampling rates and intervals



There are many sources on the web for performing ESXTOP captures. What I found in my experience, however, was that I would often have to create multiple captures on several ESX hosts at once or captures that would run for hours at a time. Of course CRON can be used to perform scheduled captures, but if you are experiencing a problem and want to kick off an ESXTOP capture on the spot, it's helpful to have something you can just copy and paste. Also, it's helpful to have information such as the host name the capture was run on and the time that it started and stopped easily determined.

Here are some ESXTOP capture scripts that can be run across multiple hosts at the same time while logging the output to the same location. The logs are saved with the host name in the file name, so when run across multiple hosts each one clearly identifies (1) the host it ran on, (2) the start time of the capture, and (3) the interval and length of the capture.

Example 1 - Short ESXTOP captures across a few hours, higher detail
Sampling interval - 10 seconds
Capture length per file - 15 mins
# of Captures - 48
Total # of samples  = 90*48 = 4,320
Total esxtop capture interval across all logs - 12 hours (48 logs at 15 min each)

for i in $(seq 1 48);do esxtop -a -b -d 10 -n 90 > /vmfs/volumes/VMFS-T2-LUN10/ESXTOP/esxtop-$(date +%m-%d-%H%M%S)-$(hostname)-15min-10s.csv;done

Example 2 - Longer ESXTOP captures across a few hours, moderate detail
Sampling interval - 30 seconds
Capture length per file - 1 hour
# of Captures - 12
Total  # of samples = 120*12 = 1,440
Total esxtop capture interval across all logs - 12 hours (12 logs at 1 hour each)

for i in $(seq 1 12);do esxtop -a -b -d 30 -n 120 > /vmfs/volumes/VMFS-T2-LUN10/ESXTOP/esxtop-$(date +%m-%d-%H%M%S)-$(hostname)-1Hour-30s.csv;done

 Of course you can create endless variations on this to adjust it to your needs.

Sample output format

The filenames will look similar to the format below. You can see how the naming clearly shows the details of the capture.

esxtop-04-07-004631-ESX01.dir.svc.accenture.com-15min-10s.csv

This is quite useful when you have a series of capture logs as the commands above will create.

Usage

It is often useful to capture at short intervals during the start of a problem but also sometimes required to run ESXTOP captures for longer periods of hours or days. With a few of these commands on hand for different levels of sampling detail, you can run both in parallel - for example a detailed capture for 10 second interval samples for 2 hours during the start of a problem and then a longer capture for the next 24 hours with less detail at 30 second interval samples. I often will start a short interval capture on one SSH session on a host and then in parallel a longer interval capture on another SSH session. The short one will capture the immediate issue and the longer one will capture any longer term issue that may show trends over a period of several hours.

Size of the ESXTOP capture logs

ESXTOP capture logs vary widely in size depending on the number of objects on the host. The only sure way to estimate the size of the captures is to run an esxtop capture as above on a host, record the average size per sample, and extrapolate from there. In a large environment, a single ESXTOP sample may be on the order of 4.8 MB. In that case, for Example 1, each log file would be 4.8 MB * 90 = 432 MB, and the total space used by the entire script would be 4.8 * 4320 = 20,736 MB, or about 20.7 GB. Since these files are most easily analyzed in Windows perfmon, a manageable size is in the range of 200 - 300 MB per log file, so some adjustments may be needed. If desired, the output can be piped to gzip and compressed on the fly, but this creates some CPU overhead on the host.
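The arithmetic above can be kept on hand as a one-liner. This is a sketch; the 4.8 MB per-sample figure is the assumed value from this paragraph, so replace it with your own measured average:

```shell
# Size estimate for Example 1: 4.8 MB/sample (assumed), 90 samples/log, 48 logs.
awk 'BEGIN {
  per_sample = 4.8              # MB per esxtop sample (measure your own)
  per_log    = per_sample * 90  # samples per log file
  total      = per_log * 48     # number of log files
  printf "%.0f MB per log, %.0f MB total\n", per_log, total
}'
```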

Other considerations

If the ESXTOP captures are initiated from an SSH session, they are susceptible to any timeouts or interruptions in that session, so it may be desirable to run the scripts from a console session on the host. It is still possible to run scripts in parallel there by using the "&" operator to run two commands at once, or by creating a bash script, but careful consideration is needed since the console will be tied up running these commands. The commands can be halted at any time with Break or CTRL+C.

CRON Jobs

These scripts can be combined with CRON jobs on ESX to run on a scheduled basis. There are VMWare KB articles on how this can be done.


Tuesday, September 29, 2015

VMWare ESXi 5.x and 6.x SCSI Sense Decoder in Excel


 This posting is for VMWare vSphere/ESXi administrators using ESXi 5.x and ESXi 6.x. If, like me, you have been wondering about the meaning of the various SCSI error codes that you may observe in the ESXi vmkernel.log file, then this is for you. I'm sure there are other tools out there that can do something similar, and there are some VMWare KB articles which describe how to decode some of the error codes, but frankly it's exceedingly tedious to perform it manually for more than a couple entries. Wouldn't it be nice if you could just copy and paste out one of those entries, such as the ones below and have it converted into "human readable" format? Now you can.

2015-04-04T21:07:30.305Z cpu3:2050)ScsiDeviceIO: 2319: Cmd(0x4124084300) 0x12, CmdSN 0x55 to dev "naa.60000970000335800219533583923245" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0

Dec 6  20:44:24 esx505 vmkernel: 3:11:39:55.071 cpu4:4200)ScsiDeviceIO: 1548: Command 0x2a to device "naa.60000970000348700483133031384245" failed H:0xc D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0
 
So the first instance decodes as follows:

Host       0x0    NO error
Device     0x2    CHECK CONDITION
Plugin     0x0    No error
Sense Key  0x5    ILLEGAL REQUEST
ASC        0x24   (24/00)
ASCQ       0x0


 
The second one decodes like this:

Host       0xc    VMK_SCSI_HOST_RETRY*
Device     0x0    GOOD
Plugin     0x0    No error
Sense Key  0x5    ILLEGAL REQUEST
ASC        0x24   (24/00)
ASCQ       0x0



I've cobbled together an Excel document where you can just paste in the text of the SCSI error code from the log entry, and it decodes as shown above. It's not particularly elegant - it still requires some polishing - but for now, it gets the job done.

The document contains SCSI lookup tables from the T10 standards, and I am not aware of any restrictions on storing that information in a document such as this, but if there are any, do let me know.

The spreadsheet is available from the following link:

SCSI_Error_Code_Table_ESXi_v3a.xlsx


Edit: I noticed that Florian Grehl has a very nice online version of a decoder as well. Feel free to use whichever one suits you best. I'm certainly interested if you find any different results.