“Having a SCSI ID is a f*cking idiotic thing to do.”
…and after the amount of time I’ve wasted getting XenServer to play nicely with LIO iSCSI failover I tend to agree.
The Problem
One oddity of Xen / XenServer’s storage subsystem is that it identifies iSCSI storage repositories via a calculated SCSI ID rather than the iSCSI Serial - which would be the sane thing to do.
Citrix’s less-than-ideal take on dealing with SCSI ID changes is for you to take your VMs offline, disconnect the storage repositories, recreate them, then go through all your VMs and re-attach their orphaned disks, hoping that you remembered to add some sort of hint as to which VM they belong to, and finally wipe the sweat and tears from your face.
From CTX11641 - ‘How to Identify If SCSI Storage Repository has Changed SCSI IDs’:
“The SCSI ID of the logical unit number (LUN) changed. When this happened, the iSCSI storage repository became unplugged after a XenServer reboot.” … “To correct the issue you must recreate a PBD with the entry to reflect the right SCSI ID.”
The Solution
A big thank you to Nicholas A. Bellinger from the Kernel SCSI mailing list who helped me a lot in this thread where he explained:
“The Company ID, VSI, and VSIE are generated by LIO based upon the current vpd_unit_serial configfs attribute value. So as long as vpd_unit_serial is persistent, and the same value for backend devices across export failover to different nodes, Xen will always see the same EVPD information.”
An example SCSI ID of 0x6001405bff3f42a49d84cfcb64e2b933 would thus be comprised of:
- NAA 6, IEEE Company_id: 0x1405
- Vendor Specific Identifier: 0xbff3f42a4
- Vendor Specific Identifier Extension: 0x9d84cfcb64e2b933
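To make that breakdown easier to repeat for your own LUNs, here is a minimal Python sketch that slices a 32-digit NAA type 6 identifier into the same fields. The function name split_naa6 is my own and is not part of any Xen or LIO tooling. The field widths (1 hex digit for the NAA type, 6 for the IEEE company ID, 9 for the VSI, 16 for the VSIE) are taken from the example above, so the company ID comes back zero-padded as 001405.

```python
def split_naa6(scsi_id):
    """Split a 128-bit NAA type 6 identifier into its four fields.

    Hypothetical helper for illustration only. Field widths in hex
    digits: NAA type 1, IEEE company ID 6, VSI 9, VSIE 16.
    """
    digits = scsi_id.strip().lower()
    if digits.startswith("0x"):
        digits = digits[2:]
    if len(digits) != 32:
        raise ValueError("expected 32 hex digits, got %d" % len(digits))
    int(digits, 16)  # raises ValueError if anything is not a hex digit
    if digits[0] != "6":
        raise ValueError("not an NAA type 6 identifier")
    return {
        "naa": digits[0],
        "company_id": digits[1:7],
        "vsi": digits[7:16],
        "vsie": digits[16:],
    }


if __name__ == "__main__":
    for name, value in split_naa6("0x6001405bff3f42a49d84cfcb64e2b933").items():
        print(name, "0x" + value)
```

If the company ID is the same but the VSI or VSIE differs after a failover, that points at the unit serial having changed on the new node.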
In addition to the vpd_unit_serial, we found that the iblock number must also remain the same between failovers.
/sys/kernel/config/target/core/iblock_0/lun_name/wwn/vpd_unit_serial
/sys/kernel/config/target/core # tree
├── iblock_0 # Must be consistent between failovers
│   └── iscsi_lun_r2
│       └── wwn
│           └── vpd_unit_serial # Must be consistent between failovers
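A quick way to check both values is to snapshot them on whichever node currently holds the target, fail over, snapshot again, and compare. The following is a rough Python sketch of that idea. The function names collect_unit_serials and diff_nodes are hypothetical, and I am assuming the vpd_unit_serial file contains a label followed by ": " and then the serial; if your kernel formats it differently, adjust the parsing.

```python
import os


def collect_unit_serials(core_root="/sys/kernel/config/target/core"):
    """Map (iblock directory, LUN name) to the serial found in configfs.

    Hypothetical helper. Assumes each vpd_unit_serial file holds text
    such as "<label>: <serial>"; everything after the last ": " is kept.
    """
    serials = {}
    for backstore in sorted(os.listdir(core_root)):
        if not backstore.startswith("iblock_"):
            continue
        backstore_dir = os.path.join(core_root, backstore)
        for lun_name in sorted(os.listdir(backstore_dir)):
            serial_file = os.path.join(
                backstore_dir, lun_name, "wwn", "vpd_unit_serial"
            )
            if not os.path.isfile(serial_file):
                continue  # plain attribute files in the iblock directory
            with open(serial_file) as handle:
                text = handle.read().strip()
            serials[(backstore, lun_name)] = text.rsplit(": ", 1)[-1]
    return serials


def diff_nodes(before, after):
    """Return entries whose iblock directory or serial differs."""
    keys = set(before) | set(after)
    return {
        key: (before.get(key), after.get(key))
        for key in keys
        if before.get(key) != after.get(key)
    }
```

An empty result from diff_nodes means the iblock number and serial matched across the failover. Anything else lists the mismatching entries with their before and after values, where None means the entry was missing on that side.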
If you’re using Corosync / Pacemaker for your target failover, the vpd_unit_serial and iblock number must both be set in the iSCSILogicalUnit OCF resource agent:
- The iblock number is configured with lio_iblock=<number> - See iSCSILogicalUnit#L217
- The vpd_unit_serial is configured with scsi_sn=<number> - See iSCSILogicalUnit#L138
Here is an example of a target and LUN configured with pcs:
Resource: iscsi_target_r2 (class=ocf provider=heartbeat type=iSCSITarget)
 Attributes: iqn=iqn.2003-01.org.linux-iscsi.pm-san.x8664:sn.ca7d7b33c731 portals=10.50.42.75:3260 implementation=lio-t additional_parameters="MaxConnections=100 AuthMethod=None InitialR2T=No MaxOutstandingR2T=64"
 Operations: monitor on-fail=restart interval=30s timeout=20s (iscsi_target_r2-monitor-30s)
             start on-fail=restart interval=0 timeout=20s (iscsi_target_r2-start-0)
             stop on-fail=restart interval=0 timeout=20s (iscsi_target_r2-stop-0)
Resource: iscsi_lun_r2 (class=ocf provider=heartbeat type=iSCSILogicalUnit)
 Attributes: target_iqn=iqn.2003-01.org.linux-iscsi.pm-san.x8664:sn.ca7d7b33c731 scsi_sn=633c5643 lun=1 lio_iblock=2 path=/dev/drbd2 allowed_initiators="iqn.2015-05.com.example:51e1fb93" implementation=lio-t
 Operations: monitor on-fail=restart interval=30s timeout=10s (iscsi_lun_r2-monitor-30s)
             start on-fail=restart interval=0 timeout=20s (iscsi_lun_r2-start-0)
             stop on-fail=restart interval=0 timeout=20s (iscsi_lun_r2-stop-0)
Resource: iscsi_conf_r2 (class=ocf provider=heartbeat type=anything)
 Attributes: binfile=/usr/sbin/iscsi_iscsi_conf_r2.sh stop_timeout=3
If you happen to be using Puppet for your Pacemaker configuration it might look a bit like this:
cs_primitive { $iscsi_target_primitive:
  primitive_class => 'ocf',
  primitive_type  => 'iSCSITarget',
  provided_by     => 'heartbeat',
  parameters      => {
    'iqn'                   => $iscsi_iqn,
    'portals'               => "${iscsi_vip}:3260",
    'implementation'        => 'lio-t',
    'additional_parameters' => 'MaxConnections=100 AuthMethod=None InitialR2T=No MaxOutstandingR2T=64',
  },
  operations      => {
    'monitor' => { 'timeout' => '20s', 'interval' => '30s', 'on-fail' => 'restart' },
    'start'   => { 'timeout' => '20s', 'interval' => '0', 'on-fail' => 'restart' },
    'stop'    => { 'timeout' => '20s', 'interval' => '0', 'on-fail' => 'restart' },
  },
  require         => [Cs_primitive[$ip_primitive], Package['targetcli'], Service['pacemaker']],
}
cs_primitive { $iscsi_lun_primitive:
  primitive_class => 'ocf',
  primitive_type  => 'iSCSILogicalUnit',
  provided_by     => 'heartbeat',
  parameters      => {
    'target_iqn'         => $iscsi_iqn,
    # scsi_sn sets vpd_unit_serial; must be the same on every node
    'scsi_sn'            => $scsi_sn,
    'lun'                => '1',
    # lio_iblock sets the iblock number; must be the same on every node
    'lio_iblock'         => $lio_iblock,
    'path'               => $drbd_path,
    'allowed_initiators' => $allowed_initiators,
    'implementation'     => 'lio-t',
  },
  operations      => {
    'monitor' => { 'timeout' => '10s', 'interval' => '30s', 'on-fail' => 'restart' },
    'start'   => { 'timeout' => '20s', 'interval' => '0', 'on-fail' => 'restart' },
    'stop'    => { 'timeout' => '20s', 'interval' => '0', 'on-fail' => 'restart' },
  },
  require         => [Cs_primitive[$iscsi_target_primitive], Service['pacemaker']],
}