Flash Storage and SSD Failure Rate Update (March 2018)

It has been almost three years since my open source storage project went into production. In that time it has been running 24/7, serving as highly available solid state storage for hundreds of VMs and several virtualisation clusters across our two main sites. I’m happy to report that the clusters have been operating very successfully since their inception. Since moving away from proprietary ‘black box’ vendor SANs, we haven’t had a single SAN issue or storage outage. ...

March 20, 2018 · 2 min · 326 words · Sam McLeod

Talk - Clustered, Distributed File and Volume Storage with GlusterFS

Using GlusterFS to provide volume storage to Kubernetes as a replacement for our existing file and static content hosting. This talk was given at Infracoders on Tuesday 14th November 2017. NOTE: The link to the slides below is currently broken - will fix soon! (03/08/2019) Click below to view the slides (PDF version): Direct download link

November 14, 2017 · 1 min · 52 words · Sam McLeod

GlusterFS

We’re in the process of shifting from our custom ‘glue’ for orchestrating Docker deployments to Kubernetes. When we first deployed Docker to replace LXC and our legacy, Puppet-heavy application configuration and deployment systems, there really wasn’t any existing tool to manage this, so we rolled our own - mainly a few Ruby scripts combined with a Puppet / Hiera / Mcollective driven workflow. The main objective is to replace our legacy NFS file servers used to host uploads / attachments and static files for our web applications. While NFS(v4) performance is adequate, it is a clear single point of failure and, of course, there are the age-old stale mount problems should network interruptions occur. ...
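The excerpt above cuts off before the implementation details; as a rough, hedged sketch of the kind of replicated GlusterFS volume that can replace a single NFS server for static content (the hostnames, brick paths and volume name below are illustrative assumptions, not taken from the post):

```shell
# Illustrative three-node replicated volume - substitute your own
# hostnames (gfs1-3) and brick paths.
gluster peer probe gfs2
gluster peer probe gfs3

# replica 3 removes the single point of failure an NFS server presents.
gluster volume create uploads replica 3 \
  gfs1:/data/brick1/uploads \
  gfs2:/data/brick1/uploads \
  gfs3:/data/brick1/uploads
gluster volume start uploads

# Clients use the native FUSE client; any node can serve the mount since
# the client learns the full volume topology at mount time.
mount -t glusterfs gfs1:/uploads /mnt/uploads
```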

September 25, 2017 · 6 min · 1106 words · Sam McLeod

Update Delayed Serial STONITH Design

note: This is a follow-up post to 2015-07-21-rcd-stonith. A Linux Cluster Base STONITH provider for use with modern Pacemaker clusters. This has since been accepted and merged into Fedora’s code base and as such will make its way to RHEL. Source Code: Github Diptrace CAD Design: Github I have open sourced the CAD circuit design and made it available within this repo under CAD Design and Schematics. Related RedHat Bug: https://bugzilla.redhat.com/show_bug.cgi?id=1240868 v1 vs v2/v3 versions of the rcd_serial STONITH system The v2/v3 cables include the following improvements: ...

July 4, 2016 · 2 min · 217 words · Sam McLeod

Benchmarking IO with FIO

This is a quick tl;dr - there are many other situations and options you could consider (see the FIO man page). IOP/s = input or output operations per second. Throughput = how many MB/s you can read/write continuously. Variables worth tuning based on your situation: --iodepth - the iodepth is very dependent on your hardware. Rotational drives without much cache and with high latency (i.e. desktop SATA drives) will not benefit from a large iodepth; values between 16 and 64 could be sensible. ...
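As an illustrative sketch of the flags discussed above (the file path, size and iodepth are assumptions to adjust for your own hardware, not values from the post):

```shell
# Hypothetical 4k random-read test; tune --iodepth (e.g. 16-64 for SATA
# rotational drives, much higher for NVMe) and --size to suit your setup.
fio --name=randread-test \
    --filename=/mnt/test/fio.dat \
    --ioengine=libaio --direct=1 --gtod_reduce=1 \
    --bs=4k --iodepth=32 --size=2G \
    --readwrite=randread
```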

April 29, 2016 · 2 min · 393 words · Sam McLeod

Fix XenServer SR with corrupt or invalid metadata

If a disk / VDI is orphaned or only partially deleted, you’ll notice that under the SR it’s not assigned to any VM. This can cause issues that look like metadata corruption, resulting in the inability to migrate VMs or edit storage. For example: [root@xenserver-host ~]# xe vdi-destroy uuid=6c2cd848-ac0e-441c-9cd6-9865fca7fe8b Error code: SR_BACKEND_FAILURE_181 Error parameters: , Error in Metadata volume operation for SR. [opterr=VDI delete operation failed for parameters: /dev/VG_XenStorage-3ae1df17-06ee-7202-eb92-72c266134e16/MGT, 6c2cd848-ac0e-441c-9cd6-9865fca7fe8b. Error: Failed to write file with params [3, 0, 512, 512]. Error: 5], Removing stale VDIs: To fix this, you need to remove those VDIs from the SR after first deleting the logical volume: ...
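The excerpt ends before the commands themselves; as a hedged sketch of the general shape of the fix (the UUIDs below are the ones from the error above, and the VHD-&lt;uuid&gt; logical volume naming is an assumption that depends on your SR type - verify with lvs first):

```shell
# Placeholders taken from the error output above - substitute your own.
SR_UUID=3ae1df17-06ee-7202-eb92-72c266134e16
VDI_UUID=6c2cd848-ac0e-441c-9cd6-9865fca7fe8b

# Find and remove the stale logical volume backing the orphaned VDI
# (LVM-backed SRs typically name volumes VHD-<vdi-uuid> or LV-<vdi-uuid>).
lvs | grep "$VDI_UUID"
lvremove "/dev/VG_XenStorage-${SR_UUID}/VHD-${VDI_UUID}"

# Tell XenServer to forget the orphaned VDI record, then rescan the SR.
xe vdi-forget uuid="$VDI_UUID"
xe sr-scan uuid="$SR_UUID"
```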

January 18, 2016 · 2 min · 296 words · Sam McLeod

iSCSI SCSI-ID / Serial Persistence

“Having a SCSI ID is a f*cking idiotic thing to do.” - Linus Torvalds …and after the amount of time I’ve wasted getting XenServer to play nicely with LIO iSCSI failover, I tend to agree. The Problem: One oddity of Xen / XenServer’s storage subsystem is that it identifies iSCSI storage repositories via a calculated SCSI ID rather than the iSCSI serial - which would be the sane thing to do. ...
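The excerpt ends before the workaround; purely as a hedged sketch of the general idea for an LIO target managed with targetcli (the backstore name, backing device and serial below are illustrative assumptions, not taken from the post), pinning the backstore’s VPD unit serial identically on both cluster nodes keeps the identifiers the initiator derives its SCSI ID from stable across failover:

```shell
# Illustrative only: create the block backstore and fix its unit serial,
# running the same commands (same serial) on both cluster nodes so the
# SCSI ID the initiator calculates does not change after a failover.
targetcli /backstores/block create name=vm_storage dev=/dev/drbd0
targetcli /backstores/block/vm_storage set wwn vpd_unit_serial=6c9f2b7e-1234-4c3a-9b1e-0a1b2c3d4e5f
targetcli saveconfig
```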

December 14, 2015 · 3 min · 626 words · Sam McLeod

How to cluster and failover (almost) anything - An intro to Pacemaker and Corosync

Slides Failover Demo

November 9, 2015 · 1 min · 3 words · Sam McLeod

SAN Intro

October 7, 2015 · 0 min · 0 words · Sam McLeod

SSD Storage - Two Months In Production

Over the last two months I’ve been running selected IO-intensive servers off the SSD storage cluster. These hosts include (among others) our: primary Puppetmaster, Gitlab server, Redmine app and database servers, Nagios servers, and several Docker database host servers. Reliability: We haven’t had any software or hardware failures since commissioning the storage units. During this time we have had 3 disk failures on our HP StoreVirtual SANs that required us to call the supporting vendor and replace the failed disks. ...

September 13, 2015 · 2 min · 376 words · Sam McLeod

iSCSI Benchmarking

The following are benchmarks from our testing of our iSCSI SSD storage.

67,300 read IOP/s on a VM over iSCSI (Disk -> LVM -> MDADM -> DRBD -> iSCSI target -> Network -> XenServer iSCSI Client -> VM), per VM, scaling to 1,000,000 IOP/s total:

```shell
root@dev-samm:/mnt/pmt1 128 # fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=128 --size=2G --readwrite=read
test: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=128
2.0.8
Starting 1 process
bs: 1 (f=1): [R] [55.6% done] [262.1M/0K /s] [67.3K/0 iops] [eta 00m:04s]
```

38,500 random 4k write IOP/s on a VM over iSCSI (same path), per VM, scaling to 700,000 IOP/s total:

```shell
root@dev-samm:/mnt/pmt1 # fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=128 --size=2G --readwrite=randwrite
test: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=128
2.0.8
Starting 1 process
bs: 1 (f=1): [w] [26.3% done] [0K/150.2M /s] [0 /38.5K iops] [eta 00m:14s]
```

Raw device latency on the storage units:

Intel DC3600 1.2T PCIe NVMe

```shell
root@s1-san6:/proc # ioping /dev/nvme0n1p1
4.0 KiB from /dev/nvme0n1p1 (device 1.1 TiB): request=1 time=104 us
4.0 KiB from /dev/nvme0n1p1 (device 1.1 TiB): request=2 time=83 us
4.0 KiB from /dev/nvme0n1p1 (device 1.1 TiB): request=3 time=51 us
4.0 KiB from /dev/nvme0n1p1 (device 1.1 TiB): request=4 time=71 us
```

SanDisk SDSSDXPS960G SATA

```shell
root@pm-san5:/proc # ioping /dev/sdc
4.0 KiB from /dev/sdc (device 894.3 GiB): request=1 time=4.2 ms
4.0 KiB from /dev/sdc (device 894.3 GiB): request=2 time=4.1 ms
4.0 KiB from /dev/sdc (device 894.3 GiB): request=3 time=4.1 ms
4.0 KiB from /dev/sdc (device 894.3 GiB): request=4 time=4.1 ms
```

Micron_M600_MTFDDAK1T0MBF SATA

```shell
root@pm-san5:/proc # ioping /dev/sdf
4.0 KiB from /dev/sdf (device 953.9 GiB): request=1 time=157 us
4.0 KiB from /dev/sdf (device 953.9 GiB): request=2 time=190 us
4.0 KiB from /dev/sdf (device 953.9 GiB): request=3 time=65 us
4.0 KiB from /dev/sdf (device 953.9 GiB): request=4 time=181 us
```

Latency on a VM (Disk -> LVM -> MDADM -> DRBD -> iSCSI target -> Network -> XenServer iSCSI Client -> VM):

```shell
root@dev-samm:/mnt 127 # ioping pmt1/
4096 bytes from pmt1/ (ext4 /dev/xvdb1): request=1 time=0.6 ms
4096 bytes from pmt1/ (ext4 /dev/xvdb1): request=2 time=0.7 ms
4096 bytes from pmt1/ (ext4 /dev/xvdb1): request=3 time=0.7 ms

--- pmt1/ (ext4 /dev/xvdb1) ioping statistics ---
3 requests completed in 2159.1 ms, 1508 iops, 5.9 mb/s
min/avg/max/mdev = 0.6/0.7/0.7/0.1 ms

root@dev-samm:/mnt # ioping pmt2/
4096 bytes from pmt2/ (ext4 /dev/xvdc1): request=1 time=0.6 ms
4096 bytes from pmt2/ (ext4 /dev/xvdc1): request=2 time=0.8 ms

--- pmt2/ (ext4 /dev/xvdc1) ioping statistics ---
2 requests completed in 1658.4 ms, 1470 iops, 5.7 mb/s
min/avg/max/mdev = 0.6/0.7/0.8/0.1 ms

root@dev-samm:/mnt # ioping pmt3/
4096 bytes from pmt3/ (ext4 /dev/xvde1): request=1 time=0.6 ms
4096 bytes from pmt3/ (ext4 /dev/xvde1): request=2 time=0.9 ms
4096 bytes from pmt3/ (ext4 /dev/xvde1): request=3 time=0.9 ms
```

...

July 24, 2015 · 3 min · 456 words · Sam McLeod

Delayed Serial STONITH

A modified version of John Sutton’s rcd_serial cable coupled with our Supermicro reset switch hijacker: This works with the rcd_serial fence agent plugin. Reasons rcd_serial makes for a very good STONITH mechanism: It has no dependency on power state. It has no dependency on network state. It has no dependency on node operational state. It has no dependency on external hardware. It costs less than $5 + time to build. It is incredibly simple and reliable. Probably the most common STONITH agent types in use are those that control UPSs / PDUs; while this sounds like a good idea in theory, there are a number of issues with relying on a UPS / PDU: ...

July 21, 2015 · 3 min · 450 words · Sam McLeod

Video - Cluster Failover Performance Demo

July 12, 2015 · 0 min · 0 words · Sam McLeod

CentOS 7 and HA

First some background… One of the many lessons I’ve learnt from my Linux HA / storage clustering project is that the Debian HA ecosystem is essentially broken. We reached the point where packages were too old, too buggy or, in Debian 8’s case, outright missing. In the past I was very disappointed with RHEL/CentOS 5 / 6 and (until now) have been quite satisfied with Debian as a stable server distribution with historically more modern packages and kernels. ...

July 7, 2015 · 3 min · 558 words · Sam McLeod

SSD Storage Cluster - Update and Diagram

Due to several recent events beyond my control I’m a bit behind on the project - hence the lack of updates, which I apologise for. The good news is that I’m back working to finish off the clusters and I’m happy to report that all is going to plan. Here is the final diagram of the two-node cluster design: Plain text version available here This was generated with the LCMC tool (beware - it’s Java!). ...

June 17, 2015 · 1 min · 79 words · Sam McLeod

Video - Storage Cluster Failover Demo

A brief demonstration of the failover and recovery process on the storage clusters I’ve been building.

May 14, 2015 · 1 min · 16 words · Sam McLeod

Talk - High Performance Software Defined Storage

A high level talk from Infracoders Melbourne on 12/04/2015. There’s also a low quality recording available here: Related posts: Building a high performance SSD SAN - Part 1

April 15, 2015 · 1 min · 28 words · Sam McLeod

Building a high performance SSD SAN - Part 1

Over the coming month I will be architecting, building and testing a modular, high performance SSD-only storage solution. I’ll be documenting my progress / findings along the way and open sourcing all the information as a public guide. With recent price drops and durability improvements in solid state storage, there has never been a better time to ditch those old magnets. Modular server manufacturers such as SuperMicro have invested heavily in R&D thanks to the ever-growing requirements of the cloud vendors that utilise their hardware. ...

February 16, 2015 · 8 min · 1590 words · Sam McLeod

Direct-Attach SSD Storage - Performance & Comparisons

Further to my earlier post on XenServer storage performance with regards to directly attaching storage from the host, I have been analysing the performance of various SSD storage options. I have attached an HP DS2220sb storage blade to an existing server blade and compared the performance of 4 and 6 SSD RAID-10 with our existing iSCSI SANs. While the P420i RAID controller in the DS2220sb is clearly saturated and unable to provide throughput much over 1,100MB/s, the IOP/s available to PostgreSQL are still a very considerable performance improvement over our P4530 SAN - in fact, 6 SSDs result in a 39.9x performance increase! ...

February 15, 2015 · 1 min · 110 words · Sam McLeod

XenServer, SSDs & VM Storage Performance

Intro At Infoxchange we use XenServer as our virtualisation platform of choice. There are many reasons for this, including: Open source. Offers greater performance than VMware. Affordability (it’s free unless you purchase support). Proven backend - Xen is very reliable. Reliable cross-host migrations of VMs. The XenCenter client (although it has to run in a Windows VM) is quick and simple to use. Upgrades and patches have proven to be more reliable than with VMware. OpenStack, while interesting, is not yet reliable or streamlined enough for our small team of 4 to implement and manage. XenServer Storage & Filesystems Unfortunately the downside to XenServer is that its underlying OS is quite old. The latest version (6.5), about to be released, is still based on CentOS 5 and still lacks any form of EXT4 or BTRFS support; direct disk access is not available without some tweaking, and there is no real support for TRIM unless you have direct disk access and are happy with EXT3. ...

February 15, 2015 · 5 min · 970 words · Sam McLeod