Flash Storage and SSD Failure Rate Update (March 2018)

It was almost 3 years ago that my open source storage project went into production. In that time it’s been running 24/7, serving as highly available solid state storage for hundreds of VMs and several virtualisation clusters across our two main sites. I’m happy to report that the clusters have been operating very successfully since their inception. Since moving away from proprietary ‘black box’ vendor SANs, we haven’t had a single SAN issue or storage outage. ...

March 20, 2018 · 2 min · 326 words · Sam McLeod

Talk - Clustered, Distributed File and Volume Storage with GlusterFS

Using GlusterFS to provide volume storage to Kubernetes as a replacement for our existing file and static content hosting. This talk was given at Infracoders on Tuesday 14th November 2017. NOTE: The link to the slides below is currently broken - will fix soon! (03/08/2019) Click below to view the slides (PDF version): Direct download link

November 14, 2017 · 1 min · 52 words · Sam McLeod

GlusterFS

We’re in the process of shifting from our custom ‘glue’ for orchestrating Docker deployments to Kubernetes. When we first deployed Docker to replace LXC and our legacy Puppet-heavy application configuration and deployment systems, there really wasn’t any existing tool to manage this, so we rolled our own: mainly a few Ruby scripts combined with a Puppet / Hiera / Mcollective driven workflow. The main objective is to replace our legacy NFS file servers used to host uploads / attachments and static files for our web applications. While NFS(v4) performance is adequate, it is a clear single point of failure, and of course there are the age-old stale mount problems should network interruptions occur. ...
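As a rough sketch of the direction described above - not the exact configuration in use here - a replicated GlusterFS volume can stand in for an NFS export along these lines (the hostnames gluster01-03 and the brick paths are placeholders):

```
# Hypothetical 3-way replicated volume; hostnames and brick paths are assumptions
gluster volume create gv0 replica 3 \
  gluster01:/data/brick1/gv0 \
  gluster02:/data/brick1/gv0 \
  gluster03:/data/brick1/gv0
gluster volume start gv0

# Clients mount it with the native FUSE client instead of NFS
mount -t glusterfs gluster01:/gv0 /mnt/uploads
```

With three replicas the data lives on three hosts, removing the single point of failure that a lone NFS server represents.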

September 25, 2017 · 6 min · 1106 words · Sam McLeod

Update Delayed Serial STONITH Design

note: This is a follow-up post to 2015-07-21-rcd-stonith - a Linux cluster-based STONITH provider for use with modern Pacemaker clusters. This has since been accepted and merged into Fedora’s code base and as such will make its way to RHEL. Source code: Github. Diptrace CAD design: Github. I have open sourced the CAD circuit design and made it available within this repo under CAD Design and Schematics. Related RedHat bug: https://bugzilla.redhat.com/show_bug.cgi?id=1240868 v1 vs v2/v3 versions of the rcd_serial STONITH system: the v2/v3 cables include the following improvements: ...

July 4, 2016 · 2 min · 217 words · Sam McLeod

Benchmarking IO with FIO

This is a quick TL;DR - there are many other situations and options you could consider (see the FIO man page). IOP/s = Input or Output operations per second. Throughput = how many MB/s you can read/write continuously. Variables worth tuning based on your situation: --iodepth - the iodepth is very dependent on your hardware. Rotational drives without much cache and high latency (i.e. desktop SATA drives) will not benefit from a large iodepth; values between 16 and 64 could be sensible. ...
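As a minimal sketch of the kind of run being tuned (the device, block size, queue depth and runtime below are placeholders, not recommendations):

```
# Hypothetical 4k random-read benchmark; /dev/sdX is a placeholder device
fio --name=randread-test \
    --filename=/dev/sdX \
    --rw=randread --bs=4k --direct=1 \
    --ioengine=libaio --iodepth=32 \
    --runtime=60 --time_based \
    --group_reporting
```

Repeating the run with different --iodepth values (e.g. 1, 16, 64) shows where a given device stops scaling.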

April 29, 2016 · 2 min · 393 words · Sam McLeod

Fix XenServer SR with corrupt or invalid metadata

If a disk / VDI is orphaned or only partially deleted, you’ll notice that under the SR it’s not assigned to any VM. This can cause issues that look like metadata corruption, resulting in the inability to migrate VMs or edit storage. For example: [root@xenserver-host ~]# xe vdi-destroy uuid=6c2cd848-ac0e-441c-9cd6-9865fca7fe8b Error code: SR_BACKEND_FAILURE_181 Error parameters: , Error in Metadata volume operation for SR. [opterr=VDI delete operation failed for parameters: /dev/VG_XenStorage-3ae1df17-06ee-7202-eb92-72c266134e16/MGT, 6c2cd848-ac0e-441c-9cd6-9865fca7fe8b. Error: Failed to write file with params [3, 0, 512, 512]. Error: 5]. Removing stale VDIs: to fix this, you need to remove those VDIs from the SR after first deleting the logical volume: ...
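The excerpt is cut off before the commands, but the general shape of the fix is roughly the following (a sketch only - the UUIDs and volume group name are placeholders, not the values from the error above):

```
# Illustrative sketch - substitute the real SR and VDI UUIDs
lvremove /dev/VG_XenStorage-<sr-uuid>/VHD-<vdi-uuid>   # delete the logical volume backing the stale VDI
xe vdi-forget uuid=<vdi-uuid>                          # drop the VDI record from the XAPI database
xe sr-scan uuid=<sr-uuid>                              # rescan the SR so its metadata is refreshed
```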

January 18, 2016 · 2 min · 296 words · Sam McLeod

iSCSI SCSI-ID / Serial Persistence

“Having a SCSI ID is a f*cking idiotic thing to do.” - Linus Torvalds …and after the amount of time I’ve wasted getting XenServer to play nicely with LIO iSCSI failover I tend to agree. The Problem One oddity of Xen / XenServer’s storage subsystem is that it identifies iSCSI storage repositories via a calculated SCSI ID rather than the iSCSI Serial - which would be the sane thing to do. ...
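For context, the two identifiers in question can be inspected on the host roughly like this (the device name is a placeholder):

```
# VPD page 0x83 (device identification) - the data the calculated SCSI ID is derived from
/lib/udev/scsi_id --whitelisted --page=0x83 --device=/dev/sdX

# VPD page 0x80 (unit serial number) - the iSCSI serial that would be the saner identifier
/lib/udev/scsi_id --whitelisted --page=0x80 --device=/dev/sdX
```

If those values change when failing over between LIO targets, XenServer no longer recognises the storage repository, which is the problem described above.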

December 14, 2015 · 3 min · 626 words · Sam McLeod

How to cluster and failover (almost) anything - An intro to Pacemaker and Corosync

Slides Failover Demo

November 9, 2015 · 1 min · 3 words · Sam McLeod

SAN Intro

October 7, 2015 · 0 min · 0 words · Sam McLeod

SSD Storage - Two Months In Production

Over the last two months I’ve been running selected IO-intensive servers off the SSD storage cluster. These hosts include (among others) our primary Puppetmaster, Gitlab server, Redmine app and database servers, Nagios servers, and several Docker database host servers. Reliability: We haven’t had any software or hardware failures since commissioning the storage units. During this time we have had 3 disk failures on our HP StoreVirtual SANs that have required us to call the supporting vendor and replace failed disks. ...

September 13, 2015 · 2 min · 376 words · Sam McLeod