I just lost 62TB of data in a drive array.

"But Jim!?", you ask, knowing Jim isn't usually stupid enough to paint himself into a corner and set up a condition that causes a data loss of any size, much less one of 62TB, "How did this happen?"

Step 1. Use a known-good, respected RAID card from LSI (2108, SAS2, 8-lane external, RAID6 support - have 2 others in production).
Step 2. Use a mix of hard drives known to be decent for their size and class (Toshiba, HGST in 4TB SAS2) - 28 total.
Step 3. RAID the pile into a RAID6 array inside a redundant SAS2 JBOD box.
Step 4. Use LVM2 to split the 95TB of space into multiple 10TB thin pools, each holding a single 10TB thin logical volume.
Step 5. Build another one of these systems to match.
Step 6. Use Gluster to create nearly 100TB of redundant storage split between the two machines.
Step 7. Use the storage cluster for a bit over a year until a single drive dies early.
Step 8. Replace the failed drive with a new one.
Step 9. Watch in horror as the RAID subsystem locks up and the thin pool metadata gets scrambled.
Step 10. Diagnose the failure:

Step 0 {
Weeks of study into the complete workings of LVM, RAID and Gluster prior to the hardware purchase. The RAID card was already in use in 2 other systems with great success. Dead drives were easily popped out, new ones slammed in, auto-recovery commenced, and a day or two later all was fine.

Gluster is a pretty useful way to provide failover ability in a storage cluster. Not too hard to set up. It has a few performance gotchas but overall, plenty of capability for the need.

LVM2 is a pretty well known and understood entity, even with the addition of thin pools (recommended by Gluster as a way to easily extend brick space "on the fly"). A thin pool holds the physical extents for its logical volumes and allocates them only when they are written. Thin logical volumes are sparse and can be virtually sized larger than the actual space. A thin pool can be expanded without touching its thin volumes. Yeah, sounds like a great match. (Steps 4 and 6 in command form are roughly the first sketch below.)
}

I didn't read the section in the docs about deleting thin pools and thin volumes, since I was _creating_ them. I would have seen the blurb that "vgcfgbackup does not back up thin pool metadata."

Hmm. That is not good. Apparently there is only a SINGLE copy of the thin pool metadata and there's no way to back it up. WTF?!?!?

LVM2 does have a sequence of backup and archival for metadata, but it's not useful for thin pools. It looks like it might be possible to do something funky-ugly like swap the thin pool metadata with/into a new LV, take a snapshot of the new LV, then replay the snapshot back into the original thin pool, but no one actually has any real "yeah, this works" process mapped out anywhere.

There are a few tools to do some repair (lvconvert --repair VG/thinpool) and they come with greatly encouraging words like "If the repair does not work, the thin pool LV and its thin LVs are lost." Yep. That sounds more like reality. (The second sketch below is roughly what that repair and metadata-swap dance looks like.)

Apparently, the LSI MegaRAID card needs a firmware update to play nice with the kernels in CentOS 7. Hmm. That was updated when the card went in. So the new firmware causes behavior different from the documented drive replacement methodology. Nice.
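
Since the step list glosses over the actual commands, here is roughly what Steps 4 and 6 look like. Treat it as a sketch: the names (vg_brick, pool00, brick00, gvol0, server1/server2) and the device path are made up for illustration, and the sizes and options are not necessarily the exact ones used.

  # the RAID6 virtual disk exposed by the LSI card, used whole as a PV
  pvcreate /dev/sdb
  vgcreate vg_brick /dev/sdb

  # Step 4: a 10TB thin pool with a single 10TB thin LV inside it
  # (repeat per pool; the metadata LV size here is a guess)
  lvcreate --type thin-pool -L 10T --poolmetadatasize 8G -n pool00 vg_brick
  lvcreate --thin -V 10T -n brick00 vg_brick/pool00

  # filesystem and mount point for the Gluster brick
  mkfs.xfs -i size=512 /dev/vg_brick/brick00
  mkdir -p /bricks/brick00
  mount /dev/vg_brick/brick00 /bricks/brick00

  # Step 6: mirror the bricks across the two machines as a replica-2 volume
  gluster volume create gvol0 replica 2 \
      server1:/bricks/brick00/data server2:/bricks/brick00/data
  gluster volume start gvol0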
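
For the record, the repair options boil down to something like the following. This is a sketch, not a verified recovery procedure: the names carry over from the sketch above, the exact lvconvert invocation can differ by LVM version, and the metadata-swap dance is exactly the part nobody seems to have a confirmed working write-up for.

  # option 1: the built-in repair (wraps thin_repair from device-mapper-persistent-data)
  lvconvert --repair vg_brick/pool00

  # option 2: the funky-ugly manual route - swap the pool's metadata out
  # into a plain LV so the thin tools can look at it directly
  lvchange -an vg_brick/pool00              # the pool must be inactive
  lvcreate -L 8G -n meta_dump vg_brick      # scratch LV, at least as big as the pool metadata
  lvconvert --thinpool vg_brick/pool00 --poolmetadata vg_brick/meta_dump
  # after the swap the pool's old metadata is accessible as vg_brick/meta_dump
  lvchange -ay vg_brick/meta_dump
  thin_check /dev/vg_brick/meta_dump
  thin_dump /dev/vg_brick/meta_dump > /root/pool00_metadata.xml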
The only bright spot in this is that the other half of the storage cluster mirror is doing just fine. But now I get to format 90+TB, copy back over 60+TB, and then tell Gluster to take over and resync everything (the heal commands are sketched at the end of this note). At least I have a 40Gbps ethernet connection between the two machines. (5 days of file copies, assuming no one uses the remaining mirror - not going to happen.)

My new mantra: Always have enough decent Scotch on hand to handle any occasion.

The warning: Always break things before you depend on them. It's easier to fix stuff you already know the insides of when you have a total failure. Read ALL the docs. When it's infrastructure-level stuff, get real experience breaking things every way possible and then fixing them before you get stuck having to do it live. Backups/duplicates are a crutch, but one that is essential to have.
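
For the resync mentioned above, once the rebuilt bricks are formatted, mounted and back in the volume (re-attaching them through Gluster's replace-brick/reset-brick machinery is its own topic), kicking off and watching the heal looks roughly like this, with the same hypothetical volume name as in the sketches above:

  # force a full self-heal from the surviving replica onto the rebuilt bricks
  gluster volume heal gvol0 full

  # watch what is still pending
  gluster volume heal gvol0 info
  gluster volume status gvol0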
--
James P. Kinney III
Every time you stop a school, you will have to build a jail. What you
gain at one end you lose at the other. It's like feeding a dog on his
own tail. It won't fatten the dog.
- Speech 11/23/1900 Mark Twain
http://heretothereideas.blogspot.com/