[ale] One for the archives

Mon Mar 5 20:54:56 EST 2007

On Mon, 2007-03-05 at 08:43 -0500, Jeff Lightner wrote:
> Even on a read it "writes" access time bits to directories and files.
> The only way to prevent it would be to mount it completely as read only
> so it can't even change access time.

I'll have to dig in the source code on software raid. atime should not
be adjusted during a sync process as this is below the filesystem level.
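
For the filesystem-level half of Jeff's point, here is a rough sketch (my own illustration, not anything taken from the md code) of checking whether a mount could have its access times touched at all. It only parses text in the /proc/mounts layout; the device names and the /srv/samba mount point in the sample are made up for the example, and it treats only "ro" and "noatime" as blocking atime updates, which is a simplification.

```python
# Hypothetical sample in /proc/mounts layout: device, mount point, type, options, dump, pass.
SAMPLE_MOUNTS = (
    "/dev/sda1 / ext3 rw 0 0\n"
    "/dev/md0 /srv/samba ext3 rw,noatime 0 0\n"
)


def mount_options(mounts_text, mount_point):
    """Return the set of mount options for mount_point, or None if it is not listed."""
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # Keep the last match: a later mount on the same point shadows earlier ones.
            found = set(fields[3].split(","))
    return found


def can_update_atime(options):
    """True unless the filesystem is mounted read-only or with noatime (simplified rule)."""
    return not ({"ro", "noatime"} & options)


if __name__ == "__main__":
    opts = mount_options(SAMPLE_MOUNTS, "/srv/samba")
    print(opts, can_update_atime(opts))
```

On a live box the text would come from reading /proc/mounts instead of the sample string. None of this says anything about what the md resync itself writes, which is the part I still need to check.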
> 
> -----Original Message-----
> From: ale-bounces at ale.org [mailto:ale-bounces at ale.org] On Behalf Of
> James P. Kinney III
> Sent: Monday, March 05, 2007 8:25 AM
> To: Atlanta Linux Enthusiasts
> Subject: Re: [ale] One for the archives
> 
> On Mon, 2007-03-05 at 07:50 -0500, Jeff Lightner wrote:
> > Bacula doesn't have catalog backups?   In NetBackup this is an important
> > thing.  You can later reinstall the software then restore the most
> > recent catalog backup and voila all your tapes are there.
> 
> Bacula does have catalog backups. They are on tape. The screwup on my
> end was not having them also somewhere else. DLT7000 is painfully slow
> to do a bscan recovery from.
> > 
> > RAID5 - I'm assuming you meant software RAID?  I've seen many a failure
> > of systems that didn't affect hardware RAID in the way you're
> > describing.
> 
> Yep. Software RAID5. What I _don't_ understand is how a power fail event
> can scramble the existing drives when all they are doing is a read. It
> seems like a nasty bug that allows a write process to exist during what
> seems to me to be a read only process on all drives but the replaced
> drive being rebuilt.
> > 
> > -----Original Message-----
> > From: ale-bounces at ale.org [mailto:ale-bounces at ale.org] On Behalf Of
> > James P. Kinney III
> > Sent: Saturday, March 03, 2007 9:27 PM
> > To: Atlanta Linux Enthusiasts
> > Subject: [ale] One for the archives
> > 
> > A server got hosed because of the following series of failures. Since
> > the final step was a major "GOTCHA", I am sharing it here now so that
> > others can avoid the pain later.
> > 
> > Background:
> > 
> > Main SOHO server with SCSI card for tape backup (old DLT 7000) and x4
> > 200GB SATA in a software RAID setup. The main data storage area (a big
> > samba share spot) was stored across all 4 drives in a RAID 5 array.
> > 
> > System hiccups and reports a failed drive (it won't spin up at all). No
> > problem. Not a hot-swap system, so it is taken down, the drive replaced,
> > and the system rebooted to run-level 1. A console tail of /proc/mdstat
> > shows the system is doing a drive recovery/repair onto the
> > new hard drive. Everything looks good.
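
For anyone who would rather script that check than eyeball the console, a minimal sketch follows. It assumes the usual /proc/mdstat text layout; the sample text is invented for illustration (it is not output from the server in this story), and the regexes only handle array names of the form mdN.

```python
import re

# Invented sample in the usual /proc/mdstat layout: a 4-disk RAID 5 rebuilding one member.
SAMPLE_MDSTAT = """\
Personalities : [raid5]
md0 : active raid5 sdd1[4] sdc1[2] sdb1[1] sda1[0]
      585937152 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
      [==>..................]  recovery = 12.6% (24678912/195312384) finish=127.5min speed=22302K/sec

unused devices: <none>
"""


def parse_mdstat(text):
    """Return {array: {"degraded": bool, "recovery_pct": float or None}}."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            arrays[current] = {"degraded": False, "recovery_pct": None}
            continue
        if current is None:
            continue
        # Member map such as "[4/3] [UUU_]"; an underscore marks a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status:
            arrays[current]["degraded"] = "_" in status.group(3)
        # Progress line appears only while a recovery or resync is running.
        progress = re.search(r"(recovery|resync)\s*=\s*([\d.]+)%", line)
        if progress:
            arrays[current]["recovery_pct"] = float(progress.group(2))
    return arrays


if __name__ == "__main__":
    print(parse_mdstat(SAMPLE_MDSTAT))
```

On a real system the input would be the contents of /proc/mdstat. A script like this only reports state; it would not have caught the failing power supply.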
> > 
> > After some period of time (approximately 10-20 minutes) the system is
> > seen REBOOTING!
> > 
> > It was assumed that all was OK as after the reboot, no forced filesystem
> > checks occurred. It was quite odd that the server would shut down like
> > that. About 2-3 minutes later, it rebooted itself again.
> > 
> > At this time it was determined that the power supply was failing.
> > 
> > It was replaced.
> > 
> > Later, it was determined that almost all of the files in the samba share
> > section were scrambled. And the backup application (bacula) had lost all
> > of its config files and the backup catalog.
> > 
> > Then the database failed to start.
> > 
> > Panic begins to creep in. The power blink during the hard drive recovery
> > had apparently caused massive damage to the storage systems.
> > 
> > A new drive and fresh OS were installed. The old RAID arrays were mounted
> > in order to extract what was usable from the samba shares. Email files
> > recovered OK, as did home directories. But the samba shares were
> > still screwball, as were the backup system catalog and database.
> > 
> > So the process was begun to extract the backup catalog off the tapes.
> > Searching for the catalog files is a painfully laborious task on a
> > poky-slow tape drive when there are 21 tapes to sift through.
> > 
> > While the backups were being hunted down, calendar time continues on and
> > several weeks go by with no working backups (only one tape drive, and it
> > spent all day "collecting its thoughts" for recovery). A file from the
> > samba share was discovered to be clearly scrambled and worthless (an
> > installation disk for an application that had been stored with an md5
> > checksum). So it was deleted since the disk was available and it would
> > need to be recopied anyway.
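
Since that file had a stored md5 checksum, here is a minimal sketch of how a file can be compared against one. The function names are my own, made up for the example, and the comparison is a plain hex-string match. It only reads the file; it does nothing to make a damaged filesystem safe to write to or delete from.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at path, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """True if the file's MD5 equals expected_hex (case and surrounding whitespace ignored)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

A mismatch says the copy is bad, which is how this one was spotted. It says nothing about how far the damage goes in the filesystem metadata underneath.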
> > 
> > The delete took a long time to return from.
> > 
> > The entire filesystem had been deleted.
> > 
> > Everything. All files. 
> > 
> > The file was deleted from within its containing directory using the command
> > rm <filename> and then answering "yes" to the "are you sure" prompt.
> > 
> > As far as can be discerned, the file corruption was bad enough that the
> > delete process was redirected to another point in the filesystem where
> > massive deletion occurred.
> > 
> > The moral of this story is three-fold:
> > 
> > 1. Bare-metal recovery of the backup system is both hard and more
> > important than air.
> > 
> > 2. Any filesystem that becomes corrupted because of a RAID 5 malfunction
> > should not be trusted at all under any circumstances. It should be
> > removed from the system and overwritten immediately and the contents
> > recovered from backups.
> > 
> > 3. Any time a drive fails in a RAID system, go ahead and replace the
> > power supply for safety reasons. Unless it is a redundant power supply
> > (this was not) it will certainly cost less than the antacid bill on
> > this.
> > 
> > -- 
> > James P. Kinney III          
> > CEO & Director of Engineering 
> > Local Net Solutions,LLC        
> > 770-493-8244                    
> > http://www.localnetsolutions.com
> > 
> > GPG ID: 829C6CA7 James P. Kinney III (M.S. Physics)
> > <jkinney at localnetsolutions.com>
> > Fingerprint = 3C9E 6366 54FC A3FE BA4D 0659 6190 ADC3 829C 6CA7
-- 
James P. Kinney III          
CEO & Director of Engineering 
Local Net Solutions,LLC        
770-493-8244                    
http://www.localnetsolutions.com

GPG ID: 829C6CA7 James P. Kinney III (M.S. Physics)
<jkinney at localnetsolutions.com>
Fingerprint = 3C9E 6366 54FC A3FE BA4D 0659 6190 ADC3 829C 6CA7