[ale] One for the archives

Jeff Lightner jlightner at water.com
Mon Mar 5 08:43:02 EST 2007


Even on a read it "writes" access time stamps to directories and files.
The only way to prevent that would be to mount the filesystem completely
read-only so it can't even update access times.
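
For example, a minimal sketch (device and mount point are only
illustrative) of mounting it so nothing at all gets written:

  # remount an already-mounted filesystem read-only
  mount -o remount,ro /srv/samba
  # or mount it read-only from the start
  mount -o ro /dev/md0 /mnt/recovery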

-----Original Message-----
From: ale-bounces at ale.org [mailto:ale-bounces at ale.org] On Behalf Of
James P. Kinney III
Sent: Monday, March 05, 2007 8:25 AM
To: Atlanta Linux Enthusiasts
Subject: Re: [ale] One for the archives

On Mon, 2007-03-05 at 07:50 -0500, Jeff Lightner wrote:
> Bacula doesn't have catalog backups?   In NetBackup this is an important
> thing.  You can later reinstall the software then restore the most
> recent catalog backup and voila all your tapes are there.

Bacula does have catalog backups. They are on tape. The screwup on my
end was not having them also somewhere else. DLT7000 is painfully slow
to do a bscan recovery from.
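
For the record, rebuilding the catalog from a volume goes roughly like
this (volume name and device are examples; see the bscan man page for
the exact flags):

  bscan -v -s -m -c /etc/bacula/bacula-sd.conf -V Full-0001 /dev/nst0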
> 
> RAID5 - I'm assuming you meant software RAID?  I've seen many a failure
> of systems that didn't affect hardware RAID in the way you're
> describing.

Yep. Software RAID5. What I _don't_ understand is how a power-fail event
can scramble the existing drives when all they are doing is reads. It
seems like a nasty bug that writes can happen during what should be a
read-only operation on every drive except the replacement being rebuilt.
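
For what it's worth, the state of the array and its members can at least
be inspected after the fact with something like this (md device and
partition names are examples):

  cat /proc/mdstat
  mdadm --detail /dev/md0
  mdadm --examine /dev/sda1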
> 
> -----Original Message-----
> From: ale-bounces at ale.org [mailto:ale-bounces at ale.org] On Behalf Of
> James P. Kinney III
> Sent: Saturday, March 03, 2007 9:27 PM
> To: Atlanta Linux Enthusiasts
> Subject: [ale] One for the archives
> 
> A server got hosed because of the following series of failures. Since
> the final step was a major "GOTCHA", I am sharing it here now so that
> others can avoid the pain later.
> 
> Background:
> 
> Main SOHO server with a SCSI card for tape backup (old DLT 7000) and 4x
> 200GB SATA drives in a software RAID setup. The main data storage area
> (a big samba share) was spread across all 4 drives in a RAID 5 array.
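> 
> (For reference, an array like that is typically built with something
> along these lines; the device names here are only illustrative:)
> 
>   mdadm --create /dev/md0 --level=5 --raid-devices=4 \
>         /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1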
> 
> System hiccups and reports a failed drive (it won't spin up at all). No
> problem. Not a hot-swap system, so it is taken down, the drive replaced,
> and the system rebooted to run-level 1. A console tail of /proc/mdstat
> shows the system doing a drive recovery/repair onto the new hard drive.
> Everything looks good.
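> 
> (The replacement is the usual mdadm sequence, something like the
> following; device names are examples:)
> 
>   mdadm /dev/md0 --add /dev/sdc1    # add the new drive; rebuild starts
>   watch cat /proc/mdstat            # monitor the recovery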
> 
> After some period of time (approximately 10-20 minutes) the system is
> seen REBOOTING!
> 
> It was assumed that all was OK since, after the reboot, no forced
> filesystem checks occurred. It was quite odd that the server would shut
> down like that. About 2-3 minutes later, it rebooted itself again.
> 
> At this time it was determined that the power supply was failing.
> 
> It was replaced.
> 
> Later, it was determined that almost all of the files in the samba share
> section were scrambled. And the backup application (bacula) had lost all
> of its config files and the backup catalog.
> 
> Then the database failed to start.
> 
> Panic begins to creep in. The power blink during the hard drive recovery
> had apparently caused massive damage to the storage systems.
> 
> A new drive and a fresh OS were installed. The old RAID arrays were
> mounted in order to extract what was usable from the samba shares. Email
> files and home directories recovered OK. But the samba shares were still
> screwball, as were the backup system catalog and database.
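> 
> (In hindsight, the safer way to poke at the old arrays would have been
> to reassemble and mount them read-only, roughly like this; names are
> illustrative:)
> 
>   mdadm --assemble --readonly /dev/md1 /dev/sdb1 /dev/sdc1 /dev/sdd1
>   mount -o ro /dev/md1 /mnt/old-raid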
> 
> So the process was begun to extract the backup catalog off the tapes.
> Searching for the catalog files is a painfully laborious task on a
> poky-slow tape drive when there are 21 tapes to sift through.
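> 
> (Hunting for the catalog job on a given tape means, more or less,
> listing the job records on each volume with bls; the volume name and
> device are examples:)
> 
>   bls -j -c /etc/bacula/bacula-sd.conf -V Full-0007 /dev/nst0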
> 
> While the backups were being hunted down, calendar time continued on and
> several weeks went by with no working backups (only one tape drive, and
> it spent all day "collecting its thoughts" for the recovery). A file from
> the samba share was discovered to be clearly scrambled and worthless (an
> installation disk for an application that had been stored with an md5
> checksum). So it was deleted, since the disk was available and it would
> need to be recopied anyway.
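> 
> (Which is presumably how the scrambling showed up: a checksum check
> along these lines fails immediately; the file name is illustrative:)
> 
>   md5sum -c app-installer.iso.md5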
> 
> The delete took a long time to return from.
> 
> The entire filesystem had been deleted.
> 
> Everything. All files. 
> 
> The file was deleted from within its containing directory using the
> command rm <filename> and then answering "yes" to the "are you sure"
> prompt.
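> 
> (rm does not normally ask "are you sure", so it was presumably aliased
> to interactive mode, as many distributions do for root:)
> 
>   alias rm='rm -i'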
> 
> As far as can be discerned, the filesystem corruption was bad enough
> that the delete process was redirected to another point in the
> filesystem, where massive deletion occurred.
> 
> The moral of this story is three-fold:
> 
> 1. Bare-metal recovery of the backup system is both hard and more
> important than air.
> 
> 2. Any filesystem that becomes corrupted because of a RAID 5 malfunction
> should not be trusted at all under any circumstances. It should be
> removed from the system and overwritten immediately (see the sketch
> after this list) and the contents recovered from backups.
> 
> 3. Any time a drive fails in a RAID system, go ahead and replace the
> power supply for safety reasons. Unless it is a redundant power supply
> (this was not), it will certainly cost less than the antacid bill on
> this.
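> 
> (For moral 2, "overwritten" means wiping both the RAID metadata and the
> member disks before they are reused, something along these lines; the
> device names are illustrative:)
> 
>   mdadm --stop /dev/md0
>   mdadm --zero-superblock /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
>   dd if=/dev/zero of=/dev/sda bs=1M    # repeat for each member disk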
> 
> -- 
> James P. Kinney III          
> CEO & Director of Engineering 
> Local Net Solutions,LLC        
> 770-493-8244                    
> http://www.localnetsolutions.com
> 
> GPG ID: 829C6CA7 James P. Kinney III (M.S. Physics)
> <jkinney at localnetsolutions.com>
> Fingerprint = 3C9E 6366 54FC A3FE BA4D 0659 6190 ADC3 829C 6CA7
> _______________________________________________
> Ale mailing list
> Ale at ale.org
> http://www.ale.org/mailman/listinfo/ale
-- 
James P. Kinney III          
CEO & Director of Engineering 
Local Net Solutions,LLC        
770-493-8244                    
http://www.localnetsolutions.com

GPG ID: 829C6CA7 James P. Kinney III (M.S. Physics)
<jkinney at localnetsolutions.com>
Fingerprint = 3C9E 6366 54FC A3FE BA4D 0659 6190 ADC3 829C 6CA7


