[ale] Failed drives on SuperMicro server backplane
Jeff Hubbs
jhubbslist at att.net
Fri Oct 23 09:45:26 EDT 2009
Looks like I picked a bad time to re-join the list and ask a
Linux-related question, but I just wanted to follow up...
I've been walked through tw_cli (3ware's userspace config utility) over
IRC, and it looks like it may be the 3ware SATA card that failed rather
than the two drives. However, I am told that I should
not proceed debugging/replacing/etc. without first updating the firmware
on the Seagate drives themselves (ST31000340AS). To their substantial
credit, Seagate has a downloadable ISO to make a bootable CD that will
flash the firmware from any working PC with SATA. So, now that the
array is unmounted and the md devices stopped, I'm going to pull the
remaining six drives and the two spares from the server and, along with
the two "failed" drives, update the firmware on all ten of them.
I am really starting to question my decision to use kernel RAID for
these arrays. It makes for faster arrays, but no one is really going to
appreciate having to "man mdadm" to figure out how to fail-mark and
remove drives before pulling them, and again to add the replacements
and rebuild the arrays. Even *I* would trade some I/O rate for being
able to "slam sleds and forget."
Opinion?
Jeff Hubbs wrote:
> I've had two of eight SATA drives on a 3ware 9550 card fail due to a
> protracted overtemp condition (SPOF HVAC).
>
> The eight drives are arranged in kernel RAID1 pairs and the four pairs
> are then kernel RAID0ed (yes, it flies). The two failed drives are in
> different pairs (thank goodness) so the array stayed up. I've used
> mdadm --fail and mdadm --remove to properly mark and take out the bad
> drives and I've replaced them with on-hand spares.
>
> The problem is that even with the new drives in, I don't have a usable
> sde or sdk anymore. For instance:
>
> # fdisk /dev/sde
> Unable to read /dev/sde
>
> [note: I've plugged spare drives into another machine and they fdisk
> there just fine]
>
> In my critical log I've got "raid1: Disk failure on sde, disabling
> device" and another such message for sdk...is there a way I can
> re-enable them w/o a reboot?
>
> Two related questions:
> This array is in a SuperMicro server with a 24-drive backplane in the
> front. When the two SATA drives failed, there was no LED indication
> anywhere. Looking at the backplane manual, there are six unused I2C
> connectors, and I only have the defaults for I2C support in the
> kernel. The manual also says that the backplane can use either I2C or
> SGPIO. Is there a way I can get red-LED-on-drive-failure functionality
> (red LEDs come on briefly across the whole backplane at power-on)?
>
> I've set up this array and one other 14-drive array on this machine
> using whole disks - i.e., /dev/sde instead of /dev/sde1 of type fd. How
> good/bad is that idea? One consideration is that I'm wanting to be able
> to move the arrays to another similar machine in case of a whole-system
> failure and have the arrays just come up; so far, that has worked fine
> in tests.