[ale] Failed drives on SuperMicro server backplane
Jeff Hubbs
jhubbslist at att.net
Fri Oct 23 09:45:26 EDT 2009
Looks like I picked a bad time to re-join the list and ask a
Linux-related question, but I just wanted to follow up...
I've been walked through tw_cli (3ware's userspace config utility) over
IRC, and it looks like it may be the 3ware SATA card that failed rather
than the two drives. However, I am told that I should
not proceed debugging/replacing/etc. without first updating the firmware
on the Seagate drives themselves (ST31000340AS). To their substantial
credit, Seagate has a downloadable ISO to make a bootable CD that will
flash the firmware from any working PC with SATA. So, now that the
array is unmounted and the md devices stopped, I'm going to pull the
remaining six drives and the two spares from the server and, along with
the two "failed" drives, update the firmware on all ten of them.
I am really starting to question my decision to use kernel RAID for
these arrays. It makes for faster arrays, but no one is really going to
appreciate having to "man mdadm" to figure out how to fail-mark and
remove drives before pulling them, and again to add the replacements
and rebuild the arrays. Even *I* would trade some I/O rate for being
able to "slam sleds and forget."
Opinion?
Jeff Hubbs wrote:
> I've had two of eight SATA drives on a 3ware 9550 card fail due to a
> protracted overtemp condition (SPOF HVAC).
>
> The eight drives are arranged in kernel RAID1 pairs and the four pairs
> are then kernel RAID0ed (yes, it flies). The two failed drives are in
> different pairs (thank goodness) so the array stayed up. I've used
> mdadm --fail and mdadm --remove to properly mark and take out the bad
> drives and I've replaced them with on-hand spares.
>
> The problem is that even with the new drives in, I don't have a usable
> sde or sdk anymore. For instance:
>
> # fdisk /dev/sde
> Unable to read /dev/sde
>
> [note: I've plugged spare drives into another machine and they fdisk
> there just fine]
>
> In my critical log I've got "raid1: Disk failure on sde, disabling
> device" and another such message for sdk...is there a way I can
> re-enable them w/o a reboot?
>
> Two related questions:
> This array is in a SuperMicro server with a 24-drive backplane in the
> front. When the two SATA drives failed, there was no LED indication
> anywhere. Looking at the backplane manual, there are six unused I2C
> connectors, and I only have the defaults for I2C support in the
> kernel. The manual also says that the backplane can use either I2C or
> SGPIO. Is there a way I can get red-LED-on-drive-failure functionality
> (red LEDs come on briefly across the whole backplane at power-on)?
>
> I've set up this array and one other 14-drive array on this machine
> using whole disks - i.e., /dev/sde instead of /dev/sde1 of type fd. How
> good/bad is that idea? One consideration is that I'm wanting to be able
> to move the arrays to another similar machine in case of a whole-system
> failure and have the arrays just come up; so far, that has worked fine
> in tests.