<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body bgcolor="#ffffff" text="#000000">
I have a graphic design client with a 2U server running Fedora 11, now
upgraded to 12, sitting at a colo and handling their backups. The
server has 8 drives with Linux md RAID arrays and LVM on top of them.
The primary filesystems are ext4, and there is/was an LVM swap volume.<br>
<br>
I've had an absolutely awful experience with these Seagate 1.5 TB
drives, returning 10 of the original 14 because of an ever-increasing
SMART Reallocated_Sector_Ct from bad blocks. The server at the client's
office has a 3ware 9650 (I think) that has done a great job of handling
bad blocks from this same batch of drives, and it sent email
notifications as one of the drives grew more and more bad blocks. This
2U, though, is pure software RAID, and it has started locking up.<br>
<br>
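On this box the closest thing I can set up is smartd's mail
notifications; something along these lines in /etc/smartd.conf (the
self-test schedule and address below are just placeholders I'd adjust):<br>
<br>
<font face="Courier New, Courier, monospace"># watch all attributes on every detected drive, run short/long self-tests, mail on problems<br>
# schedule and address are placeholders<br>
DEVICESCAN -a -o on -S on -s (S/../.././02|L/../../6/03) -m admin@example.com<br>
# then on Fedora: chkconfig smartd on ; service smartd start</font><br>
<br>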
As a stabilizing measure I've disabled the swap space, on the theory
that the lockups were caused by failed reads/writes to swap. I have yet
to let the server run for a while and see whether that helped.<br>
<br>
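For the record, disabling it amounted to something like this (I don't
have the exact LV name in front of me, hence the lvdisplay check):<br>
<br>
<font face="Courier New, Courier, monospace">swapoff -a # stop using all active swap right now<br>
lvdisplay | grep -i swap # confirm which logical volume held the swap space<br>
# then comment out the swap line in /etc/fstab so it stays off after a reboot</font><br>
<br>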
However, I've been doing a lot of reading today on how md and LVM
handle bad blocks, and I'm really shocked. I found <a
href="http://linas.org/linux/raid.html">this article</a> (which may be
outdated) claiming that md relies heavily on the disk's firmware to
handle these problems, and that when rebuilding an array there are no
"common sense" integrity checks to ensure the right data gets
reincorporated into the healthy array. Then I read article after
article about drives silently corrupting data. It's turned my stomach.
Btrfs isn't ready for this, even though RAID5 support was recently
added, and I don't see it becoming a production-stable filesystem until
2011 at the earliest.<br>
<br>
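From what I can tell, the most md itself offers is a manual consistency
check; assuming one of the arrays is /dev/md0, something like:<br>
<br>
<font face="Courier New, Courier, monospace">echo check > /sys/block/md0/md/sync_action # read every stripe and compare copies/parity<br>
cat /proc/mdstat # watch the check progress<br>
cat /sys/block/md0/md/mismatch_cnt # nonzero afterwards means inconsistent stripes were found</font><br>
<br>
And as I understand it, writing "repair" instead of "check" just makes
the copies agree again; it can't tell which copy was the correct one,
which is exactly what worries me.<br>
<br>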
Am I totally wrong to suspect bad blocks as the cause of the lock-ups?
(syslog records nothing)<br>
Can md RAID be trusted with flaky drives?<br>
If it's the drives, then other than installing OpenSolaris and ZFS, how
do I make this server reliable?<br>
Any experiences with defeating mysterious lock-ups?<br>
<br>
Thanks!<br>
<br>
------------------------------SMART Data-----------------------------<br>
<font face="Courier New, Courier, monospace">[root@victory3 ~]# for
letter in a b c d e f g h ; do echo /dev/sd$letter; smartctl --all
/dev/sd$letter |grep Reallocated_Sector_Ct; done<br>
/dev/sda<br>
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail
Always - 8<br>
/dev/sdb<br>
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail
Always - 1<br>
/dev/sdc<br>
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail
Always - 0<br>
/dev/sdd<br>
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail
Always - 0<br>
/dev/sde<br>
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail
Always - 1<br>
/dev/sdf<br>
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail
Always - 0<br>
/dev/sdg<br>
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail
Always - 1<br>
/dev/sdh<br>
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail
Always - 0</font><br>
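<br>
I'll probably start tracking the pending/uncorrectable sector counters
too, since those point more directly at sectors the drives currently
can't read (attribute names as smartctl reports them on these drives;
adjust if yours differ):<br>
<br>
<font face="Courier New, Courier, monospace">for letter in a b c d e f g h ; do echo /dev/sd$letter; smartctl --all /dev/sd$letter | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'; done</font><br>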
</body>
</html>