[ale] XFS on Linux - Is it ready for prime time?

Greg Freemyer greg.freemyer at gmail.com
Fri Apr 23 18:40:05 EDT 2010


I decided to double-check the one-error-per-100-trillion-bytes rate I used below.

I randomly chose the WD20EARS, which is one of the WD Green
drives.  The spec sheet is at:

http://www.wdc.com/wdproducts/library/SpecSheet/ENG/2879-701229.pdf

Non-recoverable read errors per bits read      ------       <1 in 10^15

Note that's bits, not bytes, so that works out to about 125 * 10^12 bytes.

So the figure I used was about 20% more pessimistic than the real
spec.  Either way, it's still unacceptable.

Ugh...  I got curious and checked their "Black" line: <1 in 10^14
bits, so those drives are 8 times worse than the rate in my analysis.
That's just scary.
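
To sanity-check the unit conversion, here is the arithmetic as a tiny
Python sketch (the 10^15 and 10^14 figures are the per-bit rates quoted
above; everything else is plain unit conversion):

  # Convert "less than 1 non-recoverable error per N bits read" into bytes.
  green_bits_per_error = 10**15   # WD20EARS (Green) spec: <1 in 10^15 bits
  black_bits_per_error = 10**14   # WD Black spec:         <1 in 10^14 bits

  green_bytes_per_error = green_bits_per_error // 8   # 125 * 10^12 bytes
  black_bytes_per_error = black_bits_per_error // 8   # 12.5 * 10^12 bytes

  print(green_bytes_per_error)   # 125000000000000
  print(black_bytes_per_error)   # 12500000000000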

Greg

On Fri, Apr 23, 2010 at 6:27 PM, Greg Freemyer <greg.freemyer at gmail.com> wrote:
> Raid 5 with 10GB drives is relatively reliable.  Raid 5 with 2 TB
> disks is not acceptable to me.
>
> The reason is that the non-recoverable error rate has not changed much.
>
> Let's say it's one error per 100 trillion bytes, i.e. 100 * 10^12
> bytes (that's at least the right order of magnitude, I'm pretty sure).
>
> Sounds very impressive, and it used to be.  With a 10 GB drive, the
> odds of a non-recoverable error turning up somewhere on the drive are
> about 1 in 10,000 (i.e. 100 trillion / 10 billion = 10,000).
>
> So if you have a 4-disk raid-5 and one drive fails, there is only
> about a 3 in 10,000 chance of hitting a non-recoverable error across
> the other 3 disks combined.
>
> When that happens, mdraid fails the rebuild.  That is, when
> rebuilding a raid-5 onto a newly added drive, every sector of the
> remaining drives has to read back cleanly; one unreadable sector and
> the rebuild fails.
>
> But fast forward to 2 TB drives.  Now the odds of a non-recoverable
> error are one in 50.  (ie. 100 Trillion / 2 Trillion is only 50).
>
> So for any given 2TB drive, there is a one in 50 chance that somewhere
> on that drive is non-readable data.
>
> Now build that same 4-disk raid-5.  Given one drive failure, there is
> now a 3 in 50 chance of the rebuild failing!
>
> That's about 1 in 17.  I find that far too risky, and will not even
> consider building raid-5 arrays from that class of drive.
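>
> For the curious, here is the same arithmetic as a small Python sketch.
> The 100-trillion-byte figure is the round number assumed above, and
> the per-drive probabilities are simply summed, which is a reasonable
> approximation for small probabilities:
>
>   # Rough odds of hitting a non-recoverable read error while rebuilding
>   # a degraded 4-disk raid-5, i.e. while reading the 3 surviving drives
>   # end to end.  Assumes one error per 100 * 10^12 bytes read.
>   BYTES_PER_ERROR = 100 * 10**12
>
>   def rebuild_failure_odds(drive_bytes, surviving_drives=3):
>       p_per_drive = float(drive_bytes) / BYTES_PER_ERROR
>       return surviving_drives * p_per_drive
>
>   print(rebuild_failure_odds(10 * 10**9))   # 10 GB drives: 0.0003 (3 in 10,000)
>   print(rebuild_failure_odds(2 * 10**12))   # 2 TB drives:  0.06   (3 in 50)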
>
> Solutions:
>
> 1) Use a more reliable raid solution.  Raid-6 is the lowest-overhead
> option I would consider.
>
> 2) The new 4K-physical-sector drives that are starting to ship are
> supposed to improve reliability, so the error rate may improve well
> beyond one per 100 trillion bytes.  But when I checked a couple of
> specs in February, they were still in the one-per-100-trillion-bytes
> range.
>
> I hope that improves as these new drives prove themselves.  If the
> rate jumps from 100 * 10^12 to 100 * 10^15 bytes per error, raid-5
> will, in my mind, be viable again.
>
> 3) Aggressive use of proactive raid-5 scans ("scrubs") to identify
> bad sectors and rewrite them from the other drives before a drive
> fails.  Sounds smart, but I don't know whether the quoted error rates
> assume you are checking the drives routinely, or whether they assume
> stable data that hasn't been written or read in a long time.  Since I
> don't know, my assumption is that the error rate applies even if you
> are routinely scanning for bad sectors.
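>
> For what it's worth, Linux md already exposes a scrub trigger through
> sysfs, so this kind of proactive scan can be run from cron.  A minimal
> sketch (run as root; /dev/md0 is just an example array name):
>
>   # Start a background consistency check ("scrub") of an md array so
>   # that bad sectors are found and rewritten from parity before a
>   # drive actually fails.
>   with open("/sys/block/md0/md/sync_action", "w") as f:
>       f.write("check")
>
>   # Progress shows up in /proc/mdstat, and any mismatches found are
>   # counted in /sys/block/md0/md/mismatch_cnt.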
>
> FYI: the above calculations were prompted by seeing so many
> real-world raid-5s fail to rebuild that I wanted to see if I could
> understand what was going on.
>
> Okay, torture is over.  Resume your normal lives.
> Greg
>
> On Thu, Apr 22, 2010 at 9:13 PM, Doug McNash <dmcnash at charter.net> wrote:
>>
>> So, you think we shouldn't be using Raid5, huh.  I asked why Raid5 and they said they wanted the reliability.  XFS was chosen because of its reputation for handling large (140 MB) video files and the associated metadata.  And yes, it is NFS.
>>
>> The big remaining problem is that periodically the system stalls and may take a few seconds to send back the write acks. At that point the writer assumes the file is not being written and starts tossing data.
>>
>> Is this stalling in the nature of Raid5?  Of XFS?  And would it be improved by different choices?
>> --
>> doug mcnash
>>
>> ---- scott mcbrien <smcbrien at gmail.com> wrote:
>>> A lot of the XFS guys from SGI now work for RH, but that's an
>>> aside.  More to the point: how many machines are simultaneously
>>> accessing the NAS?  If it's designed for a single system to access,
>>> then just use something like ext3.  If you need multiple
>>> simultaneous accesses with file locking, to avoid the problems that
>>> occur when multiple machines open and close the same files, you
>>> might try GFS.  A couple of other things:
>>>
>>> 1) You probably don't want to use RAID 5.  RAID 5 has throughput
>>> issues, especially with large stripe units and/or small file
>>> changes.  I know the alternatives aren't that attractive, the most
>>> likely being RAID 10 or RAID 0+1, because they require 2x as many
>>> disks.  But for the additional expense, you'll get much more
>>> throughput.
>>>
>>> 2) One might want a different filesystem depending on the data
>>> stored on it.  reiserfs, for example, is really good at storing
>>> copious amounts of small files, GFS is good for multiple-machine
>>> access, and ext3 is just solid for ordinary user and process access
>>> on a single machine.
>>>
>>> 3) RAID 5 isn't high performance.
>>>
>>> 4) I'm guessing they're sharing the filesystem via NFS; you might
>>> want to make sure the NFS server is properly tuned and that the
>>> clients aren't doing anything insane that could corrupt your data.
>>>
>>> 5) You really need to move off of RAID 5.
>>>
>>> -Scott
>>>
>>> On Wed, Apr 21, 2010 at 10:15 PM, Jim Kinney <jim.kinney at gmail.com> wrote:
>>> > How odd.  I started using XFS before it was a native thing in Red Hat
>>> > (pre-RHEL, pre-ext3 days).  It always seemed solid and reliable.  It was
>>> > provided by SGI (the Linux port was done by SGI as well) and it had a
>>> > solid track record as a filesystem suited to huge amounts of data (moving
>>> > video files was a common use).  It worked on all of my systems with every
>>> > RAID setup I threw at it.  It was imperative to install the XFS tools to
>>> > work with it, but it sounds like you already have them.  If xfs_check is
>>> > dying due to RAM issues, I would be more suspicious of bad hard drives
>>> > than of the XFS code.  If there have been a ton of write/delete/write
>>> > cycles on the drives then the journal may be corrupted.  I'm not sure how
>>> > to fix that.
>>> >
>>> > On Wed, Apr 21, 2010 at 9:34 PM, Doug McNash <dmcnash at charter.net> wrote:
>>> >>
>>> >> I'm consulting at a company that wants to turn their Linux-based NAS into
>>> >> a reliable product.  They initially chose XFS because they were under the
>>> >> impression that it was high performance, but what they got was something of
>>> >> questionable reliability.  I have identified and patched several serious
>>> >> bugs (2.6.29), and I have a feeling there are more unidentified ones out
>>> >> there.  Furthermore, xfs_check runs out of memory every time, so we have to
>>> >> do an xfs_repair at boot, and it takes forever.  But today we got into a
>>> >> situation where xfs_repair can't repair the disk (a raid5 array, btw).
>>> >>
>>> >> Does anyone out there use XFS?  Any suggestions for a stable
>>> >> replacement?
>>> >> --
>>> >> doug mcnash
>>> >
>>> >
>>> >
>>> > --
>>> > --
>>> > James P. Kinney III
>>> > Actively in pursuit of Life, Liberty and Happiness
>>> >
>>> >
>>>
>>
>>
>>
>



-- 
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
CNN/TruTV Aired Forensic Imaging Demo -
   http://insession.blogs.cnn.com/2010/03/23/how-computer-evidence-gets-retrieved/

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com


