[ale] new (to me) raid 4/5 failure mode

Mon Aug 24 21:14:45 EDT 2009

Greg Freemyer wrote:
> If I have in c-code:  write(); fsync();
> 
> That clearly takes some number of milliseconds.  This failure mode
> requires an unexpected failure during that time period.  fsync may
> shorten the window.  It is not capable of eliminating it.
> 

Your code will block on the fsync until the data is written to the
platter.  Databases and journaled file systems are able to use that
feature to guarantee that your data (or at least metadata, depending on
how your file system journals) is in a consistent state.
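
For concreteness, here is a minimal sketch of that blocking behavior in
C.  The helper name write_durably is made up for illustration, and the
whole thing assumes the drive and controller actually honor the flush
request rather than acknowledging it from cache.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Write len bytes from buf to fd, then block until the kernel reports
 * that they have reached stable storage.  Returns 0 on success, -1 on
 * error (errno is set by the failing call).
 * write_durably is a hypothetical helper name, not a library call. */
int write_durably(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before writing anything; retry */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* fsync() does not return until the device reports the transfer
     * complete; a failure here means the data may not be on disk. */
    return fsync(fd) == 0 ? 0 : -1;
}
```

The caller only learns that the data is safe when this returns 0.  Up to
that point it has to assume the write may be lost.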

> agreed, but again, fsync takes real clock time.  It is during that
> time period you are vulnerable.

You are only as vulnerable as your application and filesystem allow you
to be.

>> Now, for data that has not been fsync'ed it is an entirely different story.
> 
> Every piece of data is not fsynced for a minimum time period.  If it
> is 5 milliseconds, then that is your window.

If data isn't fsynced there is never a guarantee that the data made it
to disk.  If your application isn't using fsync to make atomic changes
to its data then you are always at risk.
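
One common way for an application to make such an atomic change is to
write a temporary file, fsync it, and rename it over the original.  The
sketch below is only an illustration: replace_file_atomically and the
tmp_path argument are hypothetical names, and it assumes both paths are
on the same file system so that rename() replaces the target in one
step.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Replace the file at path with len bytes from buf so that, after a
 * crash, a reader finds either the complete old contents or the
 * complete new contents.  tmp_path is a scratch name chosen by the
 * caller (hypothetical; it must be in the same directory as path). */
int replace_file_atomically(const char *path, const char *tmp_path,
                            const void *buf, size_t len)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0)
            goto fail;      /* treat any error or zero-length write as failure */
        p += n;
        len -= (size_t)n;
    }

    /* The new contents must be on disk before the name is switched. */
    if (fsync(fd) != 0)
        goto fail;
    if (close(fd) != 0) {
        unlink(tmp_path);
        return -1;
    }
    if (rename(tmp_path, path) != 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;

fail:
    close(fd);
    unlink(tmp_path);
    return -1;
}
```

To make the rename itself durable you would also fsync the containing
directory; that step is left out here to keep the sketch short.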

>> I do.  I fully expect my disks and controllers to honor any fsync calls.
>>  You can't have any true atomicity without a fully working fsync.
> 
> fsync is NOT atomic.  With 2 real physical disks you do not have the
> ability to synchronize  the writes such that parity and data are
> updated at exactly the same time.  There will always be a millisecond
> or two of vulnerability.

A call to fsync is not supposed to return to the application until the
data is physically on the platters.  You need more than fsync to
maintain atomicity.  A database would use strategic fsync calls to flush
a log and then flush a commit, which would guarantee atomicity for the
database write.
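
Roughly, the log-then-commit sequence looks like the sketch below.  It
is not how any particular database does it; log_and_commit and the
one-byte COMMIT_MARK are invented for illustration, and a real log
would also carry lengths and checksums.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <unistd.h>

/* Hypothetical one-byte marker that makes a logged record count as
 * committed. */
#define COMMIT_MARK 'C'

/* Append a record to an already-open log file, then commit it.
 * Step 1: write the record and fsync, so the record is on disk.
 * Step 2: write the commit marker and fsync, so the commit is on disk.
 * A crash before step 2 completes leaves a record without a marker,
 * which recovery can simply discard.
 * Returns 0 on success, -1 on any error or short write. */
int log_and_commit(int log_fd, const void *rec, size_t len)
{
    if (write(log_fd, rec, len) != (ssize_t)len)
        return -1;
    if (fsync(log_fd) != 0)
        return -1;

    const char mark = COMMIT_MARK;
    if (write(log_fd, &mark, 1) != 1)
        return -1;
    if (fsync(log_fd) != 0)
        return -1;

    return 0;
}
```

The first fsync is what orders the two writes: the marker can never
reach the disk ahead of the record it commits.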

Even without an fsync from the application, your journaled file system
at the very least flushes its own journal to disk to make metadata
writes atomic.  Depending on the file system and settings you may even
be getting atomic data writes, without fsync.  Unfortunately, the
application can't know if its data is safe without an fsync.

> Exactly, but this failure mode is NOT like that.
> 
> This failure mode is causing data that is not part of the write
> process to be lost!
> ie. Assume I am writing to LBA n, and because of a power outage LBA
> n+64 is lost.  I don't think that can happen on a single drive setup.
> It is exactly what happens in the failure mode.

Modern drives certainly cache more than 64KB :).  You can potentially
lose anywhere from 8 to 32 MB of unwritten data during a power loss if
you have just a single drive.

> Since you talk about a video server below, with a normal failure mode,
> if you are recording "Lord of the Rings" when you lose power, then
> "Lord of the Rings" is corrupt and you have to re-record it.  We
> all expect that.
> 
> With this failure mode, you would not only lose Lord of the Rings, but
> Star Wars might be on the same stripe and get some of its data blocks
> corrupted.

It's a good example!  I don't know that I can come up with a single disk
failure mode that could ever cause precisely that to happen.  It would
just be completely unlikely with any file system I can think of.
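
To make the failure concrete, here is a toy model of it as I understand
it: a 3-disk RAID 5 stripe with two data blocks and XOR parity.  The
struct and function names are invented, and the 4-byte block size is
only there to keep the example readable.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK 4  /* toy block size in bytes */

/* One stripe: two data blocks and their XOR parity. */
struct stripe {
    uint8_t d0[BLOCK];      /* e.g. a block of "Lord of the Rings" */
    uint8_t d1[BLOCK];      /* e.g. a block of "Star Wars" */
    uint8_t parity[BLOCK];
};

/* Recompute parity from the current data blocks. */
void update_parity(struct stripe *s)
{
    for (size_t i = 0; i < BLOCK; i++)
        s->parity[i] = s->d0[i] ^ s->d1[i];
}

/* Rebuild d1 from d0 and parity, as the array would have to do after
 * losing the disk that holds d1. */
void rebuild_d1(const struct stripe *s, uint8_t out[BLOCK])
{
    for (size_t i = 0; i < BLOCK; i++)
        out[i] = s->d0[i] ^ s->parity[i];
}

/* Simulate the interrupted write: the new d0 reaches its disk, but
 * power fails before the matching parity update is written. */
void torn_write_d0(struct stripe *s, const uint8_t new_d0[BLOCK])
{
    memcpy(s->d0, new_d0, BLOCK);
    /* parity is deliberately left stale */
}
```

After torn_write_d0, d1 itself is untouched on its own disk, but
rebuild_d1 no longer returns it.  The block that was never being
written is the one that comes back wrong once the array has to
reconstruct it.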

Unfortunately, it can happen with any striped RAID of any kind.  You can
take some precautions to make it less likely, though.  When you create
an ext2/3/4 file system you can tell it your stripe geometry (the stride
and stripe_width extended options to mke2fs).  That will keep filesystem
metadata aligned at stripe boundaries, and I believe it will also make
an attempt to align files at stripe boundaries.
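
The arithmetic for those two values is simple enough to sketch.  The
function names below are made up, and the 64 KiB chunk, 4 KiB block,
4-disk RAID 5 figures in the note that follows are just an assumed
example, not a recommendation.

```c
#include <stdio.h>

/* RAID hints for ext2/3/4, in units of file system blocks:
 *   stride       = blocks per RAID chunk
 *   stripe_width = blocks per full stripe of data (parity excluded) */
struct ext_raid_hints {
    unsigned stride;
    unsigned stripe_width;
};

/* chunk_kib:  RAID chunk size in KiB
 * block_kib:  file system block size in KiB
 * data_disks: disks holding data in each stripe (N-1 for RAID 5)
 * Returns zeroed hints if block_kib is 0. */
struct ext_raid_hints compute_hints(unsigned chunk_kib, unsigned block_kib,
                                    unsigned data_disks)
{
    struct ext_raid_hints h = { 0, 0 };

    if (block_kib == 0)
        return h;
    h.stride = chunk_kib / block_kib;
    h.stripe_width = h.stride * data_disks;
    return h;
}

/* Format the value to pass after mke2fs -E.  Returns what snprintf
 * returns. */
int format_mke2fs_opts(char *buf, size_t n, struct ext_raid_hints h)
{
    return snprintf(buf, n, "stride=%u,stripe_width=%u",
                    h.stride, h.stripe_width);
}
```

For the assumed 4-disk RAID 5 with 64 KiB chunks and 4 KiB blocks,
compute_hints(64, 4, 3) gives stride=16 and stripe_width=48, which you
would pass to mke2fs as -E stride=16,stripe_width=48.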

As near as my brain can come up with, there is no way you could have a
failure like this with RAID-Z or RAID-Z2.  I am going to keep pushing
this Kool-Aid at you.  It's delicious :).

Pat
