[ale] HDD failure modes, why your drive might NOT read or write your data

Ron Frazier (ALE) atllinuxenthinfo at techstarship.com
Fri Dec 28 14:22:28 EST 2012


Hi guys,

I have a recurring interest in the reliability of HDD's for those that I 
own as well as some I maintain for family members.  I recently had to 
replace two 1 TB drives that started throwing reallocated sector errors 
at about the same time after 3 years of operation.  It was only a 
coincidence that I was doing my biannual extensive hard drive 
maintenance at the same time they started throwing a tantrum.  I do 
believe in regular backups, but it is very hard to keep all my hard 
drives backed up at all times.  I do have online backups every 6 hours 
for new data on most machines, but not for data that takes huge amounts 
of space.  I'm interested to know what you guys do on a personal level 
to monitor your hard drives health and, perhaps, preemptively replace 
them when they are starting to fail.  I have implemented a utility on 
the PC where those two 1 TB drives started to fail which will monitor 
reallocated sector counts and a couple of other things.  That utility 
must run in administrative mode (in Windows) however, so I can't run it 
on Dad's machine since he always runs with a standard user login.

I recently read an analogy of how precision tolerances affect a modern
HDD.  Imagine the platter is 3 miles wide.  Each track would be .4"
wide.  The read/write head would be a go-kart flying above the platter
at a height equal to the width of a human hair.  And, the platter would
be spinning at 3.6 million MPH.  Obviously, this is just an analogy, but
it's amazing they work at all.

I've discovered some interesting data about how parts of the drive that 
were writable can become unwritable later and how data that was written 
correctly can become unreadable later.  These are called latent defects, 
or sometimes grown defects.  They are not discovered until a read or 
write error occurs, at which time the data may be unrecoverable.

Here's a link to a cool article.

http://entertainmentstorage.org/articles/Hard%20Disk%20Drives_%20The%20Good,%20The%20Bad%20and%20The%20Ugly.pdf

It's a bit dated, but has good info.

Note that the head fly height is 0.3 microinches or less.  If my
conversions are right (0.3 microinch x 25.4 nm per microinch), this
equates to 0.0076 microns, or 7.6 nm, or 76 angstroms.  No matter how
you say it, it's a VERY small space.

Note also that the drive is made of many dissimilar metals, which are 
machined, and which have dissimilar hardnesses and thermal expansion 
coefficients.

So, if there are any particles or aberrations larger than about 7 nm,
they can get trapped under the read/write head.  The article points out
that removing all particles this small is very difficult.

So, how could a defect appear or grow?

The article lists several ways.

1) Any vibration, such as bumping the unit, walking across the floor,
or even sound, can alter the head position just enough to cause
transient errors during reading or writing.  As we've discussed in a
previous thread, the OS or drive controller doesn't generally do a
read-after-write verify.  So, your software may be happily humming along
writing data and not even know that it didn't get written properly.

For some really interesting reading, google "don't shout at your hard
drive".  That will turn up some interesting new research on how even
sound can screw up hard drives.

2) The head's fly height can be raised by the accumulation of lubricants.

3) from the article: "Media imperfections such as voids (pits), 
scratches, hydrocarbon contamination (various oils), and smeared soft 
particles can not only cause errors during writing, but also corrupt 
data after it has been written."

The types of problems listed in 3) can occur after the drive has been
put into service.  If a particle that is softer than the platter's media
coating gets trapped by the head, it can get smeared along the surface.
If a particle that is harder than the surface gets trapped, it can
scratch or gouge the surface.  This potentially ruins data that is
already there or prevents future writes in that area.

4) from the article: "Data can become corrupted any time the disks are 
spinning, even when data is not being written to or read from the disk. 
Common causes for erasure include thermal asperities, corrosion, and 
scratches or smears."

"Thermal asperities are instances of high heat for a short duration 
caused by head-disk contact. This is usually the result of heads hitting 
small "bumps" created by particles that remain embedded in the media 
surface even after burnishing and polishing. The heat generated on a 
single contact can be high enough to erase data.  Even if not on the 
first contact, cumulative effects of numerous contacts may be sufficient 
to thermally erase data or mechanically destroy the media coatings and 
erase data."

5) from the article: "Another problem associated with PMR [perpendicular 
magnetic recording] is side-track erasure. Changing the direction of the 
magnetic grains also changes the direction of the magnetic fields. PMR 
has a return field that is close to the adjacent tracks and can 
potentially erase data in those tracks. In general, the track spacing is 
wide enough to mitigate this mechanism, but if a particular track is 
written repeatedly, the probability of side-track erasure increases."

6) From a prior thread here, someone mentioned impending spindle bearing
failure as a cause of poor or anomalous head/track alignment.

So, latent defects can occur because of vibrations, lubricant
accumulation, fly height changes, pits, scratches, smears, thermal
asperities, side-track erasure, corrosion, and spindle bearings.  And,
that assumes that all the major parts of the drive are working normally.

So, two things are apparent: A) you cannot always assume that your data
was written properly, and B) you cannot always assume that data that was
written properly can be read back properly later.
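
One way I can imagine guarding against both problems on a personal
machine is to keep checksums of the files and re-verify them
periodically.  Here's a minimal sketch in Python; the manifest file name
and the idea of a JSON manifest are just my own placeholder choices:

import hashlib
import json
import os

MANIFEST = "checksums.json"  # hypothetical manifest file name

def sha256_of(path):
    """Hash a file in chunks so large files don't exhaust RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root):
    """Record a checksum for every file under root."""
    sums = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            sums[path] = sha256_of(path)
    with open(MANIFEST, "w") as f:
        json.dump(sums, f, indent=2)

def verify_manifest():
    """Re-read every file and flag read errors or silent corruption."""
    with open(MANIFEST) as f:
        sums = json.load(f)
    for path, expected in sums.items():
        try:
            actual = sha256_of(path)
        except OSError as e:  # e.g., an unreadable sector
            print("READ ERROR:", path, e)
            continue
        if actual != expected:
            print("CHECKSUM MISMATCH:", path)

Running the verify pass during my twice-yearly maintenance would catch
both unreadable sectors (the read error) and silently flipped bits (the
mismatch).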

So, my questions are these:

What do you do, on your personal equipment, where you may have fewer
resources than at work, to monitor for drive errors and catch them
before they become catastrophic?

What do you do at work?

What could be done better?

Is there a way to force the OS, either Windows or Linux, to do verifies 
after each write operation?
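
On Linux, I don't know of a mount option that forces a verify on every
write, but an application can approximate one: write, fsync, ask the
kernel to drop its cached pages, then read the data back and compare.
Here's a rough sketch in Python (3.3+ for posix_fadvise, Linux only).
Note the caveat in the comments: this defeats the OS page cache, but
the drive's own cache may still satisfy the read.

import os

def verified_write(path, data):
    """Write data, flush to the device, then read it back and compare."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # push through the OS buffers to the drive
    finally:
        os.close(fd)

    fd = os.open(path, os.O_RDONLY)
    try:
        # Ask the kernel to forget its cached copy so the read-back
        # actually goes to the device (the drive's internal cache may
        # still answer it, though).
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        readback = os.read(fd, len(data) + 1)
    finally:
        os.close(fd)

    if readback != data:
        raise IOError("read-after-write verify failed for %s" % path)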

I believe, now more than ever, that doing a full read/write surface
analysis on a drive a couple of times a year is a good idea, and then
rewriting the data back once any NEW latent defects have been identified
by the drive's controller.
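
For the read half of that, tools like badblocks (in read-only mode) or
a SMART long self-test (smartctl -t long) already exist, but the idea
is simple enough to sketch: read the raw device end to end and log
which regions error out.  A rough Python version, assuming a Linux
block device node and root privileges; /dev/sdX is a placeholder:

import os

CHUNK = 1024 * 1024  # read 1 MB at a time

def scan_device(dev="/dev/sdX"):  # placeholder device node
    """Sequentially read a whole block device, logging bad regions."""
    fd = os.open(dev, os.O_RDONLY)
    offset = 0
    try:
        while True:
            try:
                data = os.read(fd, CHUNK)
            except OSError as e:
                print("read error at byte %d: %s" % (offset, e))
                # skip past the bad region and keep going
                offset += CHUNK
                os.lseek(fd, offset, os.SEEK_SET)
                continue
            if not data:
                break  # end of device
            offset += len(data)
    finally:
        os.close(fd)
    print("scanned %d bytes" % offset)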

The utility I found for Windows monitors reallocated sectors, pending
sectors, and uncorrectable sectors (SMART attributes 05, C5, and C6), as
well as temperature, and sends me an email if they get high.  Are those
good indicators of pending failure?
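
Those are the same attributes the smartmontools package watches on
Linux.  For anyone wanting to roll their own check, here's a minimal
sketch that shells out to smartctl (it must be installed and usually
needs root) and flags nonzero raw values for attributes 5, 197 (C5),
and 198 (C6).  The zero threshold is my own choice, not an official
one:

import subprocess

# SMART attribute IDs: 5 = Reallocated_Sector_Ct,
# 197 (0xC5) = Current_Pending_Sector,
# 198 (0xC6) = Offline_Uncorrectable
WATCHED = {"5", "197", "198"}

def check_smart(dev="/dev/sda"):
    out = subprocess.check_output(
        ["smartctl", "-A", dev], universal_newlines=True)
    for line in out.splitlines():
        fields = line.split()
        # attribute table rows start with the numeric attribute ID;
        # the raw value is the last column
        if fields and fields[0] in WATCHED:
            raw = fields[-1]
            if raw.isdigit() and int(raw) > 0:
                print("WARNING: %s attribute %s raw value is %s"
                      % (dev, fields[0], raw))

check_smart()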

Would there be anything else I should do to detect failures before 
they're serious enough to compromise my data?

Sincerely,

Ron


-- 

(To whom it may concern.  My email address has changed.  Replying to former
messages prior to 03/31/12 with my personal address will go to the wrong
address.  Please send all personal correspondence to the new address.)

(PS - If you email me and don't get a quick response, you might want to
call on the phone.  I get about 300 emails per day from alternate energy
mailing lists and such.  I don't always see new email messages very quickly.)

Ron Frazier
770-205-9422 (O)   Leave a message.
linuxdude AT techstarship.com


