[ale] Which large capacity drives are you having the best luck with?
Greg Freemyer
greg.freemyer at gmail.com
Wed Jan 5 19:23:10 EST 2011
See interspersed responses below.
But first, have you looked at the data SMART is tracking?
Try "smartctl -a /dev/sda" on your machine and get a feel for it if
you want to delve this deep.
The big issue is that SMART implementations definitely vary by
manufacturer, and I think by model and even firmware revision.
So understanding what these fields mean is very difficult. But if
you're a home user with a small number of drives to worry about, you
could record a full SMART data dump every 6 months or so and get a
feel for how the different fields are growing, etc.
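A minimal sketch of how you might do that, assuming smartmontools is
installed (the filename scheme is just a suggestion):

  # save a dated snapshot of the SMART data for later comparison
  sudo smartctl -a /dev/sda > smart-sda-$(date +%Y%m%d).txt

Run that twice a year, by hand or from cron, and diff the snapshots
to see which attributes are moving.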
=== fyi: on my desktop at work ===
> sudo /usr/sbin/smartctl -a /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (openSUSE RPM)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.10 family
Device Model: ST3250310AS
Serial Number: 9RY00PYW
Firmware Version: 3.AAA
User Capacity: 250,059,350,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Wed Jan 5 18:29:58 2011 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 430) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  92) minutes.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   100   006    Pre-fail  Always       -       127975369
  3 Spin_Up_Time            0x0003   098   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       65
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       330102083
  9 Power_On_Hours          0x0032   069   069   000    Old_age   Always       -       27871
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       65
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   068   051   045    Old_age   Always       -       32 (Lifetime Min/Max 21/36)
194 Temperature_Celsius     0x0022   032   049   000    Old_age   Always       -       32 (0 20 0 0)
195 Hardware_ECC_Recovered  0x001a   069   061   000    Old_age   Always       -       140010573
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
====
The first thing I look at is POH (Power On Hours); in this case,
27,871. In my experience this field is pretty reliable and means
exactly what it says. So my drive is not exactly new.
Then look at Reallocated_Sector_Ct. Mine is zero. That's cool.
But Hardware_ECC_Recovered is 140,010,573. That may sound large, but
remember, those reads succeeded because of the ECC data, so there was
no data loss. I tend to agree with you that as the magnetism fades
for a sector, checksum failures increase and ECC recovery is needed.
Spinrite used as you describe may keep that value lower.
But I don't think spinrite tries to detect sectors that have been ECC
recovered, so it doesn't really know the details.
A SMART long self-test has the ability to know that an ECC recovery
is needed for a sector. What it does with that knowledge, I don't
know. But it certainly has more knowledge to work with than spinrite.
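If you want to try that yourself, something like this should work
(the drive stays usable while the test runs in the background):

  # kick off a long (extended) self-test, then check the log later
  sudo smartctl -t long /dev/sda
  sudo smartctl -l selftest /dev/sda

The "Extended self-test routine recommended polling time" line in the
dump above (92 minutes on my drive) tells you roughly how long to
wait before reading the log.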
fyi: hdparm has a long-read capability that allows a full physical
sector to be read with no error correction! So spinrite could in
theory read all of the sectors with CRC verification disabled and
check the CRC itself. The trouble is that the drive manufacturers
implement proprietary CRC / ECC solutions, so spinrite has no way to
actually delve into the details of a sector's data accuracy.
On Wed, Jan 5, 2011 at 5:33 PM, Ron Frazier
<atllinuxenthinfo at c3energy.com> wrote:
> Hi Pat,
>
> We're getting a little above my level of knowledge on hard drive
> operation here, but here's my take on it. A modern drive is always
> generating read errors and relying on ECC to get you the right data.
Mine has 140 million ECC corrections in about 27,000 power-on hours.
That's roughly 5,000 per hour, or more than one per second!
I totally agree with your statement!
> It can be readable enough, with difficulty, without generating an error
> flag.
Agreed
> Therefore, they may not be flagged unless they get above a
> certain threshold.
Not flagged by whom? SMART is keeping track; spinrite / Linux is not.
> When Spinrite tries to read the data, if it has
> difficulty above a certain limit, I believe it relocates the data
> somewhere else.
AIUI - False. If the sector is ECC-correctable, spinrite has no idea.
If the sector is not ECC-correctable, the DRIVE marks the sector as
pending relocation. See the SMART Current_Pending_Sector field; for
each unreadable sector, it increases by 1.
When that sector is written (by Linux or spinrite), it is reallocated
to a spare sector, and the old sector is never used again.
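Both counts are easy to watch, assuming the standard attribute layout
shown in the dump above:

  sudo smartctl -A /dev/sda | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector'

A Current_Pending_Sector count that rises and never drains back to
zero is the number to worry about.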
fyi: hdparm has a way to force a write to a pending sector and put
new good data on it. Thus spinrite could do this too if it wanted to.
I certainly hope it is not doing so.
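For reference, a sketch of the hdparm approach (DESTRUCTIVE - it
overwrites the sector's contents, and LBA 12345 here is just a
made-up example):

  # force a low-level rewrite of one sector; the drive should
  # reallocate it to a spare if the original location is bad
  sudo hdparm --yes-i-know-what-i-am-doing --write-sector 12345 /dev/sda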
> This may or may not raise a flag in the smart system.
It does; see above.
> It also doesn't mean that the sector has been reallocated.
You imply a sector can be moved without it being reallocated. I think
that is wrong. The only way to move a sector is to allocate a spare
and use it instead of the original.
> The
> intensive analysis of Spinrite inverts the data and writes it back,
> reads it, inverts it again to its original state, then writes it back
> again.
That is nice because it should allow the drive to identify magnetic
"holes". When one is found, the drive itself is likely doing the
spare-sector allocation.
> This forces the drive's firmware to evaluate the performance at
> that point, and forces the surface to absorb both a 1 and 0 in turn at
> that point. Also, I believe that the magnetic fields deteriorate over
> time. I could probably corroborate that with some extensive research.
Agreed, but I often store hard drives offline for extended periods,
and we rarely see read failures on drives we put back online. So the
deterioration is very slow and not likely to be an issue.
fyi: The DOD uses thermite in the drive platter area to heat the media
to several hundred degrees. When this happens the magnetism is
released and the data is gone.
> Just anecdotally though, most of the files I've ever lost due to disk
> malfunctions seem to be things that were almost never accessed except
> rarely.
Somewhat logical. The drive's SMART function doesn't get exposed to
those sectors, so as a sector degrades or fails, it doesn't know
about it.
Especially with laptop drives, you get physical damage as the flying
head hits the platters from time to time. To protect the platters,
they are often coated with a fine layer of diamond dust. That's one
reason laptop drives cost more.
> The read invert write read invert write cycle, if nothing else,
> will ensure that all the magnetic bits are good and strong since they
> are all ultimately rewritten.
True, but I think normal degradation is much slower than you imply.
> There are basically 3 possibilities prior to running the diagnostic:
>
> 1) The drive has no serious (above threshold) errors either latent or
> obvious. - In this case, every magnetic domain on the surface will be
> refreshed, which is good, and will keep the read error rate as far below
> the threshold limit as possible. Also, the firmware will be forced to
> evaluate the performance of every bit and byte.
For a drive you've treated with spinrite, what's your ECC_Recovered /
POH ratio? i.e., mine is about 5,000 recoveries per power-on hour,
and I don't do anything to "maintain" it. This is just my desktop
machine.
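If you want to compute that ratio without doing the division by hand,
here's a rough sketch (it assumes the 10-column attribute layout
shown above, and remember that raw-value semantics vary by vendor):

  sudo smartctl -A /dev/sda | awk '
      /Hardware_ECC_Recovered/ { ecc = $10 }   # raw ECC recovery count
      /Power_On_Hours/         { poh = $10 }   # raw power-on hours
      END { if (poh > 0) printf "%.0f recoveries/hour\n", ecc / poh }'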
> 2) The drive has latent errors but no warnings. - There may be areas
> that are barely readable or that are not readable under normal
> circumstances (but have never been attempted). They will be read if
> possible after extensive effort, and will be relocated to a different
> part of the drive if needed. This may or may not cause a sector
> reallocation or generate any error messages. Again, the magnetic fields
> will be refreshed.
>
> 3) The drive has obvious errors and warnings. - In this case it is
> likely that some data is unreadable by conventional means. It is highly
> likely that Spinrite will recover the data and save it elsewhere on the
> drive, storing it in fresh strong magnetic domains.
I believe a SMART long self-test will read all of the sectors and
identify those that are not ECC-recoverable. I don't think it will
actually reallocate them.
What spinrite likely does is read the sector in various ways. For
example, many data recovery tools can read sectors in reverse order,
which I believe causes the drive to align the head slightly
differently. Due to that slight change, some bad sectors can be read.
So I actually do think spinrite could have some logic here that
normal read logic would not have.
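As one concrete example of the reverse-read trick, GNU ddrescue has a
reverse mode (the file names here are just placeholders):

  # copy what reads easily on a forward pass
  sudo ddrescue /dev/sda disk.img rescue.log
  # then attack the remaining bad areas in reverse, with retries
  sudo ddrescue -R -r 3 /dev/sda disk.img rescue.log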
> Again, this may or
> may not trigger sector reallocation.
I surely hope that writing to a sector that previously had read
failures not handleable via ECC recovery triggers a reallocation.
> Spinrite will report these data
> areas as recovered or unrecovered as appropriate. The drive itself may
> still be fully usable, if, for example, the data error was caused by a
> power failure, but the drive was not damaged. If sectors start getting
> reallocated, I would agree that it's time to consider changing the drive
> out, as I did with one of mine last night.
I'm not so sure I agree. A lot of reallocations are just physical
platter issues. It used to be that drives shipped new with lots of
reallocated sectors.
Admittedly, new ones tend to have zero these days.
> Regardless, Spinrite can
> often recover your data enough to boot the drive and get the data off
> before decommissioning the drive. The smart reporting screen of
> Spinrite is somewhat hard to read, and I don't know if it reports sector
> reallocation. I would use the Ubuntu Disk Utility or gsmartcontrol /
> smartctl as a final check to look for warning signs (now that I know
> about it) even if Spinrite is happy.
>
> I'm not suggesting that everyone has to use the product, just sharing
> some info that I feel might be helpful. I have found the product useful
> in the past. To each his own.
>
> Sincerely,
>
> Ron
Greg