[ale] Bad SATA interactions

Sun Nov 4 19:23:14 EST 2012

On 11/04/2012 05:56 PM, David Tomaschik wrote:
> Not sure how a filesystem-level checksum would help with corruption on
> the wire, other than to prevent reading back bad blocks.  During the
> write, you're pretty much trusting what's there, unless you want to
> read back the data and verify the checksum immediately, in which case
> you're talking about a seek immediately after each block write.  Good
> luck with "performance" on that.  As you pointed out, the UDMA CRC was
> catching this problem.  Do you think any data was corrupted due to
> this bad cable?

Know so.

Yesterday, I ran a series of tests.  Even decompressed the data from its 
origin drive, which worked.  See, at first I thought maybe it was a 
software problem.  So I compiled gzip statically on an Ubuntu system 
that could decompress the original data set, scp'd that to my box here, 
and it still had the error.  Okay, software problems/bugs are now 
eliminated.

Next thing I figured was, since gzip doesn't use LOTS of memory, it 
might have just had the misfortune of landing on bad RAM every time it 
loaded, so I ran a memtest.  Nope, nada.

During the copy TO my internal drive, the internal drive found and 
flagged errors internally, but never returned an error status to the 
operating system.  WTF is the point of that behavior?  It just chugged 
right along.  So at this point, I'm thinking that I have a bad drive. 
(At this point, I hadn't checked SMART yet, either, because I was 
operating under the erroneous assumption that all modern distros do so 
for you.)  But since I had no other conclusions, I thought I would check 
it manually.  Went to run smartctl and... got a command not found error 
message.

Well, that explained a fair bit!

So I installed smartmontools and ran it, and its error log was full (5
entries is all it holds), so I ran a full self-test and went to bed.
The self-test and surface scan both came back perfectly fine.  So, I
concluded then that it must be the cable.
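
A rough sketch of that manual check, for reference.  It assumes
smartmontools is installed and that the drive reports the usual
attribute 199, UDMA_CRC_Error_Count (naming can vary by vendor).
/dev/sdX is a placeholder and crc_count is a made-up helper name, not
something smartctl provides.

```shell
#!/bin/sh
# Hypothetical helper: pull the raw value of UDMA_CRC_Error_Count out of
# `smartctl -A` output supplied on stdin.  In that table the attribute
# name is column 2 and the raw value is column 10.
crc_count() {
    awk '$2 == "UDMA_CRC_Error_Count" { print $10 }'
}

# Typical use (needs root; /dev/sdX is a placeholder device):
#   smartctl -A /dev/sdX | crc_count    # raw CRC error count
#   smartctl -l error /dev/sdX          # the drive's error log
#   smartctl -t long /dev/sdX           # start the extended self-test
#   smartctl -l selftest /dev/sdX       # read the result once it is done
```

If the count keeps climbing between runs while the self-test passes,
that points at the link (cable, connector, port) rather than the
platters.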

Swapped the cable, and the UDMA error count stopped increasing, two 
short of what the drive firmware considers "dying".  Heh.

At that point I tried decompressing the data, and still had the same 
problem.

Solution?

# touch *
# rsync --inplace --no-whole-file -av /path/to/orig /path/to/corrupt

... which corrected all the errors and then I was finally able to 
decompress.
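
Presumably the touch is there because rsync's default quick check skips
any file whose size and mtime already match, and a silently corrupted
copy can match on both.  Changing the timestamps forces rsync to look at
the contents; --no-whole-file turns on the delta algorithm even for a
local copy, and --inplace rewrites only the differing blocks in the
existing file.  A rough equivalent, assuming both trees are reachable
locally, lets --checksum stand in for the touch.  repair_copy and its
arguments are made-up names for illustration.

```shell
#!/bin/sh
# Hypothetical helper: repair_copy ORIG_DIR COPY_DIR
# Re-syncs COPY_DIR from ORIG_DIR, comparing file contents instead of
# relying on size and mtime.
repair_copy() {
    # --checksum       decide by content checksum, not size+mtime
    # --inplace        update the existing destination file in place
    # --no-whole-file  use the delta algorithm even on local paths
    # Trailing slashes sync the contents of ORIG_DIR into COPY_DIR.
    rsync -av --checksum --inplace --no-whole-file "$1"/ "$2"/
}

# Usage (paths are placeholders):
#   repair_copy /path/to/orig /path/to/corrupt
```

Note that --checksum reads every file on both sides in full, so on a
large data set it is slow, but nothing corrupted slips through on
matching timestamps.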

I would have decompressed it to my drive to work around the problem,
except that would have just created new errors.  ;-)

Really, there are two things that would have made this better: (a) the 
drive should have reported error status back to the operating system 
during the write in which it detected the error, because then I would 
have known IMMEDIATELY that something was wrong.  (b) When reading it 
back, a checksum would have said "hey your data is corrupt" instead of 
the drive saying "all good" and gzip going "format violated".

I know that checksums wouldn't help at write time, but they would sure 
clarify the errors at read time.  I'm still confused, though, as to why 
the drive didn't yell loudly.  Why didn't I get an I/O error abort if 
the drive bloody well knew that it got corrupted data?
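
Short of a checksumming filesystem, the read-time check can be
approximated by hand.  The sketch below assumes sha256sum from coreutils
is available; make_manifest and verify_manifest are made-up helper
names.  It does nothing for the write itself, but it turns "format
violated" into a clear list of which files no longer match.

```shell
#!/bin/sh
# Hypothetical helper: make_manifest DIR
# Prints a SHA-256 line for every regular file under DIR, with paths
# relative to DIR so the manifest can be checked against another tree.
make_manifest() {
    (cd "$1" && find . -type f -exec sha256sum {} +)
}

# Hypothetical helper: verify_manifest DIR < MANIFEST
# Re-hashes the files under DIR and compares them with the manifest on
# stdin.  Returns non-zero if any file is missing or differs.
verify_manifest() {
    (cd "$1" && sha256sum -c -)
}

# Usage (paths are placeholders):
#   make_manifest /path/to/orig > /tmp/orig.sha256
#   verify_manifest /path/to/copy < /tmp/orig.sha256
```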

	--- Mike

-- 
A man who reasons deliberately, manages it better after studying Logic
than he could before, if he is sincere about it and has common sense.
                                    --- Carveth Read, “Logic”