[ale] Bad SATA interactions
mike at trausch.us
Sun Nov 4 19:23:14 EST 2012
On 11/04/2012 05:56 PM, David Tomaschik wrote:
> Not sure how a filesystem-level checksum would help with corruption on
> the wire, other than to prevent reading back bad blocks. During the
> write, you're pretty much trusting what's there, unless you want to
> read back the data and verify the checksum immediately, in which case
> you're talking about a seek immediately after each block write. Good
> luck with "performance" on that. As you pointed out, the UDMA CRC was
> catching this problem. Do you think any data was corrupted due to
> this bad cable?
Know so.
Yesterday, I ran a series of tests. Even decompressed the data from its
origin drive, which worked. See, at first I thought maybe it was a
software problem. So I compiled gzip statically on an Ubuntu system
that could decompress the original data set, scp'd that to my box here,
and it still had the error. Okay, software problems/bugs are now
eliminated.
Next thing I figured was, since gzip doesn't use LOTS of memory, it
might have just had the misfortune of landing on bad RAM every time it
loaded, so I ran a memtest. Nope, nada.
During the copy TO my internal drive, the internal drive found and
flagged errors internally, but never returned an error status to the
operating system. WTF is the point of that behavior? It just chugged
right along. So at this point, I'm thinking that I have a bad drive.
(At this point, I hadn't checked SMART yet, either, because I was
operating under the erroneous assumption that all modern distros do so
for you.) But since I had no other conclusions, I thought I would check
it manually. Went to run smartctl and... got a command not found error
message.
Well, that explained a fair bit!
So I installed that stuff and ran it, and its error log was full (5
entries is all it holds), so I ran a full self-test and went to bed.
The self-test and surface scan were both perfectly fine. So I concluded
that it must be the cable.
Swapped the cable, and the UDMA error count stopped increasing, two
short of what the drive firmware considers "dying". Heh.
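For anyone wanting to watch that counter themselves, here is a rough sketch, not the exact commands used above. It assumes smartmontools is installed and that the drive reports its interface CRC errors as SMART attribute 199, which is typical for ATA drives but not guaranteed. The function name crc_error_count is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -A` output on stdin and print the raw
# value of SMART attribute 199 (usually named UDMA_CRC_Error_Count; the name
# varies by vendor, so this matches on the ID instead). Assumes the usual
# ten-column attribute table, with the raw value in the last column.
# Returns non-zero if the attribute is not present.
crc_error_count() {
    awk '$1 == 199 { print $10; found = 1 } END { exit !found }'
}

# Usage (needs root and a real device; /dev/sdX is a placeholder):
#   smartctl -A /dev/sdX | crc_error_count
```

Run it before and after a large copy: if the number climbs, the link between the drive and the controller is suspect, and the cable is the cheap thing to swap first.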
At that point I tried decompressing the data, and still had the same
problem.
Solution?
# touch *
# rsync --inplace --no-whole-file -av /path/to/orig /path/to/corrupt
... which corrected all the errors and then I was finally able to
decompress.
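To confirm that a repair like this actually took, a byte-for-byte comparison of the two trees is one option. This is a minimal sketch, not something that was run above: verify_copy is a made-up helper name, and it assumes both directories have the same layout and that a POSIX shell with find and cmp is available.

```shell
#!/bin/sh
# Hypothetical helper: compare every regular file under $1 with the file at
# the same relative path under $2, byte for byte. Prints one line per file
# that differs or is missing, and returns non-zero if there were any.
# File names containing newlines are not handled by this sketch.
verify_copy() {
    orig=${1%/}
    copy=${2%/}
    bad=$(
        find "$orig" -type f | while IFS= read -r f; do
            rel=${f#"$orig"/}
            # cmp -s is silent; it fails on a difference or a missing file.
            cmp -s "$f" "$copy/$rel" || printf 'MISMATCH: %s\n' "$rel"
        done
    )
    if [ -n "$bad" ]; then
        printf '%s\n' "$bad"
        return 1
    fi
    return 0
}

# Usage (both paths are placeholders):
#   verify_copy /path/to/original/tree /path/to/copied/tree
```

This reads every byte on both sides, so it is slow on a large data set, but it does not rely on timestamps or sizes to decide what to check.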
I would have decompressed it to my drive to work around the problem,
except that would have just created new ones. ;-)
Really, there are two things that would have made this better: (a) the
drive should have reported an error status back to the operating system
during the write in which it detected the error, because then I would
have known IMMEDIATELY that something was wrong; and (b) when reading the
data back, a checksum would have said "hey, your data is corrupt" instead
of the drive saying "all good" and gzip going "format violated".
I know that checksums wouldn't help at write time, but they would sure
clarify the errors at read time. I'm still confused, though, as to why
the drive didn't yell loudly. Why didn't I get an I/O error abort if
the drive bloody well knew that it got corrupted data?
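Short of a checksumming filesystem, the read-time check can be approximated by hand. The sketch below assumes GNU coreutils sha256sum; make_manifest and check_manifest are made-up names. The manifest has to be generated from the known-good source and stored somewhere other than the suspect drive.

```shell
#!/bin/sh
# Hypothetical helper: print a SHA-256 manifest for every regular file
# under directory $1, with paths relative to that directory.
make_manifest() {
    ( cd "$1" && find . -type f -exec sha256sum {} + )
}

# Hypothetical helper: verify the files under directory $1 against the
# manifest file $2. Prints only the files that fail and returns non-zero
# if any file is corrupt or missing.
check_manifest() {
    ( cd "$1" && sha256sum --quiet -c - ) < "$2"
}

# Usage (all paths are placeholders):
#   make_manifest /path/to/source > /somewhere/safe/manifest.sha256
#   check_manifest /path/to/copy /somewhere/safe/manifest.sha256
```

As noted above, this does nothing at write time, but it turns a vague "format violated" from gzip into a list of exactly which files came back different from what was written.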
--- Mike
--
A man who reasons deliberately, manages it better after studying Logic
than he could before, if he is sincere about it and has common sense.
--- Carveth Read, “Logic”