<p dir="ltr">Doesn't it? Especially when nearly 200 of them occur. I would at least expect a small percentage to get back. </p>
<p dir="ltr">This board only does AHCI mode, so I am at a loss... Sad thing is that I have an early model BD-RE burner that doesn't work in AHCI mode (it locks up hard until power is removed and the device is talked to in "old style" mode), so I had to remove it. </p>
<p dir="ltr">If I knew enough to debug, I would. But I am completely unfamiliar with the workings of the bus, and don't have time to figure out the why, sadly, if the behavior is out of spec... </p>
<p dir="ltr">Isn't the world of technology grand? :-) </p>
<div class="gmail_quote">On Nov 4, 2012 8:15 PM, "David Tomaschik" <<a href="mailto:david@systemoverlord.com">david@systemoverlord.com</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
On Sun, Nov 4, 2012 at 4:23 PM, <a href="mailto:mike@trausch.us">mike@trausch.us</a> <<a href="mailto:mike@trausch.us">mike@trausch.us</a>> wrote:<br>
> On 11/04/2012 05:56 PM, David Tomaschik wrote:<br>
>><br>
>> Not sure how a filesystem-level checksum would help with corruption on<br>
>> the wire, other than to prevent reading back bad blocks. During the<br>
>> write, you're pretty much trusting what's there, unless you want to<br>
>> read back the data and verify the checksum immediately, in which case<br>
>> you're talking about a seek immediately after each block write. Good<br>
>> luck with "performance" on that. As you pointed out, the UDMA CRC was<br>
>> catching this problem. Do you think any data was corrupted due to<br>
>> this bad cable?<br>
><br>
><br>
> Know so.<br>
><br>
> Yesterday, I ran a series of tests. Even decompressed the data from its<br>
> origin drive, which worked. See, at first I thought maybe it was a software<br>
> problem. So I compiled gzip statically on an Ubuntu system that could<br>
> decompress the original data set, scp'd that to my box here, and it still<br>
> had the error. Okay, software problems/bugs are now eliminated.<br>
><br>
> Next thing I figured was, since gzip doesn't use LOTS of memory, it might<br>
> have just had the misfortune of landing on bad RAM every time it loaded, so<br>
> I ran a memtest. Nope, nada.<br>
><br>
> During the copy TO my internal drive, the internal drive found and flagged<br>
> errors internally, but never returned an error status to the operating<br>
> system. WTF is the point of that behavior? It just chugged right along.<br>
> So at this point, I'm thinking that I have a bad drive. (At this point, I<br>
> hadn't checked SMART yet, either, because I was operating under the<br>
> erroneous assumption that all modern distros do so for you.) But since I<br>
> had no other conclusions, I thought I would check it manually. Went to run<br>
> smartctl and... got a command not found error message.<br>
><br>
> Well, that explained a fair bit!<br>
><br>
> So I installed that stuff and ran it, and its error log was full (5 entries<br>
> is all it holds) and so I ran a full self test and went to bed. Self test<br>
> and surface scan were perfectly fine. So, I concluded then that it must be<br>
> the cable.<br>
><br>
> Swapped the cable, and the UDMA error count stopped increasing, two short of<br>
> what the drive firmware considers "dying". Heh.<br>
><br>
> At that point I tried decompressing the data, and still had the same<br>
> problem.<br>
><br>
> Solution?<br>
><br>
> # touch *<br>
> # rsync --inplace --no-whole-file -av /path/to/orig /path/to/corrupt<br>
><br>
> ... which corrected all the errors and then I was finally able to<br>
> decompress.<br>
><br>
> I would have decompressed it to my drive to work around the problem, except<br>
> that would have just created new ones. ;-)<br>
><br>
> Really, there are two things that would have made this better: (a) the drive<br>
> should have reported error status back to the operating system during the<br>
> write in which it detected the error, because then I would have known<br>
> IMMEDIATELY that something was wrong. (b) When reading it back, a checksum<br>
> would have said "hey your data is corrupt" instead of the drive saying "all<br>
> good" and gzip going "format violated".<br>
><br>
> I know that checksums wouldn't help at write time, but they would sure<br>
> clarify the errors at read time. I'm still confused, though, as to why the<br>
> drive didn't yell loudly. Why didn't I get an I/O error abort if the drive<br>
> bloody well knew that it got corrupted data?<br>
><br>
> --- Mike<br>
<br>
<br>
Erm, yeah, silently dropping corrupt commands is kinda crappy. Of<br>
course, then you start to run into the two generals problem:[1] how<br>
can the drive be sure error messages are getting back to the<br>
controller?<br>
<br>
Actually, doesn't SATA require some sort of ACK from the drive?<br>
There's an error register specifically in AHCI mode[2] that should<br>
report back CRC failures.<br>
<br>
I'm wondering if it's a case of crappy drive firmware, but it seems<br>
odd that it would update SMART registers and not report back to the<br>
OS...<br>
<br>
<br>
<br>
[1] <a href="https://en.wikipedia.org/wiki/Two_Generals'_Problem" target="_blank">https://en.wikipedia.org/wiki/Two_Generals'_Problem</a><br>
[2] <a href="http://wiki.osdev.org/AHCI" target="_blank">http://wiki.osdev.org/AHCI</a><br>
--<br>
David Tomaschik<br>
OpenPGP: 0x5DEA789B<br>
<a href="http://systemoverlord.com" target="_blank">http://systemoverlord.com</a><br>
<a href="mailto:david@systemoverlord.com">david@systemoverlord.com</a><br>
</blockquote></div>
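<p dir="ltr">The read-time check discussed in the thread (a checksum that says "hey your data is corrupt" before gzip trips over it) can be approximated in userspace. What follows is only a rough sketch, assuming both the original and the copy are still reachable as directory trees; the function names <code>sha256_of</code> and <code>find_mismatches</code> are made up for illustration and are not part of any tool mentioned above. Note also that reading a file back right after writing it may be served from the page cache rather than the disk, so a check like this is more meaningful after a remount or reboot.</p>

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(orig_dir, copy_dir):
    """Compare every file under orig_dir with its counterpart under copy_dir.

    Returns a sorted list of relative paths (POSIX style) that are either
    missing from the copy or whose contents differ from the original.
    """
    orig_dir, copy_dir = Path(orig_dir), Path(copy_dir)
    bad = []
    for src in orig_dir.rglob("*"):
        if not src.is_file():
            continue
        rel = src.relative_to(orig_dir)
        dst = copy_dir / rel
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            bad.append(rel.as_posix())
    return sorted(bad)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical paths): python verify_copy.py /path/to/orig /path/to/corrupt
    for name in find_mismatches(sys.argv[1], sys.argv[2]):
        print("MISMATCH:", name)
```

<p dir="ltr">Any file this lists would be a candidate for re-copying, which is roughly what the rsync invocation quoted above ended up repairing.</p>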