<p dir="ltr">Doesn&#39;t it? Especially when nearly 200 of them occur. I would at least expect a small percentage to get back. </p>

<p dir="ltr">This board only does AHCI mode, so I am at a loss... Sad thing is that I have an early model BD-RE burner that doesn&#39;t work in AHCI mode (it locks up hard until power is removed and the device is talked to in &quot;old style&quot; mode), so I had to remove it. </p>


<p dir="ltr">If I knew enough to debug, I would. But I am completely unfamiliar with the workings of the bus, and don&#39;t have time to figure out the why, sadly, if the behavior is out of spec... </p>

<p dir="ltr">Isn&#39;t the world of technology grand? :-) </p>

<div class="gmail_quote">On Nov 4, 2012 8:15 PM, &quot;David Tomaschik&quot; &lt;<a href="mailto:david@systemoverlord.com">david@systemoverlord.com</a>&gt; wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

On Sun, Nov 4, 2012 at 4:23 PM, <a href="mailto:mike@trausch.us">mike@trausch.us</a> &lt;<a href="mailto:mike@trausch.us">mike@trausch.us</a>&gt; wrote:<br>

&gt; On 11/04/2012 05:56 PM, David Tomaschik wrote:<br>

&gt;&gt;<br>

&gt;&gt; Not sure how a filesystem-level checksum would help with corruption on<br>

&gt;&gt; the wire, other than to prevent reading back bad blocks.  During the<br>

&gt;&gt; write, you&#39;re pretty much trusting what&#39;s there, unless you want to<br>

&gt;&gt; read back the data and verify the checksum immediately, in which case<br>

&gt;&gt; you&#39;re talking about a seek immediately after each block write.  Good<br>

&gt;&gt; luck with &quot;performance&quot; on that.  As you pointed out, the UDMA CRC was<br>

&gt;&gt; catching this problem.  Do you think any data was corrupted due to<br>

&gt;&gt; this bad cable?<br>

&gt;<br>

&gt;<br>

&gt; Know so.<br>

&gt;<br>

&gt; Yesterday, I ran a series of tests.  Even decompressed the data from its<br>

&gt; origin drive, which worked.  See, at first I thought maybe it was a software<br>

&gt; problem.  So I compiled gzip statically on an Ubuntu system that could<br>

&gt; decompress the original data set, scp&#39;d that to my box here, and it still<br>

&gt; had the error.  Okay, software problems/bugs are now eliminated.<br>

&gt;<br>

&gt; Next thing I figured was, since gzip doesn&#39;t use LOTS of memory, it might<br>

&gt; have just had the misfortune of landing on bad RAM every time it loaded, so<br>

&gt; I ran a memtest.  Nope, nada.<br>

&gt;<br>

&gt; During the copy TO my internal drive, the internal drive found and flagged<br>

&gt; errors internally, but never returned an error status to the operating<br>

&gt; system.  WTF is the point of that behavior?  It just chugged right along.<br>

&gt; So at this point, I&#39;m thinking that I have a bad drive. (At this point, I<br>

&gt; hadn&#39;t checked SMART yet, either, because I was operating under the<br>

&gt; erroneous assumption that all modern distros do so for you.)  But since I<br>

&gt; had no other conclusions, I thought I would check it manually.  Went to run<br>

&gt; smartctl and... got a command not found error message.<br>

&gt;<br>

&gt; Well, that explained a fair bit!<br>

&gt;<br>

&gt; So I installed that stuff and ran it, and its error log was full (5 entries<br>

&gt; is all it holds) and so I ran a full self test and went to bed.  Self test<br>

&gt; and surface scan were perfectly fine.  So, I concluded then that it must be<br>

&gt; the cable.<br>

&gt;<br>

&gt; Swapped the cable, and the UDMA error count stopped increasing, two short of<br>

&gt; what the drive firmware considers &quot;dying&quot;.  Heh.<br>

&gt;<br>

&gt; At that point I tried decompressing the data, and still had the same<br>

&gt; problem.<br>

&gt;<br>

&gt; Solution?<br>

&gt;<br>

&gt; # touch *<br>

&gt; # rsync --inplace --no-whole-file -av /path/to/orig /path/to/corrupt<br>

&gt;<br>

&gt; ... which corrected all the errors and then I was finally able to<br>

&gt; decompress.<br>

&gt;<br>

&gt; I would have decompressed it to my drive to work around the problem, except<br>

&gt; that would have just created new ones.  ;-)<br>

&gt;<br>

&gt; Really, there are two things that would have made this better: (a) the drive<br>

&gt; should have reported error status back to the operating system during the<br>

&gt; write in which it detected the error, because then I would have known<br>

&gt; IMMEDIATELY that something was wrong.  (b) When reading it back, a checksum<br>

&gt; would have said &quot;hey your data is corrupt&quot; instead of the drive saying &quot;all<br>

&gt; good&quot; and gzip going &quot;format violated&quot;.<br>

&gt;<br>

&gt; I know that checksums wouldn&#39;t help at write time, but they would sure<br>

&gt; clarify the errors at read time.  I&#39;m still confused, though, as to why the<br>

&gt; drive didn&#39;t yell loudly.  Why didn&#39;t I get an I/O error abort if the drive<br>

&gt; bloody well knew that it got corrupted data?<br>

&gt;<br>

&gt;         --- Mike<br>

<br>

<br>

Erm, yeah, silently dropping corrupt commands is kinda crappy.  Of<br>

course, then you start to run into the two generals problem:[1]  how<br>

can the drive be sure error messages are getting back to the<br>

controller?<br>

<br>

Actually, doesn&#39;t SATA require some sort of ACK from the drive?<br>

There&#39;s an error register specifically in AHCI mode[2] that should<br>

report back CRC failures.<br>

<br>

I&#39;m wondering if it&#39;s a case of crappy drive firmware, but it seems<br>

odd that it would update SMART registers and not report back to the<br>

OS...<br>

<br>

<br>

<br>

[1] <a href="https://en.wikipedia.org/wiki/Two_Generals&#39;_Problem" target="_blank">https://en.wikipedia.org/wiki/Two_Generals&#39;_Problem</a><br>

[2] <a href="http://wiki.osdev.org/AHCI" target="_blank">http://wiki.osdev.org/AHCI</a><br>

--<br>

David Tomaschik<br>

OpenPGP: 0x5DEA789B<br>

<a href="http://systemoverlord.com" target="_blank">http://systemoverlord.com</a><br>

<a href="mailto:david@systemoverlord.com">david@systemoverlord.com</a><br>

_______________________________________________<br>

Ale mailing list<br>

<a href="mailto:Ale@ale.org">Ale@ale.org</a><br>

<a href="http://mail.ale.org/mailman/listinfo/ale" target="_blank">http://mail.ale.org/mailman/listinfo/ale</a><br>


</blockquote></div>