I agree. There is no publicly available mechanism for an ECC memory error event to make it to the kernel. When those events occur, there is (usually) a way for that information to be stored in NVRAM on the affected DIMM itself. Accessing that area requires specialty kernel code that varies by RAM maker. While I _have_ seen that code in operation, unless you work for a large search engine that custom-builds its own hardware and runs Linux for everything, you're out of luck getting access to that data. And that code is not publicly available.<br>
<br>That said, there may be some BIOS-level processes that can analyze memory faults. I know I've seen that on some older Compaq and newer IBM hardware.<br><br><div class="gmail_quote">On Tue, Dec 15, 2009 at 10:01 AM, Michael H. Warfield <span dir="ltr"><<a href="mailto:mhw@wittsend.com">mhw@wittsend.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="im">On Tue, 2009-12-15 at 00:00 -0500, Jeff Hubbs wrote:<br>
> OK, but being ECC RAM, wouldn't something have shown up in<br>
> /var/log/kernel? How could I tell other than using FSM-style faith?<br>
<br>
</div> I don't believe there's a specific interrupt or error upon memory<br>
parity or ECC failure. I think it generates an NMI (Non Maskable<br>
Interrupt) but a lot of things could generate that error (Error:<br>
Unexpected NMI. Dazed and confused but trying to continue anyways). I<br>
don't know if there's an indication in a memory controller somewhere or<br>
not about that. Might depend on your hardware. Obviously, once you<br>
take a non-recoverable memory hit, everything becomes suspect.<br>
<div class="im"><br>
> Jim Kinney wrote:<br>
> > Bad ECC RAM is still bad RAM. ECC can only correct a single bit flip<br>
> > in register. 2 bit flips and it's all toast.<br>
> ><br>
> > It does sound like Samba managed to totally corrupt itself and the<br>
> > hang later may have been related to the system thrashing ram around.<br>
> > The filesystem definitions are kernel space so samba has to access<br>
> > that to function. Just restarting samba is a pretty good indication<br>
> > that it was memory associated with the samba process. The aggressive<br>
> > caching of the kernel will amplify a bad memory situation. Restarting<br>
> > samba will cause the samba caching to also restart and that may have<br>
> > overwritten the bad data portion which was related to the filesystem<br>
> > management area.<br>
<br>
</div> Mike<br>
<font color="#888888"><br>
--<br>
Michael H. Warfield (AI4NB) | (770) 985-6132 | mhw@WittsEnd.com<br>
/\/\|=mhw=|\/\/ | (678) 463-0932 | <a href="http://www.wittsend.com/mhw/" target="_blank">http://www.wittsend.com/mhw/</a><br>
NIC whois: MHW9 | An optimist believes we live in the best of all<br>
PGP Key: 0x674627FF | possible worlds. A pessimist is sure of it!<br>
</font><br>_______________________________________________<br>
Ale mailing list<br>
<a href="mailto:Ale@ale.org">Ale@ale.org</a><br>
<a href="http://mail.ale.org/mailman/listinfo/ale" target="_blank">http://mail.ale.org/mailman/listinfo/ale</a><br>
See JOBS, ANNOUNCE and SCHOOLS lists at<br>
<a href="http://mail.ale.org/mailman/listinfo" target="_blank">http://mail.ale.org/mailman/listinfo</a><br>
<br></blockquote></div><br><br clear="all"><br>-- <br>James P. Kinney III<br>Actively in pursuit of Life, Liberty and Happiness <br><br>