<p dir="ltr">I've seen this exact issue before on CentOS systems, but I can't recall what the cause turned out to be.</p>
<div class="gmail_quote">On Mar 28, 2016 1:06 PM, "Todor Fassl" &lt;<a href="mailto:fassl.tod@gmail.com">fassl.tod@gmail.com</a>&gt; wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">The partitions are not full. That was the first thing I checked. I thought maybe someone was filling up /tmp. But I sshed to a machine while the root fs was mounted read-only and it wasn't full. I even set up nagios to watch the percent full via snmp and it never showed a problem.<br>
<br>
The root filesystem, /tmp, and /var are on separate partitions. But the reason the logs just cut off may be that when it remounts / read-only, it also remounts /var read-only. There is nothing unusual in the logs and then they just stop.<br>
<br>
On the other hand, my memory is a little fuzzy, but I am pretty sure that I was able to mount an nfs share from a different server when a machine was in the in-between state. Somebody might have been logged into it at the time. I'm doing all this remotely because the machines are several floors away. But if it was unable to write log records, I shouldn't have been able to do an nfs mount at all. Next time I catch a machine in the in-between state, I'll check if /tmp and /var are mounted read-only.<br>
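<br>
A minimal sketch of that check, assuming the machine still accepts ssh and has Python available. It parses text in /proc/mounts format and reports which of /, /tmp, and /var carry the ro option. The function name read_only_mounts is hypothetical, not something from this thread.<br>
<pre>
```python
def read_only_mounts(mounts_text, targets=("/", "/tmp", "/var")):
    """Return the targets that are mounted read-only.

    mounts_text must be in /proc/mounts format, one mount per line:
    device mountpoint fstype options dump pass
    """
    state = {}
    for line in mounts_text.splitlines():
        fields = line.split()[:4]
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        _device, mountpoint, _fstype, options = fields
        if mountpoint in targets:
            # A later entry for the same mountpoint overrides an earlier one.
            state[mountpoint] = "ro" in options.split(",")
    return [t for t in targets if state.get(t)]
```
</pre>
On an affected machine, read_only_mounts(open("/proc/mounts").read()) would list whichever of the three filesystems have gone read-only.<br>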
<br>
On 03/28/2016 11:34 AM, Jim Kinney wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
The root dir is NOT NFS mounted, so the fact that you can't mount /home<br>
later is a red herring. If /var is not writeable, the system will hang<br>
because it can't log any more. Mounting requires a log entry.<br>
Since it's not happening all at once to all the machines, it really<br>
smells like a local machine problem. Verify that the drive is not full.<br>
Check to see if the affected machines are on the same power circuit.<br>
Is it the same 2-3 each time? If so, run memtest and badblocks. If swap<br>
gets corrupted, Linux systems lock up.<br>
On Mon, 2016-03-28 at 10:54 -0500, Todor Fassl wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I have a mysterious problem with workstations in a shared use<br>
environment. There are 2 labs in different buildings, one with 6<br>
workstations and one with 8. These workstations are used by a group of<br>
about 30 grad student TAs. All are running Ubuntu 15.10. Authentication<br>
is via LDAP and home directories are mounted via NFS. Every day, 2 or<br>
3 of the machines go down. The earliest symptom I can find is that the<br>
root filesystem is remounted read-only. Soon they stop responding to<br>
ssh and snmp and they are essentially locked up. They still respond to<br>
pings though.<br>
<br>
I've caught the machines in the period where the root filesystem is<br>
read-only but I can still ssh to them. I've found that I cannot nfs<br>
mount home directories from our file server. I can mount nfs shares on<br>
other servers. And I can mount the same home directories if I go to<br>
another workstation. Restarting nfs on the file server has no effect.<br>
<br>
When I try to mount a home directory on an affected machine, the mount<br>
just hangs. I ran it with strace and it just showed it was waiting --<br>
for what, I'm not sure, and I don't have a screen cap available at the<br>
moment. I put a packet sniffer on the server and it showed it received a<br>
single packet from the client and that's it.<br>
<br>
There is nothing in the logs on the client. In fact, they simply stop at<br>
some point in the process. At first I attributed this to the root<br>
filesystem being read-only, but it continued after I moved /var to a<br>
separate file system. At some point it just stops writing records to the<br>
syslog but I don't know if it's before or after the root filesystem is<br>
remounted read-only.<br>
<br>
Many of the TAs also have identical workstations in their offices. None<br>
of those machines seem to have this problem. The TAs do tend to walk<br>
away from the workstations w/o logging out. But I wrote a script to kill<br>
off their sessions and it didn't help. I had it send me an email<br>
whenever it killed somebody's session and the lockups don't seem to be<br>
correlated with that. In other words, sometimes machines go down even if<br>
everyone who has used them has remembered to log out.<br>
<br>
I'm pretty desperate. Any ideas?<br>
<br>
_______________________________________________<br>
Ale mailing list<br>
<a href="mailto:Ale@ale.org" target="_blank">Ale@ale.org</a><br>
<a href="http://mail.ale.org/mailman/listinfo/ale" rel="noreferrer" target="_blank">http://mail.ale.org/mailman/listinfo/ale</a><br>
See JOBS, ANNOUNCE and SCHOOLS lists at<br>
<a href="http://mail.ale.org/mailman/listinfo" rel="noreferrer" target="_blank">http://mail.ale.org/mailman/listinfo</a><br>
<br>
<br>
</blockquote></blockquote>
<br>
-- <br>
Todd<br>
</blockquote></div>