<div dir="ltr"><div class="gmail_default" style="font-size:small">I'd check whether your NFS server and NFS client software are well matched and free of known bugs that could cause this. We had a problem with our NFS server locking up its interactions with these workstations due to an NFS client/server bug. IIRC (it was several years ago), updating the clients to a newer version of Scientific Linux fixed the problem. The NFS server was Ubuntu 8.04 based and the Scientific Linux was based off of CentOS 4, I think. It's been too long, but I think I have the info on the versions correct. What was weird was that the Ubuntu-based and Scientific Linux workstations that weren't affected directly would continue working until the NFS server's RAM filled up and the server couldn't flush data to disk. So someone would complain their workstation was locked up, but the root cause was not that workstation but a different one. The server logs plus some serious web searching pointed to the NFS server/client mismatch as a possible problem. The Scientific Linux workstation got updated and the problem went away.</div><div class="gmail_default" style="font-size:small"><br></div><div class="gmail_default" style="font-size:small">Later we had problems with the NFS server getting overloaded by too many clients and too high a load. But the mysterious lockups seemed related to a specific combination of NFS server/client software from different distributions and kernel versions.</div><div class="gmail_default" style="font-size:small"><br></div><div class="gmail_default" style="font-size:small">If the workstations are not on UPSes that can manage undervoltage or overvoltage, then you could have electrical problems as the root cause. 
We had good success keeping workstations alive for a long time by keeping them on decent UPSes that aren't overloaded.</div></div><div class="gmail_extra"><br clear="all"><div><div class="gmail_signature"><div dir="ltr"><font size="2">Sincerely,<br>Dow<br></font><font size="6"><span style="color:rgb(0,0,0);font-family:sans-serif;line-height:44.79999923706055px;background-color:rgb(249,249,249)">⚛</span></font><font size="2">Dow Hurst, Research Scientist<br> 340 Sullivan Science Bldg.</font><div><font size="2"> Dept. of Chem. and Biochem.<br> University of North Carolina at Greensboro<br> PO Box 26170 Greensboro, NC 27402-6170<br></font><br></div></div></div></div>
<br><div class="gmail_quote">On Mon, Mar 28, 2016 at 1:20 PM, Todor Fassl <span dir="ltr">&lt;<a href="mailto:fassl.tod@gmail.com" target="_blank">fassl.tod@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
We've run every kind of hardware diagnostic we can think of. Besides, it's just these 14 machines in the 2 shared spaces. Identical machines in private offices don't seem to have any problem.<br>
<br>
But, you're right. Some kind of power problem is the best theory I've seen for a while. The 2 rooms are in different buildings and they never had a problem before. But maybe somebody is plugging something in. Come to think of it, we had a similar problem years ago when a student put a microwave oven in his office. The computers on the other side of the wall kept going down. I don't know enough about electricity to explain that, but the microwave oven and the computer were plugged into outlets on opposite sides of the same wall.<br>
<br>
What kind of gizmo would a grad student be bringing into a lab that would make linux workstations freeze up?<br>
<br>
Another reason this theory makes sense is that I haven't gotten a single complaint about the machines going down. You'd think if they were going down while people were using them, I'd get complaints. People are always logged in when they go down but that doesn't mean anything since they tend to walk away w/o logging out. I've looked for patterns in the list of users who were logged in when a machine went down but didn't see any. I can't rule out that it's somebody doing something though. There might be a pattern and I just didn't see it. But I am sure there isn't one guy who is always logged in when a machine goes down.<br>
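<br>
One way to make that pattern search less dependent on eyeballing is to count, for each user, the share of crash events at which they were logged in. This is only a minimal sketch: the function name and the input shape (one collection of usernames per crash) are hypothetical, and a high share proves nothing by itself, since someone who never logs out anywhere would also score high.<br>

```python
from collections import Counter


def users_by_crash_share(crashes):
    """Rank users by the fraction of crash events they were logged in for.

    crashes: a list with one collection of usernames per crash event
    (hypothetical input format, e.g. gathered from `last` or `who` output).
    Returns a list of (user, fraction) pairs, most frequent first.
    """
    total = len(crashes)
    if total == 0:
        return []
    counts = Counter()
    for users in crashes:
        # set() ensures a user with several sessions is counted once per crash
        counts.update(set(users))
    return [(user, n / total) for user, n in counts.most_common()]
```

Anyone whose share is close to 1.0 would be worth comparing against how often they are logged in on machines that stayed up.<br>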
<br>
On 03/28/2016 11:05 AM, James Taylor wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
The most common, if not the only, reason I've seen partitions get marked read-only is when I've had power glitches that caused a very brief interruption in connectivity to the drives.<br>
Normally that is not an issue with locally attached drives on workstations, but stranger things have happened.<br>
Are the workstations on a UPS, or is the power to the rooms conditioned properly?<br>
-jt<br>
<br>
<br>
James Taylor<br>
<a href="tel:678-697-9420" value="+16786979420" target="_blank">678-697-9420</a><br>
<a href="mailto:james.taylor@eastcobbgroup.com" target="_blank">james.taylor@eastcobbgroup.com</a><br>
<br>
<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Todor Fassl &lt;<a href="mailto:fassl.tod@gmail.com" target="_blank">fassl.tod@gmail.com</a>&gt; 3/28/2016 11:54 AM >>><br>
</blockquote></blockquote></blockquote>
I have a mysterious problem with workstations in a shared use<br>
environment. There are 2 labs in different buildings, one with 6<br>
workstations and one with 8. These workstations are used by a group of<br>
about 30 grad student TAs. All are running Ubuntu 15.10. Authentication<br>
is via LDAP and home directories are mounted via NFS. Every day, 2 or<br>
3 of the machines go down. The earliest symptom I can find is that the<br>
root filesystem is remounted read-only. Soon they stop responding to<br>
ssh and snmp and they are essentially locked up. They still respond to<br>
pings though.<br>
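<br>
Since the read-only remount is the earliest symptom, it may help to detect it as soon as it happens. The following is a minimal sketch, assuming the standard /proc/mounts layout (device, mount point, filesystem type, comma-separated options); the function names are hypothetical and the alerting side is left out.<br>

```python
def mount_options(mounts_text):
    """Map each mount point to its option list, parsed from /proc/mounts text.

    Later entries override earlier ones, because the last mount listed for
    a mount point is the one currently visible there.
    """
    options = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4:
            # fields: device, mount point, fs type, options, dump, pass
            options[fields[1]] = fields[3].split(",")
    return options


def root_is_readonly(mounts_text):
    """Return True if the root filesystem is currently mounted read-only."""
    # Exact match on "ro": an option such as errors=remount-ro does not count.
    return "ro" in mount_options(mounts_text).get("/", [])
```

Run from cron or a small loop with the contents of /proc/mounts as input, and have it report to another host, since the local logs stop being written.<br>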
<br>
I've caught the machines in the period where the root system is<br>
read-only but I can still ssh to them. I've found that I cannot nfs<br>
mount home directories on our file server. I can mount nfs shares on<br>
other servers. And I can mount the same home directories if I go to<br>
another workstation. Restarting nfs on the file server has no effect.<br>
<br>
When I try to mount a home directory on an affected machine, the mount<br>
just hangs. I ran it with strace and it just showed it was waiting --<br>
for what, I'm not sure and I don't have a screen cap available at the<br>
moment. I put a packet sniffer on the server and it showed it received a<br>
single packet from the client and that's it.<br>
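<br>
To separate "the client cannot reach the server at all" from "the NFS mount itself hangs", a plain TCP connection test against the NFS port can be run from the affected machine. This is a rough sketch, assuming NFS over TCP on the standard port 2049; the function name is hypothetical, and a successful connect only shows the TCP handshake works, not that NFS RPCs are answered.<br>

```python
import socket


def tcp_port_open(host, port=2049, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout.

    Port 2049 is the standard NFS port; pass another port to test other
    services such as rpcbind (111).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Comparing the result on an affected machine with the result on a healthy one, against both the file server and another NFS server, would show whether the failure is specific to that client/server pair.<br>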
<br>
There is nothing in the logs on the client. In fact, they simply stop at<br>
some point in the process. At first I attributed this to the root<br>
filesystem being read-only but it continues after I move /var to a<br>
separate file system. At some point it just stops writing records to the<br>
syslog but I don't know if it's before or after the root filesystem is<br>
remounted read-only.<br>
<br>
Many of the TAs also have identical workstations in their offices. None<br>
of those machines seem to have this problem. The TAs do tend to walk<br>
away from the workstations w/o logging out. But I wrote a script to kill<br>
off their sessions and it didn't help. I had it send me an email<br>
whenever it killed somebody's session and it doesn't seem to be<br>
correlated with that. In other words, sometimes machines go down even if<br>
everyone who has used it has remembered to log out.<br>
<br>
I'm pretty desperate. Any ideas?<br>
<br>
_______________________________________________<br>
Ale mailing list<br>
<a href="mailto:Ale@ale.org" target="_blank">Ale@ale.org</a><br>
<a href="http://mail.ale.org/mailman/listinfo/ale" rel="noreferrer" target="_blank">http://mail.ale.org/mailman/listinfo/ale</a><br>
See JOBS, ANNOUNCE and SCHOOLS lists at<br>
<a href="http://mail.ale.org/mailman/listinfo" rel="noreferrer" target="_blank">http://mail.ale.org/mailman/listinfo</a><br>
<br>
<br>
<br>
<br>
<br><span class="HOEnZb"><font color="#888888">
</font></span></blockquote><span class="HOEnZb"><font color="#888888">
<br>
-- <br>
Todd<br>
</font></span></blockquote></div><br></div>