[ale] Lab Workstation Mystery

Jim Kinney jim.kinney at gmail.com
Mon Mar 28 12:34:36 EDT 2016


The root dir is NOT NFS mounted, so the fact that you can't mount /home
later is a red herring. If /var is not writable, the system will hang
because it can't log any more. Mounting requires a log entry.

Since it's not happening to all the machines at once, it really smells
like a local machine problem. Verify that the drive is not full. Check
whether the affected machines are on the same power circuit.

Is it the same 2-3 each time? If so, run memtest and badblocks. If swap
gets corrupted, Linux systems lock up.
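
A minimal sketch of the first two checks (is the drive full, and has a
filesystem flipped to read-only), assuming a Linux box where /proc/mounts
is readable. The function names and the 95% threshold are made up for
illustration, and this is no substitute for memtest or badblocks:

```python
import shutil


def mount_options(mounts_text, mount_point="/"):
    """Return the option list for mount_point from /proc/mounts-style text.

    Returns None if mount_point is not listed as its own mount.
    """
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")  # a later entry overrides an earlier one
    return options


def is_read_only(options):
    """True only if the exact option 'ro' is present (not 'errors=remount-ro')."""
    return options is not None and "ro" in options


def percent_used(path="/"):
    """Percentage of the filesystem holding path that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def report(mounts_path="/proc/mounts", threshold=95.0):
    """Print a one-line status for / and /var (threshold is an arbitrary choice)."""
    try:
        with open(mounts_path) as f:
            text = f.read()
    except OSError as err:
        print(f"cannot read {mounts_path}: {err}")
        return
    for mp in ("/", "/var"):
        opts = mount_options(text, mp)
        if opts is None:
            print(f"{mp}: not a separate mount")
            continue
        state = "READ-ONLY" if is_read_only(opts) else "read-write"
        used = percent_used(mp)
        flag = " (nearly full)" if used >= threshold else ""
        print(f"{mp}: {state}, {used:.1f}% used{flag}")


if __name__ == "__main__":
    report()
```

Run from cron every few minutes with the output sent to another host,
something like this might show whether a full disk or the read-only
remount comes first, even after local logging has stopped.
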
On Mon, 2016-03-28 at 10:54 -0500, Todor Fassl wrote:
> I have a mysterious problem with workstations in a shared-use
> environment. There are 2 labs in different buildings, one with 6
> workstations and one with 8. These workstations are used by a group of
> about 30 grad student TAs. All are running Ubuntu 15.10. Authentication
> is via LDAP and home directories are mounted via NFS. Every day, 2 or 3
> of the machines go down. The earliest symptom I can find is that the
> root filesystem is remounted read-only. Soon they stop responding to
> ssh and snmp and they are essentially locked up. They still respond to
> pings, though.
> 
> I've caught the machines in the period where the root filesystem is
> read-only but I can still ssh to them. I've found that I cannot NFS
> mount home directories from our file server. I can mount NFS shares
> from other servers. And I can mount the same home directories if I go
> to another workstation. Restarting NFS on the file server has no effect.
> 
> When I try to mount a home directory on an affected machine, the mount
> just hangs. I ran it with strace and it just showed it was waiting
> (for what, I'm not sure, and I don't have a screen cap available at the
> moment). I put a packet sniffer on the server and it showed it received
> a single packet from the client and that's it.
> 
> There is nothing in the logs on the client. In fact, the logs simply
> stop at some point in the process. At first I attributed this to the
> root filesystem being read-only, but it continued after I moved /var to
> a separate filesystem. At some point it just stops writing records to
> the syslog, but I don't know if it's before or after the root
> filesystem is remounted read-only.
> 
> Many of the TAs also have identical workstations in their offices. None
> of those machines seem to have this problem. The TAs do tend to walk
> away from the workstations w/o logging out. But I wrote a script to
> kill off their sessions and it didn't help. I had it send me an email
> whenever it killed somebody's session and the failures don't seem to be
> correlated with that. In other words, sometimes machines go down even
> if everyone who has used them has remembered to log out.
> 
> I'm pretty desperate. Any ideas?
-- 
James P. Kinney III

Every time you stop a school, you will have to build a jail. What you
gain at one end you lose at the other. It's like feeding a dog on his
own tail. It won't fatten the dog.
- Speech 11/23/1900 Mark Twain

http://heretothereideas.blogspot.com/
