[ale] Lab Workstation Mystery

Mon Mar 28 13:15:26 EDT 2016

I've seen this exact issue before on CentOS systems, but I can't recall
what the cause was.
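
For the check mentioned in the quoted message below (whether /tmp and /var
are also mounted read-only), here is a minimal Python sketch. It assumes the
Linux /proc/mounts layout (device, mount point, type, options, ...). The
function name read_only_mounts and its default watch list are made up for
illustration and are not part of any tool on these machines.

```python
def read_only_mounts(mounts_text, watch=("/", "/tmp", "/var")):
    """Return the watched mount points whose option list contains 'ro'.

    mounts_text is the raw content of /proc/mounts (assumed format:
    device, mount point, fs type, comma-separated options, ...).
    """
    state = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3].split(",")
        # If a mount point appears more than once, the later entry wins.
        state[mount_point] = "ro" in options
    return [mp for mp in watch if state.get(mp)]
```

On an affected machine it could be called as
read_only_mounts(open("/proc/mounts").read()). Reading /proc/mounts does not
write to disk, so it should still work while / is read-only, as long as a
Python interpreter can be started.
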
On Mar 28, 2016 1:06 PM, "Todor Fassl" <fassl.tod at gmail.com> wrote:

> The partitions are not full. That was the first thing I checked. I
> thought maybe someone was filling up /tmp. But I sshed to a machine while
> the root fs was mounted read-only and it wasn't full. I even set up Nagios
> to watch the percent full via SNMP and it never showed a problem.
>
> The root filesystem, /tmp, and /var are on separate partitions. But the
> reason the logs just cut off may be that when it remounts / read-only, it
> also remounts /var read-only. There is nothing unusual in the logs and then
> they just stop.
>
> On the other hand, my memory is a little fuzzy, but I am pretty sure
> that I was able to mount an NFS share from a different server when a
> machine was in the in-between state. Somebody might have been logged into
> it at the time. I'm doing all this remotely because the machines are
> several floors away. But if it was unable to write log records, I
> shouldn't have been able to do an NFS mount at all. Next time I catch a
> machine in the in-between state, I'll check if /tmp and /var are mounted
> read-only.
>
> On 03/28/2016 11:34 AM, Jim Kinney wrote:
>
>> The root dir is NOT NFS-mounted, so being unable to mount /home later
>> is a red herring. If /var is not writeable, the system will hang
>> as it can't log any more. Mounting requires a log entry.
>> Since it's not happening all at once to all the machines, it really
>> smells like a local machine problem. Verify that the drive is not full.
>> Check to see if the affected machines are on the same power circuit.
>> Is it the same 2-3 each time? If so, run memtest and badblocks. If swap
>> gets corrupted, Linux systems lock up.
>> On Mon, 2016-03-28 at 10:54 -0500, Todor Fassl wrote:
>>
>>> I have a mysterious problem with workstations in a shared-use
>>> environment. There are 2 labs in different buildings, one with 6
>>> workstations and one with 8. These workstations are used by a group
>>> of about 30 grad student TAs. All are running Ubuntu 15.10.
>>> Authentication is via LDAP and home directories are mounted via NFS.
>>> Every day, 2 or 3 of the machines go down. The earliest symptom I can
>>> find is that the root filesystem is remounted read-only. Soon they
>>> stop responding to ssh and SNMP and they are essentially locked up.
>>> They still respond to pings, though.
>>>
>>> I've caught the machines in the period where the root filesystem is
>>> read-only but I can still ssh to them. I've found that I cannot NFS
>>> mount home directories from our file server. I can mount NFS shares
>>> from other servers. And I can mount the same home directories if I go
>>> to another workstation. Restarting NFS on the file server has no
>>> effect.
>>>
>>> When I try to mount a home directory on an affected machine, the
>>> mount just hangs. I ran it under strace and it just showed it was
>>> waiting -- for what, I'm not sure, and I don't have a screen cap
>>> available at the moment. I put a packet sniffer on the server and it
>>> showed it received a single packet from the client and that's it.
>>>
>>> There is nothing in the logs on the client. In fact, they simply stop
>>> at some point in the process. At first I attributed this to the root
>>> filesystem being read-only, but it continued after I moved /var to a
>>> separate filesystem. At some point it just stops writing records to
>>> the syslog, but I don't know if it's before or after the root
>>> filesystem is remounted read-only.
>>>
>>> Many of the TAs also have identical workstations in their offices.
>>> None of those machines seem to have this problem. The TAs do tend to
>>> walk away from the workstations w/o logging out. But I wrote a script
>>> to kill off their sessions and it didn't help. I had it send me an
>>> email whenever it killed somebody's session, and the failures don't
>>> seem to be correlated with that. In other words, sometimes machines go
>>> down even if everyone who has used them has remembered to log out.
>>>
>>> I'm pretty desperate. Any ideas?
>>>
>>> _______________________________________________
>>> Ale mailing list
>>> Ale at ale.org
>>> http://mail.ale.org/mailman/listinfo/ale
>>> See JOBS, ANNOUNCE and SCHOOLS lists at
>>> http://mail.ale.org/mailman/listinfo
>>>
>>
> --
> Todd