[ale] diagnosis

James P. Kinney III jkinney at localnetsolutions.com
Sat Apr 24 16:04:22 EDT 2004


On Sat, 2004-04-24 at 06:58, David Corbin wrote:
> On Friday 23 April 2004 21:12, James P. Kinney III wrote:
> > Then stop worrying. The 2.4.18+ kernels have some rather aggressive
> > memory caching. The buffers grow as the system is used to make data
> > access faster on a reuse of old data. Some apps (mozilla) don't release
> > buffer ram as fast as the chance of reuse is high. If the system needs
> > more non-buffered ram, the buffers get dumped out to disk cache. Top is
> > also a buffering hog.
> >
> > To really test the system for nefarious RAM usage, telinit S to drop to
> > single user mode, then run the clean top. If that is OK, bump up to
> > telinit 2, then 3 and finally 5. There is a host of stuff that runs with
> > X that can be causing cache to grow over time while essentially "doing
> > nothing". It's really an "undocumented feature" of GNOME, KDE, and X :)
> >
> 
> The "investigation" I ran yesterday *was* in single user mode. And to keep
> things fresh in your memory, as soon as the /var/run/utmp file exists (even
> in single user mode), memory starts disappearing from free to be used by
> buffers. If that file is not there (when I mount /var) I do not see evidence
> of the memory leak. I've never let it exhaust memory while in single user
> mode, but at run-level 2 (normal) it eventually runs out of memory to
> allocate. I wouldn't really say the system crashes, but none of the
> applications can operate, as no RAM is available for them.

Well, utmp is a storage area for logins and usage info. If that file is
growing in single user mode with nothing else running, you have a
problem. The kernel should be what is generating the data for the utmp
file. Since the presence of utmp initiates the memory loss, I would
suspect that the kernel is corrupted and is not flushing the write to
utmp, and is instead buffering the write process and/or data. This may
indicate a bad hard drive, a trojaned kernel, or failing RAM.

Run memtest to rule out the RAM. Then copy a kernel from a CD distribution
and set lilo/grub to use that kernel. Then boot to single user, touch
utmp, reboot back to single user with the same CD kernel, and watch the
top process. If the problem is still there, drop in another hard drive,
make it the /var partition, and try again.
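
While watching, it may help to log the utmp size next to the free and
buffer numbers so they can be compared afterwards. Here is a rough
sketch, assuming a POSIX sh and awk are available in single user mode.
The function names, the log path /tmp/utmp-watch.txt, and the default
sample count and interval are made up for illustration; MemFree and
Buffers are the field names in /proc/meminfo.

```shell
#!/bin/sh
# Sketch only: log utmp size next to MemFree/Buffers for later comparison.
# MEMINFO, UTMP and LOG are illustrative defaults, not from the original post.
MEMINFO=${MEMINFO:-/proc/meminfo}
UTMP=${UTMP:-/var/run/utmp}
LOG=${LOG:-/tmp/utmp-watch.txt}

# Print the kB value of one /proc/meminfo field, e.g. "Buffers".
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2; exit }' "$MEMINFO"
}

# Print the size of a file in bytes, or 0 if it does not exist.
file_bytes() {
    if [ -f "$1" ]; then
        wc -c < "$1" | tr -d ' '
    else
        echo 0
    fi
}

# Emit one sample line: timestamp, utmp bytes, MemFree kB, Buffers kB.
sample() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') utmp=$(file_bytes "$UTMP") free=$(meminfo_kb MemFree) buffers=$(meminfo_kb Buffers)"
}

# Append COUNT samples to $LOG, INTERVAL seconds apart (defaults: 60 and 10).
watch_utmp() {
    count=${1:-60}
    interval=${2:-10}
    i=0
    while [ "$i" -lt "$count" ]; do
        sample >> "$LOG"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

Source the file and run "watch_utmp 60 10" once before touching utmp and
once after. If buffers climbs while the utmp size stays flat, that would
suggest the growth is not the utmp data itself.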

If all that fails, get a Geiger counter and start looking for a
radiation source that can cause bit flips :)
> 
> > On Fri, 2004-04-23 at 17:37, David Corbin wrote:
> > > I tried it with the "safe" version of top.  It shows nothing that isn't
> > > in my regular top.  However, I did try "vmstat" which was there.  It
> > > shows that the free memory is disappearing as "buffers" grows.
> > >
> > > Does that help any?
> > >
> > > On Monday 19 April 2004 20:35, James P. Kinney III wrote:
> > > > I put up a page with the binaries and source on it :
> > > >
> > > > http://www.localnetsolutions.com/tools/
> > > >
> > > > Note: the procps page on sourceforge did not have an md5 checksum.
> > > >
> > > > On Mon, 2004-04-19 at 20:02, David Corbin wrote:
> > > > > On Monday 19 April 2004 15:01, James P. Kinney III wrote:
> > > > > > If it is a cracked machine, running a statically linked top from a
> > > > > > CD will gain access to the real top data. Top is a common binary
> > > > > > for a rootkit to tamper with.
> > > > >
> > > > > Sounds reasonable.  Can you point me at such, or if not that, anybody
> > > > > got any idea where the source to top is and I'll build my own.
> > > > >
> > > > > > It is certainly possible to _add_ a module or _remove_ a module,
> > > > > > but not to change out the kernel without a reboot (unless
> > > > > > 2-kernel-monte is available; I have not been able to find this
> > > > > > :(  ). So the actual data stream for top is not easy to tamper
> > > > > > with. Thus a known good statically-linked top would give access to
> > > > > > the running system and show the _real_ processes that are running.
> > > > > >
> > > > > > If top shows no malicious files, it's time to take some snapshots
> > > > > > over time to plot which app is failing.
> > > > > >
> > > > > > #!/bin/sh
> > > > > > date >> /tmp/top.txt
> > > > > > top -b -n 1 -c >> /tmp/top.txt
> > > > > > echo "###############" >>/tmp/top.txt
> > > > > > echo >>/tmp/top.txt
> > > > > > echo >>/tmp/top.txt
> > > > > >
> > > > > > Run it from cron every minute for an hour.
> > > > > >
> > > > > > If you want, you can now mash/mangle the data into a nice plot
> > > > > > using some Perl and gnuplot (or a spreadsheet).
> > > > > >
> > > > > > On Mon, 2004-04-19 at 11:56, Geoffrey wrote:
> > > > > > > Dow Hurst wrote:
> > > > > > > > How can we find the process that is soaking the memory?  How do
> > > > > > > > you manipulate /proc to find out the originating process that
> > > > > > > > owns the memory being used?  I know IRIX had tools to look at
> > > > > > > > memory and see which processes owned what part of memory.  Does
> > > > > > > > Linux?
> > > > > > > >
> > > > > > > > Seems if you knew what was leaking you would have a major part
> > > > > > > > of the battle won.
> > > > > > >
> > > > > > > I believe we mentioned top, but he noted that doesn't give him
> > > > > > > anything. That's what concerns me.  If it doesn't show, is it
> > > > > > > being hidden for a reason???
> > >
> > > _______________________________________________
> > > Ale mailing list
> > > Ale at ale.org
> > > http://www.ale.org/mailman/listinfo/ale
-- 
James P. Kinney III          \Changing the mobile computing world/
CEO & Director of Engineering \          one Linux user         /
Local Net Solutions,LLC        \           at a time.          /
770-493-8244                    \.___________________________./
http://www.localnetsolutions.com

GPG ID: 829C6CA7 James P. Kinney III (M.S. Physics)
<jkinney at localnetsolutions.com>
Fingerprint = 3C9E 6366 54FC A3FE BA4D 0659 6190 ADC3 829C 6CA7