[ale] How to debug a program that just goes away
Doug McNash
dmcnash at charter.net
Sun Feb 28 15:49:31 EST 2010
---- Jim Lynch <ale_nospam at fayettedigital.com> wrote:
> David Tomaschik wrote:
> > Jim Lynch wrote:
> >
> > >> I have a multi-threaded C++ program that occasionally just stops
> >> running. At the time it stops it is usually not doing anything. Every
> >> thread is either waiting on a semaphore or sleeping (Thread::sleep).
> >> It's event driven and no events have arrived for some time. I have lots
> >> of prints to be able to tell where it is and what it's doing. No core
> >> file generated. No strange messages in any log file, either system or
> >> application. No rogue processes killing it off.
> >>
> >> The program runs successfully on multiple other machines but not this
> >> one. It's a newer system than the others. I recompiled on this system,
> >> thinking it may help but no. Access to this system is limited to two
> >> people, myself and one other. I trust him since he's got more to lose
> >> than I do if it doesn't work. I can work around it with a wrapper,
> >> restarting when it fails, but I'd really like to understand how it's
> >> happening.
> >>
> >> I have ulimit -c 50000 in the script that runs it, so a core will be
> >> generated if it aborts. I trap SIGHUP, SIGINT, SIGCHLD and SIGQUIT and
> >> will see something in the log file if a signal is trapped. It's on a
> > >> CentOS 4.7 system. Same OS as the other running systems. The only
> > >> difference is that this is a newer dual core system. Considerably
> > >> faster also. I've run both a conventional kernel and an OpenVZ kernel.
> >> I'm compiling with " -g -O2" flags.
> >>
> >> I have no idea how to proceed from here. Can anyone suggest something I
> >> could do to find out what's the cause?
> >>
> >> Thanks,
> >> Jim.
> >>
> >>
> > Have you considered running the app through gdb? Multi-threaded apps
> > are a bit more difficult to debug in gdb, but not impossibly so.
> >
> > David
> >
> >
> I have, but I didn't learn anything. When it dies nothing is available
> to inspect. While it's running everything seems to be OK. There are 7
> permanent threads, plus one or two transient threads that are created and destroyed.
>
> This program runs fine for many days on other hardware. It's just these
> two systems that give it grief. I run it locally on a dual core system
> and it works fine here. They have two identical systems and it fails on
> both. I've run both the standard CentOS kernel and the OpenVZ kernel.
> Nothing seems to affect it.
>
> Jim.
>
Since it's not leaving a core and none of the signals you trap are showing up in the log, the only mechanism I know of that's left is the OOM (out of memory) killer, which normally kills with SIGKILL, a signal that can't be trapped. If, as you say, it is a large program sitting idle, that would tend to raise its /proc/<pid>/oom_score.

You can exempt it from the OOM killer with

    echo -17 > /proc/<pid>/oom_adj

after which it will never be chosen as a victim. If you do that and memory really is running out, either you will get a panic or some other process will have to give up its life.
see:
http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_MRG/1.0/html/Realtime_Tuning_Guide/sect-Realtime_Tuning_Guide-General_System_Tuning-Swapping_and_Out_Of_Memory_Tips.html
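
Since you already plan to run it under a wrapper, here is a rough sketch of one that records the oom_score at startup and reports how the program actually ended. It assumes Python 3 is available on that box, which I haven't checked, and the names read_oom_score and describe_exit are just made up for the example. If the wrapper reports "killed by SIGKILL", that fits the OOM killer theory, but it isn't proof on its own; the kernel normally logs OOM kills, so check dmesg around the same time.

```python
import os
import signal
import subprocess
import sys


def read_oom_score(pid, proc_root="/proc"):
    """Return the kernel's OOM score for pid, or None if it can't be read.

    proc_root is a hypothetical parameter, added only so the function
    can be pointed at a fake directory tree for testing.
    """
    path = os.path.join(proc_root, str(pid), "oom_score")
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        # No such process, no /proc on this OS, or unexpected contents.
        return None


def describe_exit(returncode):
    """Turn a subprocess return code into a readable cause of death.

    subprocess reports death by signal N as a return code of -N.
    """
    if returncode < 0:
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "signal %d" % signum
        return "killed by " + name
    return "exited with status %d" % returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 wrapper.py /path/to/program [args...]
    child = subprocess.Popen(sys.argv[1:])
    print("pid", child.pid, "oom_score", read_oom_score(child.pid))
    print(describe_exit(child.wait()))
```

The score is only sampled once at startup here; if you want to watch it climb while the program sits idle, call read_oom_score in a loop with a sleep and log the values.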
--
doug mcnash