<html><head></head><body>That's the same test I'll be doing tomorrow on a real node. I have a small numpy app that generates a 10,000x10,000 random matrix, inverts it, and then does some other easy but parallelized math. It will eat as many CPU cores as I give it, and a 10k matrix slurps a boatload of RAM.<br><br>So my plan is to find out which PID to send that STOP to: the shepherd, or the actual job monitored by the shepherd. I'm betting on the job.<br><br>If this works reliably, I can add pre and post scripts to stop and then continue other jobs.<br><br>Or I just hork the master up and tomorrow blows up in my face. Good times!<br><br>Oh, RAM and swap. I'm letting the kernel deal with that. My test is basically two of the numpy jobs, so the paused one will have to get at least partially swapped out and back in.<br><br><div class="gmail_quote">On February 8, 2021 7:43:29 PM EST, Steve Litt via Ale &lt;ale@ale.org&gt; wrote:<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<pre class="k9mail">On Mon, 08 Feb 2021 14:13:55 -0500<br>Jim Kinney via Ale &lt;ale@ale.org&gt; wrote:<br><br><br><blockquote class="gmail_quote" style="margin: 0pt 0pt 1ex 0.8ex; border-left: 1px solid #729fcf; padding-left: 1ex;">I want to send Bob's job a SIGSTOP and let Mary's job run to<br>completion. Then send a SIGCONT and Bob is back running.<br></blockquote><br>I just created a 2.7GB text file, called junk.jnk, that has 293 million<br>lines. I ran gkrellm and then ran the following:<br><br>sort junk.jnk<br>kill -SIGSTOP 18665<br><br>CPU usage immediately dropped from 66% to about 2%. A few seconds later<br>I did:<br><br>kill -SIGCONT 18665<br><br>CPU usage went back up to 66%. So based on that, it seems like the<br>STOP/CONT combination works well. I think Bob's job would eventually<br>swap out. If you REALLY want to swap it out quickly, you could write a<br>C program that does nothing but malloc() and copy bogus bytes to the<br>newly allocated pointers (because without the bogus bytes, you don't<br>really consume RAM). Have it malloc() about the same amount of RAM as<br>you expect Mary's process will need. Then free() it all and exit. I<br>suspect this quick program would cause Bob's stopped program to swap,<br>leaving the path clear for Mary's program to run.<br><br>SteveT<br><br>Steve Litt <br>Autumn 2020 featured book: Thriving in Tough Times<br><a href="http://www.troubleshooters.com/thrive">http://www.troubleshooters.com/thrive</a><hr>Ale mailing list<br>Ale@ale.org<br><a href="https://mail.ale.org/mailman/listinfo/ale">https://mail.ale.org/mailman/listinfo/ale</a><br>See JOBS, ANNOUNCE and SCHOOLS lists at<br><a href="http://mail.ale.org/mailman/listinfo">http://mail.ale.org/mailman/listinfo</a><br></pre></blockquote></div><br>-- <br>Computers amplify human error<br>Super computers are really cool</body></html>