[ale] shared research server help
Jim Kinney
jim.kinney at gmail.com
Thu Oct 5 07:52:15 EDT 2017
Back to the original issue:
A tool like Torque or Slurm is really your best solution for intensively shared resources. It prevents two big jobs from eating the same machine, and it can also push users to write code that manages resources better so they can run more jobs.
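As a sketch of what that looks like in practice, a Slurm submission script can declare a hard memory and GPU budget per job; the scheduler then refuses to co-locate jobs that would oversubscribe the node. (The job name and workload below are hypothetical; the #SBATCH flags are standard sbatch options.)

```shell
#!/bin/bash
#SBATCH --job-name=train-model   # hypothetical job name
#SBATCH --mem=30G                # job is killed if it exceeds 30 GB of RAM
#SBATCH --gres=gpu:1             # claim exactly one GPU
#SBATCH --time=02:00:00          # two-hour wall-clock limit

python train.py                  # hypothetical workload
```

With limits like these, two 30 GB jobs fit comfortably on a 64 GB node, and a third has to wait in the queue instead of triggering the OOM killer.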
I have the same problem. One heavy GPU machine (4 Tesla P100s) only has 64 GB of RAM. A student tried to load 200+ GB of data into RAM.
A few crashes later, he can run 2 jobs at once, each of which only eats 30 GB of RAM and one P100.
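If you want a per-user RAM cap without a scheduler, one option on a systemd-based distro is to put a memory limit on the user's login slice, so the cgroup covers all of that user's processes combined. A minimal sketch, assuming systemd with cgroup memory accounting enabled (the UID and the 128G value are illustrative; 128G is half of a 256 GB box):

```shell
# Cap the combined memory of every process owned by UID 1000
# at 128 GB (cgroup v2 property; must be run as root):
systemctl set-property user-1000.slice MemoryMax=128G

# On older systems using cgroup v1, the equivalent property is MemoryLimit:
# systemctl set-property user-1000.slice MemoryLimit=128G
```

When the slice hits the cap, the kernel's OOM killer picks a victim inside that user's cgroup instead of taking out another student's job.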
On October 4, 2017 6:32:32 PM EDT, Todor Fassl <fassl.tod at gmail.com> wrote:
>I manage a group of research servers for grad students at a university.
>
>The grad students use these machines to do the research for their Ph.D.
>theses. The problem is that they pretty regularly kill off each other's
>programs by using up all the RAM. Most of the machines have 256 GB of
>RAM. One kid uses 200 GB and another 100 GB, and one or the other, often
>both, die. Sometimes they bring the machines down by hogging the CPU or
>using up all the RAM. Well, the machines never crash, but they might as
>well be down.
>
>We really, really don't want to force them to use a scheduling system
>like Slurm. They are just learning, and they might run the same piece of
>code 20 times in an hour.
>
>Is there a way to set a limit on the amount of RAM all of a user's
>processes can use? If so, we were thinking of setting it at 50% of the
>on-board RAM. Then it would take 3 students together to trash a machine.
>It might still happen, but it would be a lot less frequent.
>
>Any other suggestions? Anything at all? Just keep in mind that we really
>want to keep it easy for the students to play around.
>
>
>--
>Todd
>_______________________________________________
>Ale mailing list
>Ale at ale.org
>http://mail.ale.org/mailman/listinfo/ale
>See JOBS, ANNOUNCE and SCHOOLS lists at
>http://mail.ale.org/mailman/listinfo
--
Sent from my Android device with K-9 Mail. All tyopes are thumb related and reflect authenticity.