<div dir="ltr"><div><div><div>The politics can get messy. Jeffry's later post about providing data on the hog issue is spot on.<br><br></div>I use ganglia to provide a real-time display of cluster usage (RAM, CPU, networking, adding GPU now).<br><br></div>I guess I'm pretty lucky as I'm also "just the IT guy" but I get to make it plain that my job is to help them graduate. Yes, I do spend time individually helping each student learn how to not break things. I also make it very plain that a system crash is an extreme failure on their part. Sure, I have to reboot a machine (YAY addressable PDUs and IPMI!) but it breaks _their_ work worse. My current quest is to beat them all with the clue-by-four of parallelism. LEARN how to think in parallel processes. LEARN how to write code that can support multiple threads. LEARN how to split large data sets into chunks that can be processed by multiple systems/cores/nodes/gpus, etc.<br><br></div><div>Latest fun: machine learning on image analysis for eye tracking from a video for ADHD work. It generates a video with 15K frames; each frame has a data set of eye position in pixel coordinates per eye. The process uses the worst design of all: each frame is cropped to generate an enlarged image of each individual eye, and that is then run again to determine gaze direction. 15,000 frames -> 30,000 images => all single threaded. <sigh><br></div><div><br></div><rant> I don't come from a comp-sci background so I've had to figure out a lot on my own. It seems the younger programmers are more and more disconnected from the reality of the hardware they use. "Load this data set and start my algorithm" is the mindset. The engineering mentality of HOW to do the process using both hardware and software is missing. 
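<div><br>A minimal sketch of the "split it into chunks" idea from above, in Python and under stated assumptions. It is not the students' pipeline: <code>estimate_gaze</code> is a hypothetical stand-in for the per-image gaze step, and <code>process_all</code>, <code>chunked</code> and the chunk size of 500 are made-up names and numbers for illustration. It only shows the shape of the thing: 30,000 independent images can be handed to a pool of worker processes instead of one loop.</div>
<pre>
```python
from concurrent.futures import ProcessPoolExecutor
from itertools import islice
import os


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def estimate_gaze(eye):
    """Hypothetical stand-in for the real per-image gaze model.

    `eye` is (frame_number, eye_label, x, y); this just echoes its input.
    """
    frame, label, x, y = eye
    return (frame, label, (x, y))


def process_chunk(chunk):
    """Run the per-image step over one chunk inside a worker process."""
    return [estimate_gaze(eye) for eye in chunk]


def process_all(eyes, workers=None, chunk_size=500):
    """Fan chunks out to a process pool; results keep the input order."""
    workers = workers or os.cpu_count() or 1
    results = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for part in pool.map(process_chunk, chunked(eyes, chunk_size)):
            results.extend(part)
    return results


if __name__ == "__main__":
    # 15,000 frames x 2 eyes = 30,000 independent work items.
    eyes = [(f, label, 0, 0)
            for f in range(15_000)
            for label in ("left", "right")]
    print(len(process_all(eyes)))
```
</pre>
<div>Processes rather than threads are used here because CPU-bound Python code does not usually speed up with threads. The chunk size is a knob to tune: too small and the overhead of passing work around dominates, too large and some cores sit idle at the end.<br></div>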
</rant><br></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Oct 5, 2017 at 9:27 AM, Todor Fassl <span dir="ltr"><<a href="mailto:fassl.tod@gmail.com" target="_blank">fassl.tod@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Right, Jim, another aspect of this problem is that most of the students don't even realize they need to be careful, much less how to be careful. "What? Is there a problem with me asking for 500 gigabytes of ram?" Well, the machine has only 256. But I'm just the IT guy and it's not my place to demand that these students demonstrate a basic understanding of sharing resources before getting started. The instructors would never go for that. I am pretty much stuck providing that informally on a one-to-one basis. But I think it would be valuable for me to work on automating that somehow. Pointers to the wiki, stuff like that.<br>
<br>
Somebody emailed me off list and made a really good point. The key, I think, is information. Well, that and peer pressure. I know nagios can trigger an alert when a machine runs low on ram or cpu cycles. It might even be able to determine who is running the procs that are causing it. I can at least put all the users in a nagios group and send them alerts when a research server is near an OOM event. I'll have to see what kind of granularity I can get out of nagios and experiment with who gets notified. I can do things like keep widening the group that gets notified of an event if the original setup turns out to be ineffective.<br>
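<div><br>As a rough sketch of the "who is running the procs" part, assuming a Linux box with <code>/proc</code> mounted: the Python below adds up resident memory per user from <code>/proc/PID/status</code> and lists anyone over a threshold. The function names and the 50% threshold are assumptions for illustration, not anything nagios provides, and summing RSS counts shared pages more than once, so treat the totals as an upper bound.</div>
<pre>
```python
import os
import pwd
from collections import defaultdict


def parse_status(text):
    """Return (uid, rss_kib) from the text of a /proc/PID/status file."""
    uid, rss = None, 0
    for line in text.splitlines():
        if line.startswith("Uid:"):
            uid = int(line.split()[1])   # first field is the real UID
        elif line.startswith("VmRSS:"):
            rss = int(line.split()[1])   # value is reported in kB
    return uid, rss


def rss_by_user(proc="/proc"):
    """Sum resident memory (KiB) per UID over all visible processes."""
    totals = defaultdict(int)
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "status")) as fh:
                uid, rss = parse_status(fh.read())
        except OSError:
            continue  # the process exited while we were looking
        if uid is not None:
            totals[uid] += rss
    return dict(totals)


def hogs(totals, mem_total_kib, threshold=0.5):
    """UIDs whose summed RSS exceeds `threshold` of RAM, biggest first."""
    limit = mem_total_kib * threshold
    over = [(uid, kib) for uid, kib in totals.items() if kib > limit]
    return sorted(over, key=lambda pair: pair[1], reverse=True)


def mem_total_kib(path="/proc/meminfo"):
    """Read total RAM in KiB from /proc/meminfo."""
    with open(path) as fh:
        for line in fh:
            if line.startswith("MemTotal:"):
                return int(line.split()[1])
    raise RuntimeError("MemTotal not found")


if __name__ == "__main__":
    for uid, kib in hogs(rss_by_user(), mem_total_kib()):
        try:
            name = pwd.getpwuid(uid).pw_name
        except KeyError:
            name = str(uid)  # no passwd entry for this UID
        print(f"{name}: {kib / 1024 / 1024:.1f} GiB resident")
```
</pre>
<div>Something like this could run from cron or be wrapped as a check script, with its output used to decide who gets the alert.<br></div>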
<br>
This list has really come through for me again just with ideas I can bounce around. I'll have to tread lightly though. About a year ago, I configured the machines in our shared labs to log someone off after 15 minutes of inactivity. Believe it or not, that was controversial. Not with the faculty but with the students using the labs. It was an easy win for me but some of the students went to the faculty with complaints. Wait, you're actually defending your right to walk away from a workstation in a public place still logged in? In a way that's not such a bad thing. This is a university and the students should run the place. But they need a referee.<div class="HOEnZb"><div class="h5"><br>
<br>
<br>
<br>
<br>
On 10/05/2017 06:52 AM, Jim Kinney wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Back to the original issue:<br>
<br>
A tool like torque or slurm is really your best solution to intensive shared resources. It prevents 2 big jobs from eating the same machine and can also encourage users to code better to manage resources better so they can run more jobs.<br>
<br>
I have the same problem. One heavy gpu machine (4 tesla P100) only has 64 G ram. Student tried to load in 200+G of data into ram.<br>
<br>
A few crashes later he can run 2 jobs at once, each only eats 30G ram and one p100.<br>
<br>
On October 4, 2017 6:32:32 PM EDT, Todor Fassl <<a href="mailto:fassl.tod@gmail.com" target="_blank">fassl.tod@gmail.com</a>> wrote:<br>
<br>
I manage a group of research servers for grad students at a university.<br>
The grad students use these machines to do the research for their Ph.D<br>
theses. The problem is that they pretty regularly kill off each other's<br>
programs by using up all the ram. Most of the machines have 256G of ram.<br>
One kid uses 200GB and another 100GB and one or the other, often both,<br>
die. Sometimes they bring the machines down by hogging the cpu or using<br>
up all the ram. Well, the machines never crash but they might as well be<br>
down.<br>
<br>
We really, really don't want to force them to use a scheduling system<br>
like slurm. They are just learning and they might run the same piece of<br>
code 20 times in an hour.<br>
<br>
Is there a way to set a limit on the amount of ram all of a user's<br>
processes can use? If so, we were thinking of setting it at 50% of the<br>
on-board ram. Then it would take 3 students together to trash a machine.<br>
It might still happen but it would be a lot more infrequent.<br>
<br>
Any other suggestions? Anything at all? Just keep in mind that we really<br>
want to keep it easy for the students to play around.<br>
<br>
<br>
-- <br>
Sent from my Android device with K-9 Mail. All tyopes are thumb related and reflect authenticity.<br>
</blockquote>
<br></div></div><span class="HOEnZb"><font color="#888888">
-- <br>
Todd<br>
</font></span></blockquote></div><br><br clear="all"><br>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">-- <br>James P. Kinney III<br><i><i><i><i><br></i></i></i></i>Every time you stop a school, you will have to build a jail. What you
gain at one end you lose at the other. It's like feeding a dog on his
own tail. It won't fatten the dog.<br>
- Speech 11/23/1900 Mark Twain<br><i><i><i><i><br><a href="http://heretothereideas.blogspot.com/" target="_blank">http://heretothereideas.blogspot.com/</a><br></i></i></i></i></div></div>
</div>