And by running his projects outside of condor, he wasn't subject to
being "scheduled", and condor couldn't properly adjust the scheduling
for everyone else.

Yep. And he's probably a senior boss type person. He's also a lazy
researcher who doesn't want to learn how to use the new tools. He
probably wrote a bunch of code in Fortran "back in the day", the
central portion can't be parallelized because it has too many nested
loops for the auto-parallelizing tools, and rewriting the mess into
another format is "inconceivable". That said, by hogging all the
machines and making the other users' projects have to wait, it
justifies getting more cluster nodes every year or so :-)

Take one of the older clusters, modernize the software, and make it
all his. Show how running across all 40 nodes is faster than using 5
or 6 new ones.

On Tue, 2015-10-27 at 10:47 -0500, Todor Fassl wrote:
Man, I wish I had your "hand" (Seinfeld reference). I'd get fired if I
tried that.
We had another guy who kept running his code on like 5 or 6 different
machines at a time. I kept trying to steer him toward condor. He
insisted condor wouldn't work for him. How can it not?
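Even a bare-bones submit file would have put his runs under the
scheduler. A sketch, with made-up executable and file names:

  # run_model.sub -- hypothetical HTCondor submit description
  universe     = vanilla
  executable   = run_model
  arguments    = input.dat
  output       = run_model.out
  error        = run_model.err
  log          = run_model.log
  request_cpus = 1
  queue

Then "condor_submit run_model.sub" instead of ssh-ing into half a
dozen machines by hand.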
On 10/27/2015 10:38 AM, Jim Kinney wrote:
I implemented a cron job to delete scratch data created over 30 days
ago. That didn't go over well with the people who were eating up all
the space and not paying for hard drives. So I gave them a way to
extend particular areas up to 90 days; on day 91 it was deleted. So
they wrote a script to copy their internet archive around every 2
weeks to keep the creation date under the 30-day cutoff. So I shrank
the /scratch partition to about 10G larger than was currently in use.
He couldn't do the runs he needed to graduate on time without cleaning
up his mess. It also pissed off other people, and they yelled at him
when I gave my report on who the storage hog was.
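The cleanup itself was nothing fancy. Something along these lines (a
sketch; the path and retention period are examples, and mtime stands
in for "creation date" since that's what the filesystem actually
tracks):

  # /etc/cron.daily/scratch-clean -- hypothetical cleanup script
  find /scratch -type f -mtime +30 -delete

which is also why copying the files around every couple of weeks was
enough to reset the clock on them.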
On October 27, 2015 11:24:48 AM EDT, Todor Fassl <fassl.tod@gmail.com>
wrote:
I dunno. First of all, I don't have any details on what's going on on
the HPC cluster. All I know is the researcher says he needs to back up
his 3T of scratch data because they are telling him it will be erased
when they upgrade something or other. Also, I don't know how you can
have 3T of scratch data or why, if it's scratch data, it can't just be
deleted. I come across this all the time though. Researchers pretty
regularly generate 1T+ of what they insist is scratch data.
In fact, I've had this discussion with this very same researcher. He's
not the only one who does this, but he happens to be the guy I last
questioned about it. You know this "scratch" space isn't backed up or
anything. If the NAS burns up or if you type in the wrong rm command,
it's gone. No problem, it's just scratch data. Well, then how come I
can't just delete it when I want to re-do the network storage
device?
They get mad if you push them too hard.
On 10/27/2015 09:45 AM, Jim Kinney wrote:
Dumb question: Why is data _stored_ on an HPC cluster? The storage for
an HPC should be a separate entity entirely. It's a High Performance
cluster, not a Large Storage cluster. Ideally, a complete teardown and
rebuild of an HPC should have exactly zero impact on the HPC users'
data. Any data kept on the local space of an HPC is purely scratch/temp
data and is disposable, with the possible exception of checkpoint data,
and that should be written back to the main storage and deleted once
the full run is completed.
On Tue, 2015-10-27 at 08:33 -0500, Todor Fassl wrote:
One of the researchers I support wants to back up 3T of data to his
space on our NAS. The data is on an HPC cluster on another network.
It's not an ongoing backup. He just needs to save it to our NAS while
the HPC cluster is rebuilt. Then he'll need to copy it right back.
There is a very stable 1G connection between the 2 networks. We have
plenty of space on our NAS. What is the best way to do the copy?
Ideally, it seems we'd want to have both the ability to restart the
copy if it fails partway through and to end up with a compressed
archive like a tarball. Googling around tends to suggest that it's
either rsync or tar. But with rsync, you wouldn't end up with a
tarball. And with tar, you can't restart it in the middle. Any other
ideas? Since the network connection is very stable, I am thinking of
suggesting tar.
tar zcvf - /datadirectory | ssh user@backup.server "cat > backupfile.tgz"
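If it helps, a quick sanity check on the far end once that finishes
would just be to read the whole archive back (assuming the same user,
host, and file name as above):

  ssh user@backup.server "tar ztf backupfile.tgz > /dev/null && echo archive OK"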
If the researcher would prefer his data to be copied to our NAS as
regular files, just use rsync with compression. We don't have an rsync
server that is accessible to the outside world. He could use ssh with
rsync, but I could set up rsync if it would be worthwhile.
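A sketch of the rsync-over-ssh version (the destination path is made
up; compression happens in transit even though the files land
uncompressed):

  rsync -az --partial --progress /datadirectory/ user@backup.server:/nas/researcher-backup/

Re-running the same command after a dropped connection skips whatever
already made it across and finishes the rest, which is the
restartability tar doesn't give us.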
Ideas? Suggestions?
He is going to need to copy the data back in a few weeks. It might
even be worthwhile to send it via tar without uncompressing/unarchiving
it at the far end.
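In that case the copy back could just stream the tarball from the NAS
and unpack it in one pass, run from the cluster side (same assumed
names as above; tar stores the paths without the leading slash, so
-C / puts everything back where it came from):

  ssh user@backup.server "cat backupfile.tgz" | tar zxf - -C /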
--
James P. Kinney III
Every time you stop a school, you will have to build a jail. What you
gain at one end you lose at the other. It's like feeding a dog on his
own tail. It won't fatten the dog.
- Speech 11/23/1900 Mark Twain

http://heretothereideas.blogspot.com/