<html><head></head><body><div>And by running his projects outside of condor, he wasn't subject to being "scheduled" and condor couldn't properly adjust the scheduling for everyone else.</div><div><br></div><div>Yep. And he's probably a senior boss type person. He's also a lazy researcher who doesn't want to learn how to use the new tools. Probably wrote a bunch of code in Fortran "back in the day" and the central portion can't be parallelized because it has too many nested loops for auto-parallelizing tools, and rewriting the mess into another format is "inconceivable". That said, hogging all the machines and making the other users' projects wait justifies getting more cluster nodes every year or so :-)</div><div><br></div><div>Take one of the older clusters, modernize the software and make it all his. Show how running across all 40 nodes is faster than using 5 or 6 new ones.</div><div><br></div><div>On Tue, 2015-10-27 at 10:47 -0500, Todor Fassl wrote:</div><blockquote type="cite"><pre>Man, I wish I had your "hand" (Seinfeld reference). I'd get fired if I 

tried that.

We had another guy who kept running his code on like 5 or 6 different 

machines at a time.  I kept trying to steer him toward condor.  He 

insisted condor wouldn't work for him. How can it not?

On 10/27/2015 10:38 AM, Jim Kinney wrote:

<blockquote type="cite">

I implemented a cron job to delete scratch data created over 30 days

ago. That didn't go well with the people who were eating up all space

and not paying for hard drives. So I gave them a way to extend

particular areas up to 90 days. Day 91 it was deleted. So they wrote a

script to copy their internet archive around every 2 weeks to keep the

creation date below the 30 day cut off. So I shrunk the partition of

/scratch to about 10G larger than was currently in use. He couldn't do

his runs to graduate in time without cleaning up his mess. It also

pissed off other people and they yelled at him when I gave my report of

who the storage hog was.
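
The age-based cleanup described above can be sketched roughly as follows. This is a minimal sketch, not the actual cron job: the function name purge_scratch is made up for illustration, and it tests modification time because most Linux filesystems do not expose a creation time to find. A fresh copy of a file gets a new timestamp, which is how the copy-it-around script stays under the cutoff.

```shell
#!/bin/sh
# Hypothetical helper, not the original cron job.
# purge_scratch DIR [DAYS]
# Deletes regular files under DIR whose last modification is more than
# DAYS whole days old (find rounds file age down to full 24-hour periods).
purge_scratch() {
    scratch_dir=$1
    max_age_days=${2:-30}
    # -xdev stays on the scratch filesystem; directories are left in place.
    find "$scratch_dir" -xdev -type f -mtime +"$max_age_days" -exec rm -f -- {} +
}
```

A nightly cron entry could then call purge_scratch /scratch 30, with a second call using 90 for any areas that were granted the extension.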

On October 27, 2015 11:24:48 AM EDT, Todor Fassl &lt;<a href="mailto:fassl.tod@gmail.com">fassl.tod@gmail.com</a>&gt;

wrote:

    I dunno.  First of all, I don't have any details on what's going on on

    the HPC cluster. All I know is the researcher says he needs to back up

    his  3T of scratch data because they are telling him it will be erased

    when they upgrade something or other. Also, I don't know how you can

    have 3T of scratch data or why, if it's scratch data, it can't just be

    deleted. I come across this all the time though. Researchers pretty

    regularly generate 1T+ of what they insist is scratch data.

    In fact, I've had this discussion with this very same researcher. He's

    not the only one who does this but he happens to be the guy who I last

    questioned about it. You know this "scratch" space isn't backed up or

    anything. If the NAS burns up or if you type in the wrong rm command,

    it's gone. No problem, it's just scratch data. Well, then how come I

    can't just delete it when I want to re-do the network storage

    device?

    They get mad if you push them too hard.

    On 10/27/2015 09:45 AM, Jim Kinney wrote:

        Dumb question: Why is data _stored_ on an HPC cluster? The

        storage for

        an HPC should be a separate entity entirely. It's a High Performance

        cluster, not a Large Storage cluster. Ideally, a complete

        teardown and

        rebuild of an HPC should have exactly zero impact on the HPC users'

        data. Any data kept on the local space of an HPC is purely

        scratch/temp

        data and is disposable with the possible exception of checkpoint

        data

        and that should be written back to the main storage and deleted

        once the

        full run is completed.

        On Tue, 2015-10-27 at 08:33 -0500, Todor Fassl wrote:

            One of the researchers I support wants to back up 3T of data

            to his space

            on our NAS. The data is on an HPC cluster on another

            network. It's not

            an on-going backup. He just needs to save it to our NAS

            while the HPC

            cluster is rebuilt. Then he'll need to copy it right back.

            There is a very stable 1G connection between the 2 networks.

            We have

            plenty of space on our NAS. What is the best way to do the

            copy?

            Ideally, it seems we'd want to have both the ability to

            restart the copy

            if it fails part way through and to end up with a compressed

            archive

            like a tarball. Googling around tends to suggest that it's

            either rsync

            or tar. But with rsync, you wouldn't end up with a tarball.

            And with

            tar, you can't restart it in the middle. Any other ideas?

            Since the network connection is very stable, I am thinking

            of suggesting

            tar.

            tar zcvf - /datadirectory | ssh user@backup.server "cat &gt; backupfile.tgz"

            If the researcher would prefer his data to be copied to our

            NAS as

            regular files, just use rsync with compression. We don't

            have an rsync

            server that is accessible to the outside world. He could use

            ssh with

            rsync but I could set up rsync if it would be worthwhile.

            Ideas? Suggestions?

            on at the far end.

            He is going to need to copy the data back in a few weeks. It

            might even

            be worthwhile to send it via tar without

            uncompressing/unarchiving it on

            receiving end.
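
The two options weighed above can be sketched as small shell functions. This is a sketch under assumptions, not a tested recipe for this site: user@backup.server and backupfile.tgz are the placeholders from the message, and the function names are invented here. The tar stream yields a single compressed archive but has to start over if it is interrupted. The rsync variant can be re-run after a failure and picks up where it left off, but leaves regular files rather than a tarball.

```shell
#!/bin/sh
# Sketch only: host and file names are the thread's placeholders and the
# function names are hypothetical.

# stream_tarball SRC_DIR DEST_FILE [TRANSPORT ...]
# Streams a gzip-compressed tar of SRC_DIR into DEST_FILE on the far end.
# TRANSPORT is the command that runs a shell string remotely; it defaults
# to "ssh user@backup.server".
stream_tarball() {
    src_dir=$1
    dest_file=$2
    shift 2
    if [ "$#" -eq 0 ]; then
        set -- ssh user@backup.server
    fi
    # -C makes the archive hold only the final directory name.
    # A plain sh pipeline reports the status of its last command, so a
    # tar failure is not detected here; verify the result afterwards.
    tar -C "$(dirname "$src_dir")" -czf - "$(basename "$src_dir")" |
        "$@" "cat > '$dest_file'"
}

# verify_tarball DEST_FILE [TRANSPORT ...]
# Runs gzip's integrity test on the far-end archive.
verify_tarball() {
    dest_file=$1
    shift
    if [ "$#" -eq 0 ]; then
        set -- ssh user@backup.server
    fi
    "$@" "gzip -t '$dest_file'"
}

# resumable_copy SRC_DIR DEST
# Restartable alternative: rsync over ssh with compression, keeping
# partially transferred files so a re-run can continue.
resumable_copy() {
    rsync -az --partial -e ssh "$1" "$2"
}
```

For example, stream_tarball /datadirectory backupfile.tgz followed by verify_tarball backupfile.tgz, or resumable_copy /datadirectory user@backup.server:backup/ if plain files on the NAS are acceptable.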

            ------------------------------------------------------------------------

            Ale mailing list

            <a href="mailto:Ale@ale.org">Ale@ale.org</a> &lt;<a href="mailto:Ale@ale.org">mailto:Ale@ale.org</a>&gt;

            <a href="http://mail.ale.org/mailman/listinfo/ale">http://mail.ale.org/mailman/listinfo/ale</a>

            See JOBS, ANNOUNCE and SCHOOLS lists at

            <a href="http://mail.ale.org/mailman/listinfo">http://mail.ale.org/mailman/listinfo</a>

        --

        James P. Kinney III

        Every time you stop a school, you will have to build a jail.

        What you

        gain at one end you lose at the other. It's like feeding a dog

        on his

        own tail. It won't fatten the dog.

        - Speech 11/23/1900 Mark Twain

        <a href="http://heretothereideas.blogspot.com/">http://heretothereideas.blogspot.com/</a>


--

Sent from my Android device with K-9 Mail. Please excuse my brevity.

</blockquote>

</pre></blockquote></body></html>