[ale] HPC 101?

Jim Kinney jim.kinney at gmail.com
Sun Dec 8 12:55:21 EST 2024


HPC is enterprise-like computing on steroids for scale and with tighter
deadlines (milliseconds) and much worse code 😆

On Sun, Dec 8, 2024, 10:58 AM Leam Hall via Ale <ale at ale.org> wrote:

> /me scribbles notes
>
> Jim, thanks! You were the first person I thought of. I sent it to the list
> because others may have the same question. I've been on the periphery of
> HPC off and on, and have worked bigger non-HPC sites. Even did a SLES on Z
> Proof of Concept years ago.
>
> Leam
>
>
>
> On 12/8/24 09:32, Jim Kinney via Ale wrote:
> > You just missed Super Computing 24 in Atlanta.
> >
> > A major component of HPC is the ability to coordinate computation across
> > multiple physical compute nodes. This is coordinated by message passing
> > (MPI), with MPICH being a top contender.
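
To make the coordination idea concrete, here is a minimal sketch of how work is commonly divided among MPI ranks. It is plain Python and does not call MPI. In a real job, `rank` and `size` would come from the MPI library (MPI_Comm_rank and MPI_Comm_size), and the final sum would be a reduction across nodes. The function names `block_range` and `simulated_sum` are made up for illustration.

```python
def block_range(rank: int, size: int, n_items: int) -> range:
    """Return the contiguous block of item indices owned by `rank`.

    Assumes ranks are numbered 0..size-1, as MPI does.
    """
    base, extra = divmod(n_items, size)
    # The first `extra` ranks each take one additional item.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


def simulated_sum(values, size: int):
    """Stand-in for a sum reduction, run serially in one process.

    Each simulated rank sums only its own block, then the partial
    sums are combined, which is what a reduction does across nodes.
    """
    partials = [
        sum(values[i] for i in block_range(rank, size, len(values)))
        for rank in range(size)
    ]
    return sum(partials)


if __name__ == "__main__":
    data = list(range(10))
    for rank in range(3):
        print(rank, list(block_range(rank, 3, len(data))))
    print("total:", simulated_sum(data, 3))
```

Every rank runs the same program and uses only its rank number to decide which slice to work on, which is why one slow rank holds up the combined result.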
> >
> > Once the compute nodes are running the same code all at once, the big
> > bottleneck is next - filesystem IO. Big filesystems, petabyte range, have
> > challenges with locking, and all have a hard time with small files. With
> > new drives defaulting to a 4k block size and many hundreds or thousands of
> > data files in the 1-2k size, the metadata operations turn into a choke
> > point. This is mostly a coding problem as devs don't always write for the
> > specific hardware nearly as well as is needed. The analogy I used once was
> > "unless the engine in that Ferrari was designed for regular gas, you will
> > not get Ferrari performance if you fill it with QT regular gas".
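
A rough back-of-the-envelope sketch of the small-file cost, assuming the filesystem allocates whole 4k blocks per file and does no inline-data or tail-packing tricks (real filesystems vary). It ignores inode and directory overhead, so it understates the metadata side. The function names and the 1536-byte file size are illustrative only.

```python
def allocated_bytes(file_size: int, block_size: int = 4096) -> int:
    """Bytes occupied on disk if space is handed out in whole blocks."""
    # Integer ceiling division: number of blocks needed, times block size.
    blocks = (file_size + block_size - 1) // block_size
    return blocks * block_size


def small_file_cost(file_sizes, block_size: int = 4096) -> dict:
    """Compare many small files against the same bytes packed into one file."""
    payload = sum(file_sizes)
    on_disk = sum(allocated_bytes(s, block_size) for s in file_sizes)
    return {
        "files": len(file_sizes),            # one metadata entry per file
        "payload_bytes": payload,
        "on_disk_bytes": on_disk,
        "packed_blocks": (payload + block_size - 1) // block_size,
    }


if __name__ == "__main__":
    # 1000 files of 1.5k each, as in the 1-2k range mentioned above.
    print(small_file_cost([1536] * 1000))
```

With these assumptions, 1000 files of 1.5k take 1000 blocks and 1000 separate metadata lookups, while the same data packed into a single file fits in 375 blocks behind one open.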
> >
> > Monitoring tools are essential. You need to know if a node is being slow,
> > as an MPI job runs only as fast as its slowest node.
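
The "slowest node" point can be sketched in a few lines: the time for a synchronized step is the maximum over all nodes, so a monitoring check can flag any node well above the typical time. The node names, the timings, and the 1.5x threshold are made up for illustration. A real site would pull these numbers from its monitoring stack.

```python
from statistics import median


def step_time(node_times: dict) -> float:
    """A synchronized step finishes only when the slowest node does."""
    return max(node_times.values())


def find_stragglers(node_times: dict, factor: float = 1.5) -> list:
    """Return nodes whose time exceeds `factor` times the median time."""
    typical = median(node_times.values())
    return sorted(
        name for name, seconds in node_times.items()
        if seconds > factor * typical
    )


if __name__ == "__main__":
    # Hypothetical per-node wall-clock seconds for one compute step.
    times = {"node01": 10.0, "node02": 10.5, "node03": 9.8, "node04": 21.0}
    print("step time:", step_time(times))
    print("stragglers:", find_stragglers(times))
```

Using the median instead of the mean keeps a single very slow node from dragging the baseline up and hiding itself.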
> >
> > Job scheduling - Slurm. Don't waste time with anything else unless a
> > hyperspecific need for another scheduler is made mandatory. In that case,
> > push back like mad until Slurm is chosen anyway. All schedulers are broken
> > in some way. But Slurm solved the brokenness of all the others to make its
> > own little problems.
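
For anyone who has not seen a Slurm job, here is a small sketch that renders a basic batch script as text. The `#SBATCH` options shown (`--job-name`, `--nodes`, `--ntasks-per-node`, `--time`) and `srun` are standard Slurm, but the helper `render_sbatch`, the job name, and the `./my_mpi_app` command are hypothetical. Partition, account, and module settings are site-specific and left out.

```python
def render_sbatch(job_name: str, nodes: int, tasks_per_node: int,
                  time_limit: str, command: str) -> str:
    """Build the text of a minimal Slurm batch script.

    `time_limit` uses Slurm's HH:MM:SS form, e.g. "00:30:00".
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        f"#SBATCH --time={time_limit}",
        "",
        # srun launches the command across the allocated tasks.
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical job: 4 nodes, 32 tasks each, 30 minute limit.
    print(render_sbatch("hpc101", 4, 32, "00:30:00", "./my_mpi_app"))
```

The output would be saved to a file and handed to `sbatch`. Generating it from Python is only a convenience here, and most sites simply keep the script in a text file.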
> >
> > On Sun, Dec 8, 2024, 8:38 AM Leam Hall via Ale <ale at ale.org> wrote:
> >
> >> I'd like to learn more about HPC and Linux. Anyone have resources to
> >> share?
> >>
> >> Thanks!
> >>
> >> Leam
> >> _______________________________________________
> >> Ale mailing list
> >> Ale at ale.org
> >> https://mail.ale.org/mailman/listinfo/ale
> >> See JOBS, ANNOUNCE and SCHOOLS lists at
> >> http://mail.ale.org/mailman/listinfo
> >>
> >
> >
>
> --
> Linux Software Engineer   (reuel.net/resume)
> Scribe: The Domici War    (domiciwar.net)
> Coding Ne'er-do-well      (github.com/LeamHall)
>
> Between "can" and "can't" is a gap of "I don't know", a place of
> discovery. For the passionate, much of "can't" falls into "yet". -- lh
>
> Practice allows options and foresight. -- lh