[ale] HPC 101?
Jim Kinney
jim.kinney at gmail.com
Sun Dec 8 10:32:35 EST 2024
You just missed Supercomputing 24 (SC24) in Atlanta.
A major component of HPC is the ability to coordinate computation across
multiple physical compute nodes. That coordination happens through message
passing (MPI), with MPICH being a top contender among implementations.
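To give a feel for what that looks like, here is a minimal sketch using
mpi4py, a Python binding that can be built on top of MPICH. The file name
and launch command are assumptions about a local install, not anything
from Leam's question.

# hello_mpi.py: minimal MPI example (assumes mpi4py built against an MPI
# implementation such as MPICH).
from mpi4py import MPI

comm = MPI.COMM_WORLD          # communicator spanning every launched process
rank = comm.Get_rank()         # this process's ID within the communicator
size = comm.Get_size()         # total number of processes in the job

# Each rank reports in; rank 0 gathers the results so output is ordered.
hostname = MPI.Get_processor_name()
reports = comm.gather((rank, hostname), root=0)

if rank == 0:
    for r, host in sorted(reports):
        print(f"rank {r} of {size} running on {host}")

Run it with something like "mpiexec -n 4 python hello_mpi.py"; pointed at a
host file, MPICH's launcher spreads those ranks across physical nodes.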
Once the compute nodes are all running the same code at once, the next big
bottleneck appears: filesystem IO. Big filesystems in the petabyte range
have challenges with locking, and all of them have a hard time with small
files. With new drives defaulting to a 4 KB block size and workloads
producing many hundreds or thousands of data files in the 1-2 KB range,
the metadata operations turn into a choke point. This is mostly a coding
problem, as devs don't always write for the specific hardware nearly as
well as is needed. The analogy I used once was "unless the engine in that
Ferrari was designed for regular gas, you will not get Ferrari performance
if you fill it with QT regular gas".
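To make the small-file point concrete, here is a rough sketch; the record
counts, sizes, and paths are made up for illustration. Thousands of 1-2 KB
files mean thousands of create/open/close metadata operations, while
packing the same records into one large file turns the work into a handful
of big sequential writes.

# small_files.py: illustration only; counts, sizes, and paths are made up.
import os

records = [os.urandom(1500) for _ in range(5000)]   # ~1.5 KB "result" each

# Pattern 1: one tiny file per record. That is 5000 creates, opens, and
# closes, all of which land on the filesystem's metadata servers.
os.makedirs("out_many", exist_ok=True)
for i, rec in enumerate(records):
    with open(f"out_many/rec_{i:05d}.bin", "wb") as f:
        f.write(rec)

# Pattern 2: pack the same records into one large file. One create, big
# buffered sequential writes, and an offset index so individual records
# can still be located later.
os.makedirs("out_one", exist_ok=True)
index = []
with open("out_one/records.bin", "wb") as f:
    for rec in records:
        index.append((f.tell(), len(rec)))
        f.write(rec)

Real codes usually get the same effect from HDF5 or MPI-IO collective
writes, but the principle is the same: fewer, bigger files that match the
block size the hardware actually has.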
Monitoring tools are essential. You need to know when a node is running
slow, because an MPI job only runs as fast as its slowest node.
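Here is a crude sketch of the kind of check I mean: time identical work on
every rank and flag the outlier. Real clusters lean on tools like Ganglia
or Prometheus node exporters for this; the workload and the 1.5x threshold
below are arbitrary stand-ins.

# slow_node_check.py: crude straggler check; workload and threshold are
# arbitrary, not a real benchmark.
import time
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

comm.Barrier()                        # start every rank together
start = time.perf_counter()
sum(i * i for i in range(2_000_000))  # stand-in for the real per-rank work
elapsed = time.perf_counter() - start

times = comm.gather((elapsed, rank, MPI.Get_processor_name()), root=0)
if rank == 0:
    fastest = min(t for t, _, _ in times)
    for t, r, host in sorted(times, reverse=True):
        tag = "  <-- possible straggler" if t > 1.5 * fastest else ""
        print(f"rank {r} on {host}: {t:.3f}s{tag}")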
Job scheduling: Slurm. Don't waste time with anything else unless a
hyper-specific need for another scheduler is made mandatory. In that case,
push back like mad until Slurm is chosen anyway. All schedulers are broken
in some way, but Slurm solved the brokenness of all the others and traded
it for its own little problems.
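For reference, submitting the MPI example above through Slurm looks
roughly like this. The node counts, time limit, and partition name are
placeholders for whatever your cluster actually defines.

# submit_job.py: sketch of generating and submitting a Slurm batch script.
# Resource numbers and the partition name are placeholders.
import subprocess
from pathlib import Path

batch_script = """\
#!/bin/bash
#SBATCH --job-name=hello_mpi
#SBATCH --nodes=2                # two physical compute nodes
#SBATCH --ntasks-per-node=4      # 4 MPI ranks per node, 8 total
#SBATCH --time=00:10:00          # wall-clock limit
#SBATCH --partition=batch        # placeholder partition name

# srun launches the MPI ranks under Slurm's control.
srun python hello_mpi.py
"""

Path("hello_mpi.sbatch").write_text(batch_script)
subprocess.run(["sbatch", "hello_mpi.sbatch"], check=True)

squeue shows where the job is sitting in the queue, and sacct shows how it
ran after the fact.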
On Sun, Dec 8, 2024, 8:38 AM Leam Hall via Ale <ale at ale.org> wrote:
> I'd like to learn more about HPC and Linux. Anyone have resources to share?
>
> Thanks!
>
> Leam