<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 9/9/18 9:35 AM, Jim Kinney wrote:<br>
</div>
<blockquote type="cite"
cite="mid:B99091A3-0AFB-4876-9CDB-A8A24835547D@gmail.com">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Ha!!<br>
<br>
My ultimate plan with the APS project was to cobble all the "waste
machines" together into a system-wide distributed storage cluster,
then run a single-system-image cluster tool like MOSIX or
Kerrighed. That would make the entire school district a
single-instance giant supercomputer that K12 students use :-)<br>
</blockquote>
I haven't had occasion to use this yet, but Hadoop has a "rack
awareness" feature: you can divide your nodes into "racks" so that the
NameNode and ResourceManager expect network bottlenecks between groups
of nodes and place storage and processing accordingly; in the APS
environment one would group each school into a "rack" (no idea if
Hadoop supports "racks" of "racks").<br>
<blockquote type="cite"
cite="mid:B99091A3-0AFB-4876-9CDB-A8A24835547D@gmail.com">
<br>
Excellent work on the hadoop cluster!<br>
</blockquote>
Thanks! It's been interesting and I've covered a few things along
the way I'd long wanted to be able to do (like PXE-booting), so,
bonus.<br>
<blockquote type="cite"
cite="mid:B99091A3-0AFB-4876-9CDB-A8A24835547D@gmail.com">
<br>
I have my cluster nodes set to always PXE-boot, and the PXE boot
default is to fall back to the local boot drive. That way I can drive
a new install/rebuild by twiddling a file on the DHCP server and
rebooting a node. Eventually, the nodes will use no local storage for
the OS; all of it will be reserved for /tmp (RAID0 across all drives
for speed) and they'll use an NFS-mounted root and remote logging.
Basically a homegrown PaaS setup controlled by job-submission-defined
need. <br>
</blockquote>
I thought about running the working instance out of RAM and leaving
everything NFS-based, but I decided against it. For one thing, Hadoop
activity alone hammers the LAN; if a lot of additional traffic (the
Hadoop binary distribution alone is about 830MiB) has to concentrate
on, say, the edge node where the worker-node instance actually lives
while there's a lot of HDFS traffic shooting from node to node, that's
not cool. <br>
<blockquote type="cite"
cite="mid:B99091A3-0AFB-4876-9CDB-A8A24835547D@gmail.com">
<br>
My current jobs are compiled MATLAB and custom Python. Hadoop is
coming back. Can't run Hadoop on the same nodes as the others since it
assumes full system control, and there's not enough demand yet for a
dedicated Hadoop stack. So a reboot into an NFS-root Hadoop cluster,
with a temporary "node offline" status in Torque, seems feasible: use
pre-run scripts in Torque to reboot into Hadoop mode and post-run
scripts to reboot back to normal cluster mode.<br>
</blockquote>
Sounds reasonable. <br>
<blockquote type="cite"
cite="mid:B99091A3-0AFB-4876-9CDB-A8A24835547D@gmail.com"><br>
<div class="gmail_quote">On September 7, 2018 11:54:04 PM EDT,
Jeff Hubbs via Ale <a class="moz-txt-link-rfc2396E" href="mailto:ale@ale.org"><ale@ale.org></a> wrote:
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt
0.8ex; border-left: 1px solid rgb(204, 204, 204);
padding-left: 1ex;">
<div class="moz-cite-prefix">On 9/7/18 4:52 PM, dev null zero
two wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CABmokzAEgNTa99n0uA1xiS5nKC92MxZ5E4xRmmHHTYFaB_Mx0A@mail.gmail.com">
<meta http-equiv="content-type" content="text/html;
charset=utf-8">
<div>
<div dir="auto">that is pretty incredible (never thought
to use Gentoo for this purpose).</div>
</div>
</blockquote>
I use it for everything. :)<br>
<blockquote type="cite"
cite="mid:CABmokzAEgNTa99n0uA1xiS5nKC92MxZ5E4xRmmHHTYFaB_Mx0A@mail.gmail.com">
<div dir="auto"><br>
</div>
<div dir="auto">have you thought about using orchestration
tools for this (Kubernetes etc.)?</div>
</blockquote>
I have been immersed in enough IRC/forum/StackExchange traffic to know
that people do this, but thus far I haven't seen a need, beyond the
simple crafting of a single readily replicable Linux instance, that
justifies the added complexity. I can make changes to that instance in
a chroot on the edge node and then reboot the workers and any other
daemon-running machines (to facilitate this, I set the machines to
boot to disk first and PXE second, and I have a script that "breaks"
the first drive on each worker node by overwriting the GPT partition
table and forcing a reboot). I can also still "fan" changes via ssh
across all nodes, serially or simultaneously, and I still have the NFS
mechanism available to me (right now it handles only the Hadoop
workers file). <br>
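The "breaking" step itself is nothing exotic; roughly this, with the
device name and worker-list location being placeholders rather than my
actual script:<br>
<pre>
# wipe the GPT so the disk no longer boots; on reboot the BIOS falls
# through to PXE and the node reinstalls itself
for n in $(cat /mnt/hadoop-nfs/workers); do
    ssh root@"$n" 'sgdisk --zap-all /dev/sda; reboot'
done
</pre>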
<br>
By the way, I've done this thing where the node setup script
opens the optical media tray, which then automatically closes
after a few minutes when the machine reboots. The on-disk
instance is set to open and immediately close the tray at
boot, so if all the machines have optical drives with trays
that can open and close on command there will be quite an
entertaining racket when a whole cluster starts up. <br>
<br>
Hey, Jim/Aaron - just think; I could have turned Sutton Middle
School into a 500-node Hadoop cluster! There's a 24-seat lab
at work that'd be good for about 3.3TiB of HDFS.<br>
<blockquote type="cite"
cite="mid:CABmokzAEgNTa99n0uA1xiS5nKC92MxZ5E4xRmmHHTYFaB_Mx0A@mail.gmail.com">
<div><br>
<div class="gmail_quote">
<div dir="ltr">On Fri, Sep 7, 2018 at 4:46 PM Jeff Hubbs
via Ale <<a href="mailto:ale@ale.org"
moz-do-not-send="true">ale@ale.org</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
<p>For the past few months, I've been operating an
Apache Hadoop cluster at Emory University's
Goizueta Business School. That cluster is
Gentoo-Linux-based and consists of a dual-homed
"edge node" and three 16-GiB-RAM 16-thread
two-disk "worker" nodes. The edge node provides
NAT for the active cluster nodes and holds a
complete mirror of the Gentoo package repository
that is updated nightly. There is also an
auxiliary edge node (a one-piece Dell Vostro 320)
with xorg and xfce that I mostly use to display
exported instances of xosview from all of the
other nodes so that I can keep an eye on the
cluster's operation. Each of the worker nodes
carries a standalone Gentoo Linux instance that
was flown in via rsync from another node while
booted to a liveCD-style distribution
(SystemRescueCD, which happens to be
Gentoo-based). <br>
</p>
<p>I have since set up the main edge node to form a
"shadow cluster" in addition to the one I've been
operating. Via iPXE and dnsmasq on the edge node,
any x86_64 system that is connected to the
internal cluster network and allowed to PXE-boot
will download a stripped-down Gentoo instance via
HTTP (served up by nginx), boot to this instance
in RAM, and execute a bash script that finds,
partitions, and formats all of that system's
disks, downloads and writes to those disks a
complete Gentoo Linux instance, installs and
configures the GRUB bootloader, sets a hostname
based on the system's first NIC's MAC address, and
reboots the system into that freshly-written
instance. <br>
</p>
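<p>The dnsmasq side of that chain is only a few lines; something like
this (interface, addresses, and filenames here are placeholders, not
my actual config):</p>
<pre>
# /etc/dnsmasq.conf (sketch)
interface=eth1                        # cluster-facing NIC
dhcp-range=10.0.0.100,10.0.0.250,12h
enable-tftp
tftp-root=/srv/tftp
# plain PXE firmware gets chainloaded into iPXE; iPXE (it sets DHCP
# option 175) gets pointed at a boot script served by nginx
dhcp-match=set:ipxe,175
dhcp-boot=tag:!ipxe,undionly.kpxe
dhcp-boot=tag:ipxe,http://10.0.0.1/boot.ipxe
</pre>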
<p>At present, there is only one read/write NFS
export on the edge node and it holds a flat file
that Hadoop uses as a list of available worker
nodes. The list is populated by the aforementioned
node setup script after the hostname is generated.</p>
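<p>The tail end of the setup script does something along these lines
(interface name, mount point, and naming scheme here are assumptions,
not the literal code):</p>
<pre>
# derive the hostname from the first NIC's MAC, then register the node
MAC=$(cat /sys/class/net/eth0/address)
HOST="worker-${MAC//:/}"
echo "$HOST" > /etc/hostname
hostname "$HOST"
# append ourselves to the NFS-exported Hadoop workers file
mount -t nfs edge:/export/hadoop /mnt/hadoop-nfs
grep -qx "$HOST" /mnt/hadoop-nfs/workers || echo "$HOST" >> /mnt/hadoop-nfs/workers
</pre>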
<p>Both the PXE-booted Gentoo Linux instance and the
on-disk instance are managed within a chroot on
the edge node in a manner not unlike how Gentoo
Linux is conventionally installed on a system.
Once set up as desired, these instances are
compressed into separate squashfs files and placed
in the nginx doc root. In the case of the
PXE-booted instance, there is an intermediate step
where much of the instance is stripped away just
to reduce the size of the squashfs file, which is
currently 431MiB. The full cluster node
distribution file is 1.6GiB but I sometimes
exclude the kernel source tree and local package
meta-repository to bring it down to 1.1GiB. The
on-disk footprint of the complete worker node
instance is 5.9GiB.</p>
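<p>Rolling an instance out of its chroot and into the nginx doc root
amounts to a single mksquashfs call; something like this, with the
paths being examples rather than my actual layout:</p>
<pre>
# squash the maintained chroot into the web root; the -e list drops
# the kernel sources and the package tree for the smaller variant
mksquashfs /srv/chroots/worker \
    /var/www/localhost/htdocs/worker.squashfs \
    -comp xz -noappend -e usr/src/linux usr/portage
</pre>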
<p>The node setup script takes the first drive it
finds and GPT-partitions it six ways: 1) a 2MiB
"spacer" for the bootloader; 2) 256MiB for /boot;
3) 32GiB for root; 4) 2xRAM for swap (this is WAY
overkill; it's set by ratio in the script and a
ratio of one or less would suffice); 5) 64GiB for
/tmp/hadoop-yarn (more about this later); 6)
whatever is left for /hdfs1. Any remaining disks
identified are single-partitioned as /hdfs2,
/hdfs3, etc. All partitions are formatted btrfs
with the exception of /boot, which is vfat for
UEFI compatibility (a route I went down because I
have one old laptop I found that was UEFI-only and
I expect that will become more the case than less
over time). A quasi-boolean in the script
optionally enables compression at mount time for
/tmp/hadoop-yarn. <br>
</p>
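<p>In sgdisk terms, the partitioning step looks roughly like this
(device name and the literal swap size are illustrative; the script
actually computes swap from RAM):</p>
<pre>
DISK=/dev/sda
sgdisk --zap-all "$DISK"
sgdisk -n1:0:+2M   -t1:ef02 "$DISK"   # bootloader spacer
sgdisk -n2:0:+256M -t2:ef00 "$DISK"   # /boot (vfat)
sgdisk -n3:0:+32G  -t3:8300 "$DISK"   # root
sgdisk -n4:0:+32G  -t4:8200 "$DISK"   # swap
sgdisk -n5:0:+64G  -t5:8300 "$DISK"   # /tmp/hadoop-yarn
sgdisk -n6:0:0     -t6:8300 "$DISK"   # /hdfs1 takes the rest
mkfs.vfat -F 32 "${DISK}2"
mkswap "${DISK}4"
for p in 3 5 6; do mkfs.btrfs -f "${DISK}${p}"; done
</pre>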
<p>One of Gentoo Linux's strengths is the ability to
compile software specifically for the CPU but the
node instance is set up with the gcc option
-mtune=generic. Another quasi-boolean setting in
the node setup script will change that to -march=native, but that
change only takes effect when packages are built or rebuilt locally
(as opposed to in the chroot on the edge node, where everything must
be built generic). I can couple
this feature with another feature to optionally
rebuild all the system's binaries native but
that's an operation that would take a fair bit of
time (that's over 500 packages and only some of
them would affect cluster operation). Similarly,
in the interest of run-what-ya-brung flexibility,
I'm using Gentoo's genkernel utility to generate a
kernel and initrd befitting a liveCD-style
instance that will boot on basically any x86-64
along with whatever NICs and disk controllers it
finds. <br>
</p>
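<p>Flipping a node to native-tuned builds is just a make.conf edit
plus an optional full rebuild; a sketch of what those quasi-booleans
boil down to (assuming the stock make.conf location):</p>
<pre>
# switch the compiler tuning, then rebuild everything locally
sed -i 's/-mtune=generic/-march=native/' /etc/portage/make.conf
emerge --emptytree @world    # the optional 500-odd-package rebuild; takes a while
</pre>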
<p>I am using the Hadoop binary distribution
(currently 3.1.1) as distributed directly by
Apache (no HortonWorks; no Cloudera). Each cluster
node has its own Hadoop distribution and each
node's Hadoop distribution has configuration
features both in common and specific to that node,
modified in place by the node setup script. In the
latter case, the amount of available RAM, the
number of available CPU threads, and the list of
available HDFS partitions on a system are flown
into the proper local config files. Hadoop
services run in a Java VM; I am currently using
the IcedTea 3.8.0 source distribution supplied
within Gentoo's packaging system. I have also run
it under the IcedTea binary distribution and the
Oracle JVM with equal success. <br>
</p>
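<p>The "flying in" is mostly just text substitution; conceptually
something like this, where the @TOKENS@ are my own placeholders in the
config templates (the real properties involved are
yarn.nodemanager.resource.memory-mb, yarn.nodemanager.resource.cpu-vcores,
and dfs.datanode.data.dir):</p>
<pre>
# compute this node's resources and stamp them into the local configs
MEM_MB=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) / 1024 ))
VCORES=$(nproc)
HDFS_DIRS=$(ls -d /hdfs* | paste -sd, -)
cd "$HADOOP_HOME/etc/hadoop"
sed -i "s|@MEM_MB@|$MEM_MB|"       yarn-site.xml
sed -i "s|@VCORES@|$VCORES|"       yarn-site.xml
sed -i "s|@HDFS_DIRS@|$HDFS_DIRS|" hdfs-site.xml
</pre>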
<p>Hadoop has three primary constructs that make it
up. HDFS (Hadoop Distributed File System) consists
of a NameNode daemon that runs on a single machine
and controls the filesystem namespace and user
access to it; DataNode daemons run on each worker
node and coordinate between the NameNode daemon
and the local machine's on-disk filesystem. You
access the filesystem with command-line-like
options to the hdfs binary like -put, -get, -ls,
-mkdir, etc. but in the on-disk filesystem
underneath /hdfs1.../hdfsN, the files you write
are cut up into "blocks" (default size: 128MiB)
and those blocks are replicated (default: three
times) among all the worker nodes. My initial
cluster with standalone workers reported 7.2TiB of
HDFS available spread across six physical
spindles. As you can imagine, it's possible to
accumulate tens of TiB of HDFS across only a
handful of nodes but doing so isn't necessarily
helpful. <br>
</p>
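<p>For a feel of the day-to-day interaction, and for watching the
block splitting and replication happen (file and user names here are
arbitrary):</p>
<pre>
hdfs dfs -mkdir -p /user/jeff
hdfs dfs -put big-input.csv /user/jeff/
hdfs dfs -ls /user/jeff
# show how the file got cut into blocks and where the replicas landed
hdfs fsck /user/jeff/big-input.csv -files -blocks -locations
</pre>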
<p>YARN (Yet Another Resource Negotiator) is the
construct that manages the execution of work among
the nodes. Part of the whole point behind Hadoop
is to <i>move the processing to where the data is
</i>and it's YARN that coordinates all that. It
consists of a ResourceManager daemon that
communicates with all the worker nodes and
NodeManager daemons that run on each of the worker
nodes. You can run the ResourceManager daemon and
HDFS's NameNode daemon on machines that also act as worker nodes, but
past a certain size you won't want to, and past <i>that</i> point
you'd want to run the NameNode and the ResourceManager on two separate
machines. In that regime, you'd have two
machines dedicated to those roles (their names
would be taken out of the centrally-located
workers file) and the rest would run both the
DataNode and NodeManager daemons, forming the HDFS
storage subsystem and the YARN execution
subsystem.</p>
<p>There is another construct, MapReduce, whose
architecture I don't fully understand yet; it
comes into play as a later phase in Hadoop
computations and there is a JobHistoryServer
daemon associated with it.</p>
<p>Another place where the bridge is out with
respect to my understanding of Hadoop is coding
for it - but I'll get there eventually. There are
other apps, like Apache's Spark and Hive, that use HDFS and/or YARN
and that I have better mental insight into, and I have successfully
gotten Python/Spark demo programs to run on YARN in my cluster. <br>
</p>
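<p>The stock pi-estimation demo that ships with Spark is the sort of
thing I mean; pointed at YARN it looks like this (SPARK_HOME being
wherever the Spark distribution got unpacked):</p>
<pre>
spark-submit --master yarn --deploy-mode client \
    "$SPARK_HOME/examples/src/main/python/pi.py" 100
</pre>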
<p>One thing I have learned is that Hadoop clusters
do not "genericize" well. When I first tried
running the Hadoop-supplied teragen/terasort
example (goal: make a file of 10^10 100-character
lines and sort it), it failed for want of space
available in /tmp/hadoop-yarn but it ran perfectly
when the file was cut down to 1/100th that size.
For my PXE-boot-based cluster, I gave my worker
nodes a separate partition for /tmp/hadoop-yarn
and gave it optional transparent compression.
There are a lot of parameters - things like the minimum size and
minimum size increment of memory containers, plus JVM settings - that
I haven't messed with, but to optimize the cluster for a given job one
would expect to.<br>
</p>
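<p>For anyone who wants to reproduce that experiment, the examples jar
ships with the Hadoop distribution; the cut-down run looks like this
(output paths are arbitrary):</p>
<pre>
# 10^8 rows of 100 bytes (about 10GB) instead of the full 10^10
EXAMPLES=$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar
hadoop jar "$EXAMPLES" teragen 100000000 /teraIn
hadoop jar "$EXAMPLES" terasort /teraIn /teraOut
hadoop jar "$EXAMPLES" teravalidate /teraOut /teraReport
</pre>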
<p>What I have right now - basically, a single
Gentoo Linux instance for installation on a
dual-homed edge node - is able to generate a
working Hadoop cluster with an arbitrary number of
nodes, limited primarily by space, cooling, and
electric power (the Dell Optiplex desktops I'm
using right now max out at about an amp, so you
have to be prepared to supply at least N amps for
N nodes). They can be purpose-built rack-mount
servers, a lab environment full of thin clients,
or wire shelf units full of discarded desktops and
laptops. <br>
</p>
<p>- Jeff<br>
</p>
<p><br>
</p>
</div>
_______________________________________________<br>
Ale mailing list<br>
<a href="mailto:Ale@ale.org" target="_blank"
moz-do-not-send="true">Ale@ale.org</a><br>
<a href="https://mail.ale.org/mailman/listinfo/ale"
rel="noreferrer" target="_blank"
moz-do-not-send="true">https://mail.ale.org/mailman/listinfo/ale</a><br>
See JOBS, ANNOUNCE and SCHOOLS lists at<br>
<a href="http://mail.ale.org/mailman/listinfo"
rel="noreferrer" target="_blank"
moz-do-not-send="true">http://mail.ale.org/mailman/listinfo</a><br>
</blockquote>
</div>
</div>
-- <br>
<div dir="ltr" class="gmail_signature"
data-smartmail="gmail_signature">Sent from my mobile.
Please excuse the brevity, spelling, and punctuation.</div>
</blockquote>
<p><br>
</p>
</blockquote>
</div>
<br>
-- <br>
Sent from my Android device with K-9 Mail. All tyopes are thumb
related and reflect authenticity.
</blockquote>
<p><br>
</p>
</body>
</html>