<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 9/9/18 9:35 AM, Jim Kinney wrote:<br>
</div>
<blockquote type="cite"
cite="mid:B99091A3-0AFB-4876-9CDB-A8A24835547D@gmail.com">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Ha!!<br>
<br>
My ultimate plan with the APS project was to cobble all the "waste
machines" together into a system-wide distributed storage cluster,
then run a single-system-image cluster tool like MOSIX or
Kerrighed. That would make the entire school district a
single-instance giant supercomputer that K12 students use :-)<br>
</blockquote>
I haven't had occasion to use this yet, but Hadoop has a "rack
awareness" feature: you can divide your nodes into "racks" so that the
NameNode and ResourceManager expect network bottlenecks between groups
of nodes and place storage and processing accordingly; in the APS
environment one would group each school into a "rack" (no idea if
Hadoop supports "racks" of "racks").<br>
<blockquote type="cite"
cite="mid:B99091A3-0AFB-4876-9CDB-A8A24835547D@gmail.com">
<br>
Excellent work on the hadoop cluster!<br>
</blockquote>
Thanks! It's been interesting and I've covered a few things along
the way I'd long wanted to be able to do (like PXE-booting), so,
bonus.<br>
<blockquote type="cite"
cite="mid:B99091A3-0AFB-4876-9CDB-A8A24835547D@gmail.com">
<br>
I have my cluster nodes set to always PXE-boot, and the PXE boot
default is to fall back to the local boot drive. That way I can drive
a new install/rebuild by twiddling a file on the DHCP server and
rebooting a node. Eventually, the nodes will use no local storage for
the OS; all of it will be reserved for /tmp (RAID0 across all drives
for speed) and they'll use an NFS-mounted root and remote logging.
Basically a homegrown PaaS setup controlled by job-submission-defined
need. <br>
</blockquote>
I thought about running the working instance out of RAM and leaving
everything NFS-based, but I decided against it. For one thing, Hadoop
activity alone hammers the LAN; if a lot of additional traffic (the
Hadoop binary distribution alone is about 830MiB) has to concentrate
on, say, the edge node where the worker-node instance actually lives
while there's a lot of HDFS traffic shooting from node to node, that's
not cool. <br>
<blockquote type="cite"
cite="mid:B99091A3-0AFB-4876-9CDB-A8A24835547D@gmail.com">
<br>
My current jobs are compiled MATLAB and custom Python. Hadoop is
coming back. Can't run Hadoop on the same nodes as the others since it
assumes full system control, and there's not enough demand yet for a
dedicated Hadoop stack. So a reboot into an NFS-root Hadoop cluster,
with a temporary "node offline" status in Torque, seems feasible: use
pre-run scripts in Torque to reboot into Hadoop mode and post-run
scripts to reboot back to normal cluster mode.<br>
</blockquote>
Sounds reasonable. <br>
<blockquote type="cite"
cite="mid:B99091A3-0AFB-4876-9CDB-A8A24835547D@gmail.com"><br>
<div class="gmail_quote">On September 7, 2018 11:54:04 PM EDT,
Jeff Hubbs via Ale <a class="moz-txt-link-rfc2396E" href="mailto:ale@ale.org"><ale@ale.org></a> wrote:
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt
0.8ex; border-left: 1px solid rgb(204, 204, 204);
padding-left: 1ex;">
<div class="moz-cite-prefix">On 9/7/18 4:52 PM, dev null zero
two wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CABmokzAEgNTa99n0uA1xiS5nKC92MxZ5E4xRmmHHTYFaB_Mx0A@mail.gmail.com">
<meta http-equiv="content-type" content="text/html;
charset=utf-8">
<div>
<div dir="auto">that is pretty incredible (never thought
to use Gentoo for this purpose).</div>
</div>
</blockquote>
I use it for everything. :)<br>
<blockquote type="cite"
cite="mid:CABmokzAEgNTa99n0uA1xiS5nKC92MxZ5E4xRmmHHTYFaB_Mx0A@mail.gmail.com">
<div dir="auto"><br>
</div>
<div dir="auto">have you thought about using orchestration
tools for this (Kubernetes etc.)?</div>
</blockquote>
I have been immersed in enough IRC/forum/StackExchange traffic to know
that people do this, but thus far I haven't seen a need, beyond the
simple crafting of a single readily replicable Linux instance, that
justifies the added complexity. I can make changes to that instance in
a chroot on the edge node and then reboot the workers and any other
daemon-running machines (to facilitate this, I set the machines to
boot to disk first and PXE second, and I have a script that "breaks"
the first drive on each worker node by overwriting the GPT partition
table and forcing a reboot). I can also still "fan" changes via ssh
across all nodes, serially or simultaneously, and I still have the NFS
mechanism available to me (right now it handles only the Hadoop
workers file). <br>
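The "breaking" step itself is nothing exotic; roughly this, with the
device name and worker-list location being placeholders rather than my
actual script:<br>
<pre>
# wipe the GPT so the disk no longer boots; on reboot the BIOS falls
# through to PXE and the node reinstalls itself
for n in $(cat /mnt/hadoop-nfs/workers); do
    ssh root@"$n" 'sgdisk --zap-all /dev/sda; reboot'
done
</pre>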
<br>
By the way, I've done this thing where the node setup script
opens the optical media tray, which then automatically closes
after a few minutes when the machine reboots. The on-disk
instance is set to open and immediately close the tray at
boot, so if all the machines have optical drives with trays
that can open and close on command there will be quite an
entertaining racket when a whole cluster starts up. <br>
<br>
Hey, Jim/Aaron - just think; I could have turned Sutton Middle
School into a 500-node Hadoop cluster! There's a 24-seat lab
at work that'd be good for about 3.3TiB of HDFS.<br>
<blockquote type="cite"
cite="mid:CABmokzAEgNTa99n0uA1xiS5nKC92MxZ5E4xRmmHHTYFaB_Mx0A@mail.gmail.com">
<div><br>
<div class="gmail_quote">
<div dir="ltr">On Fri, Sep 7, 2018 at 4:46 PM Jeff Hubbs
via Ale <<a href="mailto:ale@ale.org"
moz-do-not-send="true">ale@ale.org</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
<p>For the past few months, I've been operating an
Apache Hadoop cluster at Emory University's
Goizueta Business School. That cluster is
Gentoo-Linux-based and consists of a dual-homed
"edge node" and three 16-GiB-RAM 16-thread
two-disk "worker" nodes. The edge node provides
NAT for the active cluster nodes and holds a
complete mirror of the Gentoo package repository
that is updated nightly. There is also an
auxiliary edge node (a one-piece Dell Vostro 320)
with xorg and xfce that I mostly use to display
exported instances of xosview from all of the
other nodes so that I can keep an eye on the
cluster's operation. Each of the worker nodes
carries a standalone Gentoo Linux instance that
was flown in via rsync from another node while
booted to a liveCD-style distribution
(SystemRescueCD, which happens to be
Gentoo-based). <br>
</p>
<p>I have since set up the main edge node to form a
"shadow cluster" in addition to the one I've been
operating. Via iPXE and dnsmasq on the edge node,
any x86_64 system that is connected to the
internal cluster network and allowed to PXE-boot
will download a stripped-down Gentoo instance via
HTTP (served up by nginx), boot to this instance
in RAM, and execute a bash script that finds,
partitions, and formats all of that system's
disks, downloads and writes to those disks a
complete Gentoo Linux instance, installs and
configures the GRUB bootloader, sets a hostname
based on the system's first NIC's MAC address, and
reboots the system into that freshly-written
instance. <br>
</p>
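<p>The dnsmasq side of that chain is only a few lines; something like
this (interface, addresses, and filenames here are placeholders, not
my actual config):</p>
<pre>
# /etc/dnsmasq.conf (sketch)
interface=eth1                        # cluster-facing NIC
dhcp-range=10.0.0.100,10.0.0.250,12h
enable-tftp
tftp-root=/srv/tftp
# plain PXE firmware gets chainloaded into iPXE; iPXE (it sets DHCP
# option 175) gets pointed at a boot script served by nginx
dhcp-match=set:ipxe,175
dhcp-boot=tag:!ipxe,undionly.kpxe
dhcp-boot=tag:ipxe,http://10.0.0.1/boot.ipxe
</pre>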
<p>At present, there is only one read/write NFS
export on the edge node and it holds a flat file
that Hadoop uses as a list of available worker
nodes. The list is populated by the aforementioned
node setup script after the hostname is generated.</p>
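<p>The tail end of the setup script does something along these lines
(interface name, mount point, and naming scheme here are assumptions,
not the literal code):</p>
<pre>
# derive the hostname from the first NIC's MAC, then register the node
MAC=$(cat /sys/class/net/eth0/address)
HOST="worker-${MAC//:/}"
echo "$HOST" > /etc/hostname
hostname "$HOST"
# append ourselves to the NFS-exported Hadoop workers file
mount -t nfs edge:/export/hadoop /mnt/hadoop-nfs
grep -qx "$HOST" /mnt/hadoop-nfs/workers || echo "$HOST" >> /mnt/hadoop-nfs/workers
</pre>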
<p>Both the PXE-booted Gentoo Linux instance and the
on-disk instance are managed within a chroot on
the edge node in a manner not unlike how Gentoo
Linux is conventionally installed on a system.
Once set up as desired, these instances are
compressed into separate squashfs files and placed
in the nginx doc root. In the case of the
PXE-booted instance, there is an intermediate step
where much of the instance is stripped away just
to reduce the size of the squashfs file, which is
currently 431MiB. The full cluster node
distribution file is 1.6GiB but I sometimes
exclude the kernel source tree and local package
meta-repository to bring it down to 1.1GiB. The
on-disk footprint of the complete worker node
instance is 5.9GiB.</p>
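<p>Rolling an instance out of its chroot and into the nginx doc root
amounts to a single mksquashfs call; something like this, with the
paths being examples rather than my actual layout:</p>
<pre>
# squash the maintained chroot into the web root; the -e list drops
# the kernel sources and the package tree for the smaller variant
mksquashfs /srv/chroots/worker \
    /var/www/localhost/htdocs/worker.squashfs \
    -comp xz -noappend -e usr/src/linux usr/portage
</pre>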
<p>The node setup script takes the first drive it
finds and GPT-partitions it six ways: 1) a 2MiB
"spacer" for the bootloader; 2) 256MiB for /boot;
3) 32GiB for root; 4) 2xRAM for swap (this is WAY
overkill; it's set by ratio in the script and a
ratio of one or less would suffice); 5) 64GiB for
/tmp/hadoop-yarn (more about this later); 6)
whatever is left for /hdfs1. Any remaining disks
identified are single-partitioned as /hdfs2,
/hdfs3, etc. All partitions are formatted btrfs
with the exception of /boot, which is vfat for
UEFI compatibility (a route I went down because I
have one old laptop I found that was UEFI-only and
I expect that will become more the case than less
over time). A quasi-boolean in the script
optionally enables compression at mount time for
/tmp/hadoop-yarn. <br>
</p>
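<p>In sgdisk terms, the partitioning step looks roughly like this
(device name and the literal swap size are illustrative; the script
actually computes swap from RAM):</p>
<pre>
DISK=/dev/sda
sgdisk --zap-all "$DISK"
sgdisk -n1:0:+2M   -t1:ef02 "$DISK"   # bootloader spacer
sgdisk -n2:0:+256M -t2:ef00 "$DISK"   # /boot (vfat)
sgdisk -n3:0:+32G  -t3:8300 "$DISK"   # root
sgdisk -n4:0:+32G  -t4:8200 "$DISK"   # swap
sgdisk -n5:0:+64G  -t5:8300 "$DISK"   # /tmp/hadoop-yarn
sgdisk -n6:0:0     -t6:8300 "$DISK"   # /hdfs1 takes the rest
mkfs.vfat -F 32 "${DISK}2"
mkswap "${DISK}4"
for p in 3 5 6; do mkfs.btrfs -f "${DISK}${p}"; done
</pre>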
<p>One of Gentoo Linux's strengths is the ability to
compile software specifically for the CPU but the
node instance is set up with the gcc option
-mtune=generic. Another quasi-boolean setting in
the node setup script will change that to -march=native, but that
change only takes effect when packages are built or rebuilt locally
(as opposed to in the chroot on the edge node, where everything must
be built generic). I can couple
this feature with another feature to optionally
rebuild all the system's binaries native but
that's an operation that would take a fair bit of
time (that's over 500 packages and only some of
them would affect cluster operation). Similarly,
in the interest of run-what-ya-brung flexibility,
I'm using Gentoo's genkernel utility to generate a
kernel and initrd befitting a liveCD-style
instance that will boot on basically any x86-64
along with whatever NICs and disk controllers it
finds. <br>
</p>
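<p>Flipping a node to native-tuned builds is just a make.conf edit
plus an optional full rebuild; a sketch of what those quasi-booleans
boil down to (assuming the stock make.conf location):</p>
<pre>
# switch the compiler tuning, then rebuild everything locally
sed -i 's/-mtune=generic/-march=native/' /etc/portage/make.conf
emerge --emptytree @world    # the optional 500-odd-package rebuild; takes a while
</pre>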
<p>I am using the Hadoop binary distribution
(currently 3.1.1) as distributed directly by
Apache (no HortonWorks; no Cloudera). Each cluster
node has its own Hadoop distribution and each
node's Hadoop distribution has configuration
features both in common and specific to that node,
modified in place by the node setup script. In the
latter case, the amount of available RAM, the
number of available CPU threads, and the list of
available HDFS partitions on a system are flown
into the proper local config files. Hadoop
services run in a Java VM; I am currently using
the IcedTea 3.8.0 source distribution supplied
within Gentoo's packaging system. I have also run
it under the IcedTea binary distribution and the
Oracle JVM with equal success. <br>
</p>
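<p>The "flying in" is mostly just text substitution; conceptually
something like this, where the @TOKENS@ are my own placeholders in the
config templates (the real properties involved are
yarn.nodemanager.resource.memory-mb, yarn.nodemanager.resource.cpu-vcores,
and dfs.datanode.data.dir):</p>
<pre>
# compute this node's resources and stamp them into the local configs
MEM_MB=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) / 1024 ))
VCORES=$(nproc)
HDFS_DIRS=$(ls -d /hdfs* | paste -sd, -)
cd "$HADOOP_HOME/etc/hadoop"
sed -i "s|@MEM_MB@|$MEM_MB|"       yarn-site.xml
sed -i "s|@VCORES@|$VCORES|"       yarn-site.xml
sed -i "s|@HDFS_DIRS@|$HDFS_DIRS|" hdfs-site.xml
</pre>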
<p>Hadoop has three primary constructs that make it
up. HDFS (Hadoop Distributed File System) consists
of a NameNode daemon that runs on a single machine
and controls the filesystem namespace and user
access to it; DataNode daemons run on each worker
node and coordinate between the NameNode daemon
and the local machine's on-disk filesystem. You
access the filesystem with command-line-like
options to the hdfs binary like -put, -get, -ls,
-mkdir, etc. but in the on-disk filesystem
underneath /hdfs1.../hdfsN, the files you write
are cut up into "blocks" (default size: 128MiB)
and those blocks are replicated (default: three
times) among all the worker nodes. My initial
cluster with standalone workers reported 7.2TiB of
HDFS available spread across six physical
spindles. As you can imagine, it's possible to
accumulate tens of TiB of HDFS across only a
handful of nodes but doing so isn't necessarily
helpful. <br>
</p>
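<p>For a feel of the day-to-day interaction, and for watching the
block splitting and replication happen (file and user names here are
arbitrary):</p>
<pre>
hdfs dfs -mkdir -p /user/jeff
hdfs dfs -put big-input.csv /user/jeff/
hdfs dfs -ls /user/jeff
# show how the file got cut into blocks and where the replicas landed
hdfs fsck /user/jeff/big-input.csv -files -blocks -locations
</pre>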
<p>YARN (Yet Another Resource Negotiator) is the
construct that manages the execution of work among
the nodes. Part of the whole point behind Hadoop
is to <i>move the processing to where the data is
</i>and it's YARN that coordinates all that. It
consists of a ResourceManager daemon that
communicates with all the worker nodes and
NodeManager daemons that run on each of the worker
nodes. You can run the ResourceManager daemon and
HDFS's NameNode daemon on machines that also act as worker nodes, but
past a certain size you won't want to, and past <i>that</i> point
you'd want to run the NameNode and the ResourceManager on two separate
machines. In that regime, you'd have two
machines dedicated to those roles (their names
would be taken out of the centrally-located
workers file) and the rest would run both the
DataNode and NodeManager daemons, forming the HDFS
storage subsystem and the YARN execution
subsystem.</p>
<p>There is another construct, MapReduce, whose
architecture I don't fully understand yet; it
comes into play as a later phase in Hadoop
computations and there is a JobHistoryServer
daemon associated with it.</p>
<p>Another place where the bridge is out with
respect to my understanding of Hadoop is coding
for it - but I'll get there eventually. There are
other apps, like Apache's Spark and Hive, that use HDFS and/or YARN
and that I have better mental insight into, and I have successfully
gotten Python/Spark demo programs to run on YARN in my cluster. <br>
</p>
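<p>The stock pi-estimation demo that ships with Spark is the sort of
thing I mean; pointed at YARN it looks like this (SPARK_HOME being
wherever the Spark distribution got unpacked):</p>
<pre>
spark-submit --master yarn --deploy-mode client \
    "$SPARK_HOME/examples/src/main/python/pi.py" 100
</pre>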
<p>One thing I have learned is that Hadoop clusters
do not "genericize" well. When I first tried
running the Hadoop-supplied teragen/terasort
example (goal: make a file of 10^10 100-character
lines and sort it), it failed for want of space
available in /tmp/hadoop-yarn but it ran perfectly
when the file was cut down to 1/100th that size.
For my PXE-boot-based cluster, I gave my worker
nodes a separate partition for /tmp/hadoop-yarn
and gave it optional transparent compression.
There are a lot of parameters - things like the minimum size and
minimum size increment of memory containers, plus JVM settings - that
I haven't messed with, but to optimize the cluster for a given job one
would expect to.<br>
</p>
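<p>For anyone who wants to reproduce that experiment, the examples jar
ships with the Hadoop distribution; the cut-down run looks like this
(output paths are arbitrary):</p>
<pre>
# 10^8 rows of 100 bytes (about 10GB) instead of the full 10^10
EXAMPLES=$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar
hadoop jar "$EXAMPLES" teragen 100000000 /teraIn
hadoop jar "$EXAMPLES" terasort /teraIn /teraOut
hadoop jar "$EXAMPLES" teravalidate /teraOut /teraReport
</pre>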
<p>What I have right now - basically, a single
Gentoo Linux instance for installation on a
dual-homed edge node - is able to generate a
working Hadoop cluster with an arbitrary number of
nodes, limited primarily by space, cooling, and
electric power (the Dell Optiplex desktops I'm
using right now max out at about an amp, so you
have to be prepared to supply at least N amps for
N nodes). They can be purpose-built rack-mount
servers, a lab environment full of thin clients,
or wire shelf units full of discarded desktops and
laptops. <br>
</p>
<p>- Jeff<br>
</p>
<p><br>
</p>
</div>
_______________________________________________<br>
Ale mailing list<br>
<a href="mailto:Ale@ale.org" target="_blank"
moz-do-not-send="true">Ale@ale.org</a><br>
<a href="https://mail.ale.org/mailman/listinfo/ale"
rel="noreferrer" target="_blank"
moz-do-not-send="true">https://mail.ale.org/mailman/listinfo/ale</a><br>
See JOBS, ANNOUNCE and SCHOOLS lists at<br>
<a href="http://mail.ale.org/mailman/listinfo"
rel="noreferrer" target="_blank"
moz-do-not-send="true">http://mail.ale.org/mailman/listinfo</a><br>
</blockquote>
</div>
</div>
-- <br>
<div dir="ltr" class="gmail_signature"
data-smartmail="gmail_signature">Sent from my mobile.
Please excuse the brevity, spelling, and punctuation.</div>
</blockquote>
<p><br>
</p>
</blockquote>
</div>
<br>
-- <br>
Sent from my Android device with K-9 Mail. All tyopes are thumb
related and reflect authenticity.
</blockquote>
<p><br>
</p>
</body>
</html>