[ale] Linux Cluster Server Room

Jonathan Glass IBB jonathan.glass at ibb.gatech.edu
Tue Apr 20 07:53:14 EDT 2004


Well, in this case we're lucky.  We have two backup generators that kick
in within 10 seconds.

Thanks

Jonathan

On Tue, 2004-04-20 at 07:11, Jeffrey B. Layton wrote:
> Well, my response is - it depends. How long is long? How important
> is it to you? Can you checkpoint or modify the code to checkpoint?
> Unfortunately, there are questions you have to answer. However, let
> me give you some things I think about.
> 
> We run CFD codes (Computational Fluid Dynamics) to explore
> fluid flow over and in aircraft. The runs can last up to about 48
> hours. Our codes checkpoint themselves, so if we lose the nodes
> (or a node since we're running MPI codes), we just back up to the
> last checkpoint. Not a big deal. However, if we didn't checkpoint,
> I would think about it a bit. 48 hours is a long time. If the cluster
> dies at 47:59 I would be very upset. However, if we're running
> on a cluster with 256 nodes with UPS and if getting rid of UPS
> means I can get 60 more nodes, then perhaps I could just run my
> job on more nodes and get done faster (reducing the window
> of vulnerability if you will).
> 
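
The trade-off above (drop the UPS, buy 60 more nodes, shrink the window of vulnerability) can be put in rough numbers. This is only a sketch: it assumes perfect linear scaling, which real MPI codes rarely achieve, and the function name is made up for illustration.

```python
def ideal_runtime_hours(base_hours, base_nodes, new_nodes):
    """Runtime under perfect linear scaling (an optimistic assumption)."""
    return base_hours * base_nodes / new_nodes

base = 48.0                                   # hours on the 256-node cluster
faster = ideal_runtime_hours(base, 256, 256 + 60)
print(f"{faster:.1f} h on 316 nodes, {base - faster:.1f} h less exposure")
# prints: 38.9 h on 316 nodes, 9.1 h less exposure
```

With anything less than ideal scaling the saving is smaller, so the real speedup of the code in question should be measured before trading the UPS for nodes.
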
> You also need to think about how long the UPSes will last. If you
> need to run 48 hours and the UPS kicks in at about 24 hours, will
> the UPS last 24 hours? If not, you will lose the job anyway (with
> no checkpointing) unless you get some really big UPSes. So in this
> case, UPS won't help much. However, it would help if you were
> only a few minutes away from completing a computation and
> just needed to finish (if it's a long run, the odds are this scenario
> won't happen often). If you could just touch a file and have your
> code recognize this so it could quickly checkpoint, then a UPS
> might be worth it (some of our codes do this).
> 
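
The "touch a file" trick mentioned above might look something like the sketch below. It is not the CFD codes' actual mechanism, just a minimal illustration: the file names CHECKPOINT_NOW and state.pkl, the toy "solver" loop, and the choice to stop after an on-demand checkpoint are all assumptions.

```python
import os
import pickle
import tempfile

def save_checkpoint(state, path):
    """Write state atomically: dump to a temp file, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def run(total_steps, trigger="CHECKPOINT_NOW", ckpt="state.pkl", every=1000):
    """Toy solver loop: checkpoint periodically, or at once if trigger exists."""
    state = {"step": 0, "value": 0.0}
    if os.path.exists(ckpt):              # resume from the last checkpoint
        with open(ckpt, "rb") as f:
            state = pickle.load(f)
    while state["step"] < total_steps:
        state["value"] += 1.0             # stand-in for one solver iteration
        state["step"] += 1
        requested = os.path.exists(trigger)
        if requested or state["step"] % every == 0:
            save_checkpoint(state, ckpt)
        if requested:
            os.remove(trigger)            # acknowledge the request
            break                         # stop cleanly while the UPS holds
    return state
```

When the UPS reports it is on battery, a monitoring script (not shown) would run `touch CHECKPOINT_NOW` in the job's working directory; the job saves its state at the next iteration and exits, and a later call to run() picks up from state.pkl.
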
> Unfortunately, there is no easy answer. You need to figure out
> the answers yourself :)
> 
> Good Luck!
> 
> Jeff
> 
> P.S. Dow - notice my address change. You can talk to me off
> line if you want.
> 
> > I understand your philosophy here but have a question.  What if the 
> > calculations are long and costly to restart?  Shouldn't I look at the 
> > value of spent computation that might have to be done over if I lose 
> > power?  The code I am most concerned about running on the cluster may 
> > or may not be checkpointable.  I think it might be, but I know my 
> > users and they won't want power to be an issue with predicting when 
> > their jobs will finish. ;-)
> >
> > Are Best UPSes better performing than Tripp Lite or APC?  I have 
> > experience with Tripp Lite, APC, and Liebert so far and have never used 
> > Best.  I like the toughness and quality of the enclosure of the APC 
> > and Liebert.  I like the quality of all three.  I like the performance 
> > and cost of APC and Tripp Lite.  Tripp Lite's cases or enclosures on the 
> > low end aren't as nice as APC's, but when you get to the high-end UPSes 
> > they have nice rack enclosures.  Performance-wise, I haven't been able to 
> > tell a difference between the two.  Heat production leans toward APC 
> > producing less overall.
> >
> > What do you mean by getting the wrong power factor conversion? Do you 
> > mean getting 120v at 60Hz vs 220v at 60Hz on the output outlets?
> >
> > I appreciate all this advice!
> > Dow
> >
> >
> >
> > Jeffrey B. Layton wrote:
> >
> >> I'll give you my 2 cents about clusters and UPSes if you wish.
> >>
> >> A good cluster configuration will treat each compute node as
> >> an appliance. You don't really care about it too much and it
> >> doesn't hold any data of any importance. What you care about
> >> is the master node and/or where the data is stored. These
> >> machines can have their own UPS or a single UPS to cover
> >> the machines (there may be more than one). Then take the cost
> >> savings (if you can) and put them into more nodes, or a better
> >> interconnect (if needed), or a large file system, or a better
> >> backup system, or .... well, you get the picture.
> >>
> >> Thinking of only putting a UPS on the important parts of the
> >> cluster will save you money, time, and headaches. However,
> >> if you put a cluster in a server room you can have all power
> >> covered by a single huge UPS and probably a diesel backup
> >> generator as well. This goes back to the purpose of a server
> >> room - to support independent servers, not clusters. While this
> >> is nice and good, it is somewhat wasteful. If you could have
> >> a combination of UPS/Diesel backed power and just regular
> >> conditioned power, that would be more economical. However,
> >> the budgets for clusters (computing) and the budget for facilities
> >> are never really seen as related by management. Even though
> >> they come out of the same overall pot within the company (or
> >> university), management has a tendency to compartmentalize
> >> things for easy managing (and the definite lack of brain power
> >> on the part of most managers). Try arguing that you really
> >> don't need the giant UPS/Diesel combo and you will get IT
> >> managers screaming all sorts of things about you. Sigh.
> >>
> >> Of course, these comments depend on your cluster configuration.
> >> If you are running a global filesystem across all of the nodes,
> >> so that each node has part of the filesystem, then you might
> >> want to think about a good UPS for all of the nodes (try
> >> restoring a 20 TB global filesystem from backup after a
> >> power outage).
> >>
> >> Good Luck!
> >>
> >> Jeff
> >>
> >>> What type of UPS system are you using? Do most install a large UPS 
> >>> system for the entire server room? If so, how much will this cost?
> >>>
> >>> Thanks,
> >>> Chris
> >>>
> >>> -----Original Message-----
> >>> From: Dow Hurst [mailto:dhurst at kennesaw.edu]
> >>> Sent: Monday, April 12, 2004 11:20 AM
> >>> To: ale
> >>> Subject: Re: [ale] Linux Cluster Server Room
> >>>
> >>>
> >>> Thanks Jonathan!  That is exactly the kind of ballpark I needed!  I
> >>> don't need the vendors right now as we are still kicking around ideas.
> >>> If anyone would throw some specs or ideas out there, I'd appreciate it.
> >>> Here is a quick question:  Is planning for double your planned load a
> >>> good rule?  I would think that would be a good idea.  How about backup
> >>> cooling if the main unit dies?  The fire safe is one I had not thought of.
> >>> Dow
> >>>
> >>>
> >>> Jonathan Glass (IBB) wrote:
> >>>  
> >>>
> >>>> How big are the Opteron nodes?  Are they 1U, 2U, or 4U?  How big are
> >>>> the power supplies?  What is the maximum draw you expect?  Convert that
> >>>> number to figure out how much heat dissipation you'll need to handle.
> >>>>
> >>>> I have a 3-ton A/C unit in my roughly 14-15' x 14-15' server room, and
> >>>> the 24-33 node cluster I just spec'd out from IBM (1U, dual Opterons)
> >>>> was rated at a max heat dissipation (is this the right word?) of
> >>>> 18,000 BTU/hr.  According to my A/C guy, the 3-ton unit can handle a
> >>>> max of 36,000 BTU/hr, so I'm well inside my limits.  Getting the 3-ton
> >>>> unit installed in the drop-down ceiling, including installing new
> >>>> chilled water lines, was around $20K.
> >>>>
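
The cooling arithmetic above can be checked with two standard conversions: 1 W of load is about 3.412 BTU/hr of heat, and 1 ton of refrigeration is 12,000 BTU/hr. The sketch below uses only the figures quoted in the message; the function names are made up for illustration, and it ignores other heat sources in the room (lighting, people, UPS losses).

```python
BTU_PER_HR_PER_WATT = 3.412    # 1 W of load is about 3.412 BTU/hr of heat
BTU_PER_HR_PER_TON = 12_000    # 1 ton of refrigeration = 12,000 BTU/hr

def watts_to_btu_per_hr(watts):
    """Heat output in BTU/hr for a given electrical draw in watts."""
    return watts * BTU_PER_HR_PER_WATT

def tons_to_btu_per_hr(tons):
    """Cooling capacity in BTU/hr for an A/C unit rated in tons."""
    return tons * BTU_PER_HR_PER_TON

cluster_btu = 18_000                      # rated max for the cluster
capacity_btu = tons_to_btu_per_hr(3)      # the 3-ton unit
print(capacity_btu)                       # 36000
print(f"{cluster_btu / BTU_PER_HR_PER_WATT / 1000:.1f} kW of load")
print(f"headroom: {capacity_btu / cluster_btu:.1f}x")
```

Going the other way, watts_to_btu_per_hr() turns a measured or nameplate power draw into the cooling load to plan for.
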
> >>>> I do have sprinkler fire protection, but that room is set to release its
> >>>> water supply independently of the other rooms.  Also, supposedly, the fire
> >>>> sprinkler heads (whatever they're called) withstand considerably more
> >>>> heat than normal ones.  So, the reasoning goes, if it gets hot enough
> >>>> for those to go off, I have bigger problems than just water.  Thus, I
> >>>> have a fire safe nearby (in the same bldg...yeah, yeah, I know; off-site
> >>>> storage!) that holds my tapes, and will shortly hold a hardware
> >>>> inventory and admin password list on all my servers.
> >>>>
> >>>> If you want my list of vendors, send me an email off-list, or call my
> >>>> office, and I'll see if I can track down the DPOs for you.
> >>>>
> >>>> Thanks
> >>>>
> >>>> Jonathan Glass
> >>>>
> >>>> On Fri, 2004-04-09 at 17:35, Dow Hurst wrote:
> >>>>
> >>>>  
> >>>>
> >>>>> If I needed to take an existing space, 400 square feet w/8' 
> >>>>> ceiling (20'x20'x8'), and add A/C and fire protection for a server 
> >>>>> room, what kind of cost would be incurred?  Sounds like an algebra 
> >>>>> problem from high school, doesn't it?  Let's say a full 84" rack of 
> >>>>> 4-CPU Opteron nodes and supporting hardware were in the room.  Does 
> >>>>> anyone have any ballpark figures they could throw out there?  Any 
> >>>>> links I could be pointed to?
> >>>>> Thanks a bunch,
> >>>>> Dow
> >>>>>
> >>>>>
> >>>>> PS.  I'd like some other type of fire protection than sprinkler 
> >>>>> heads. ;-)
> >>>>>    
> >>>>
> 
> 
> _______________________________________________
> Ale mailing list
> Ale at ale.org
> http://www.ale.org/mailman/listinfo/ale


