[ale] Linux Cluster Server Room

Dow Hurst dhurst at kennesaw.edu
Tue Apr 20 14:06:05 EDT 2004


Excellent point about hardware failure.  I expect that quality hardware and 
quality power make a difference over the long haul.  I like a UPS that can 
handle a full load for about 10 minutes if needed.  Then, I'd like a backup 
generator to kick in.  The UPS would need to massage the power into something
decent, though; that is important since diesel generators can produce dirty
power.  Longevity is really important in academic environments since money
comes in chunks that are few and usually far between.  Normally academic
hardware doesn't get supported either, but I am fortunate in that my boss
appreciates the power of hardware support!  We are using 10-year-old
technology to run the project here.  Any hardware bought now would still be
in use in 5-7 years for sure.
Dow


Bjorn Dittmer-Roche wrote:
> 
> On Tue, 20 Apr 2004, Jeffrey B. Layton wrote:
> 
> 
>>Well, my response is - it depends. How long is long? How important
>>is it to you? Can you checkpoint or modify the code to checkpoint?
>>Unfortunately, there are questions you have to answer. However, let
>>me give you some things I think about.
>>
>>We run CFD codes (Computational Fluid Dynamics) to explore
>>fluid flow over and in aircraft. The runs can last up to about 48
>>hours. Our codes checkpoint themselves, so if we lose the nodes
>>(or a node since we're running MPI codes), we just back up to the
>>last checkpoint. Not a big deal. However, if we didn't checkpoint,
>>I would think about it a bit. 48 hours is a long time. If the cluster
>>dies at 47:59 I would be very upset. However, if we're running
>>on a cluster with 256 nodes with UPS and if getting rid of UPS
>>means I can get 60 more nodes, then perhaps I could just run my
>>job on more nodes and get done faster (reducing the window
>>of vulnerability if you will).
> 
> 
> Jeff touches on an important point here: what happens when you lose one
> node? You should think about the hardware's MTBF and think about how often
> you will lose a single node and what the consequences of that are. If
> your computations run for a week without checkpoints and you have a lot of
> nodes, you will have to worry about hardware failure as well as power. So
> good coding practice involves checkpoints.
> 
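
To put rough numbers on the MTBF point above, here is a minimal sketch. It
assumes node failures are independent and exponentially distributed, which is
a simplification, and the 50,000-hour node MTBF is a made-up figure for
illustration, not a number from this thread. The function name is also just
for this example.

```python
import math


def p_any_node_failure(nodes: int, run_hours: float, node_mtbf_hours: float) -> float:
    """Chance that at least one node fails during a run.

    Assumes independent nodes with a constant failure rate of
    1 / node_mtbf_hours each (exponential model).
    """
    expected_failures = nodes * run_hours / node_mtbf_hours
    return 1.0 - math.exp(-expected_failures)


if __name__ == "__main__":
    ASSUMED_MTBF_HOURS = 50_000.0  # hypothetical per-node MTBF
    for hours in (48.0, 168.0):    # a 48-hour run and a week-long run
        p = p_any_node_failure(256, hours, ASSUMED_MTBF_HOURS)
        print(f"256 nodes, {hours:5.0f} h: P(at least one node fails) = {p:.2f}")
```

Under this model the risk depends only on total node-hours, so spreading the
same job over more nodes shortens the exposure to power outages but does not
by itself reduce the chance of a hardware failure during the run.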
> At the risk of getting flamed: Have you considered alternative
> multiprocessor machines from Sun, SGI and the like? These systems have
> great reliability and let you do things like put 60 GB of RAM on one machine.
> 
> 
>>You also need to think about how long the UPSes will last. If you
>>need to run 48 hours and the UPS kicks in at about 24 hours, will
>>the UPS last 24 hours? If not, you will lose the job anyway (with
>>no checkpointing) unless you get some really big UPSes. So in this
>>case, UPS won't help much. However, it would help if you were
>>only a few minutes away from completing a computation and
>>just needed to finish (if it's a long run, the odds are this scenario
>>won't happen often). If you could just touch a file and have your
>>code recognize this so it could quickly checkpoint, then a UPS
>>might be worth it (some of our codes do this).
> 
> 
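
The "touch a file" trick described above can be sketched roughly like this.
It is not Jeff's actual code; the file names, the state dictionary, and the
run() function are all hypothetical, and the loop body is a stand-in for real
work. The idea is that a UPS shutdown script (or an operator) touches the
sentinel file, and the job checkpoints and exits at the next step.

```python
import os
import pickle


def save_checkpoint(state, path):
    """Write to a temporary file, then rename, so a crash mid-write
    never leaves a truncated checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Resume from the last checkpoint, or start fresh if there is none."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "value": 0.0}


def run(total_steps, checkpoint_path, sentinel_path, every=1000):
    """Return (state, finished). Checkpoints every `every` steps, and
    checkpoints then stops early if the sentinel file appears."""
    state = load_checkpoint(checkpoint_path)
    while state["step"] < total_steps:
        state["value"] += 1.0  # stand-in for one step of real computation
        state["step"] += 1
        # One stat() call per step; a real code might check less often.
        stop_requested = os.path.exists(sentinel_path)
        if stop_requested or state["step"] % every == 0:
            save_checkpoint(state, checkpoint_path)
        if stop_requested:
            os.remove(sentinel_path)  # clear the request for the next run
            return state, False
    save_checkpoint(state, checkpoint_path)
    return state, True
```

With something like this in place, a UPS only has to hold the nodes up long
enough for one checkpoint to be written, not for the rest of the run.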
> Most power problems where I used to work were very brief. I don't know
> about what things are like here in Georgia, or whether or not you have
> backup generators, but a UPS that gives you 30 seconds will get you
> through a lot of tough spots and will save you from losing your
> computations because of a ten-second power outage. If you want to ride
> over major blackouts, a small UPS and a generator will be more cost
> effective than a large UPS, but again, what's the point when your node
> MTBF is on the same order as the time between power outages?
> 
> 	bjorn
> _______________________________________________
> Ale mailing list
> Ale at ale.org
> http://www.ale.org/mailman/listinfo/ale
> 

-- 
__________________________________________________________
Dow Hurst                  Office: 770-499-3428            *
Systems Support Specialist    Fax: 770-423-6744            *
1000 Chastain Rd. Bldg. 12                                 *
Chemistry Department SC428  Email:   dhurst at kennesaw.edu   *
Kennesaw State University         Dow.Hurst at mindspring.com *
Kennesaw, GA 30144                                         *
************************************************************
This message (including any attachments) contains          *
confidential information intended for a specific individual*
and purpose, and is protected by law.  If you are not the  *
intended recipient, you should delete this message and are *
hereby notified that any disclosure, copying, distribution *
of this message, or the taking of any action based on it,  *
is strictly prohibited.                                    *
************************************************************


