[ale] Linux Cluster Server Room
Dow Hurst
dhurst at kennesaw.edu
Tue Apr 20 14:06:05 EDT 2004
Excellent point about hardware failure. I expect that quality hardware and
quality power make a difference over the long haul. I like a UPS that can
handle a full load for about 10 minutes if needed. Then, I'd like a backup
generator to kick in. The UPS would still need to condition the power into
something decent, though; that matters since diesel generators can produce
dirty power.
Longevity is really important in academic environments since money comes in
chunks that are few and usually far between. Normally academic hardware
doesn't get supported either, but I am fortunate in that my boss appreciates
the power of hardware support! We are using 10-year-old technology to run the
project here. Any hardware bought now would still be in use in 5-7 years for
sure.
Dow
Bjorn Dittmer-Roche wrote:
>
> On Tue, 20 Apr 2004, Jeffrey B. Layton wrote:
>
>
>>Well, my response is - it depends. How long is long? How important
>>is it to you? Can you checkpoint or modify the code to checkpoint?
>>Unfortunately, there are questions you have to answer. However, let
>>me give you some things I think about.
>>
>>We run CFD codes (Computational Fluid Dynamics) to explore
>>fluid flow over and in aircraft. The runs can last up to about 48
>>hours. Our codes checkpoint themselves, so if we lose the nodes
>>(or a node since we're running MPI codes), we just back up to the
>>last checkpoint. Not a big deal. However, if we didn't checkpoint,
>>I would think about it a bit. 48 hours is a long time. If the cluster
>>dies at 47:59 I would be very upset. However, if we're running
>>on a cluster with 256 nodes with UPS and if getting rid of UPS
>>means I can get 60 more nodes, then perhaps I could just run my
>>job on more nodes and get done faster (reducing the window
>>of vulnerability if you will).
>
>
> Jeff touches on an important point here: what happens when you lose one
> node? You should think about the hardware's MTBF and think about how often
> you will lose a single node and what the consequences of that are. If
> your computations run for a week without checkpoints and you have a lot of
> nodes, you will have to worry about hardware failure as well as power. So
> good coding practice involves checkpoints.
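
To put rough numbers on the MTBF point: if node failures are assumed to be
independent with a constant failure rate (an assumption; real hardware is
messier), a cluster of N nodes sees some node fail about every MTBF/N hours,
and a job of T hours finishes without a failure with probability
exp(-N*T/MTBF). A minimal Python sketch of that arithmetic follows. The
50,000-hour per-node MTBF is a made-up illustrative figure, not a vendor
number, and the function names are hypothetical.

```python
import math


def cluster_mtbf_hours(node_mtbf_hours: float, nodes: int) -> float:
    """Mean time between failures of *any* node in the cluster.

    Assumes independent nodes with a constant (exponential) failure rate.
    """
    return node_mtbf_hours / nodes


def survival_probability(job_hours: float, node_mtbf_hours: float, nodes: int) -> float:
    """Probability that no node fails while a job of job_hours is running."""
    return math.exp(-job_hours * nodes / node_mtbf_hours)


if __name__ == "__main__":
    node_mtbf = 50_000.0  # hypothetical per-node MTBF, in hours
    # A 48-hour run and a week-long run, both on 256 nodes.
    for nodes, hours in [(256, 48.0), (256, 168.0)]:
        print(
            f"{nodes} nodes, {hours:.0f} h job: "
            f"cluster MTBF {cluster_mtbf_hours(node_mtbf, nodes):.0f} h, "
            f"P(no node failure) {survival_probability(hours, node_mtbf, nodes):.2f}"
        )
```

The longer the run and the more nodes it spans, the smaller that probability
gets, which is the case for checkpointing regardless of how good the power is.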
>
> At the risk of getting flamed: Have you considered alternative
> multiprocessor machines from Sun, SGI and the like? These systems have
> great reliability and let you do things like put 60 G RAM on one machine.
>
>
>>You also need to think about how long the UPSes will last. If you
>>need to run 48 hours and the UPS kicks in at about 24 hours, will
>>the UPS last 24 hours? If not, you will lose the job anyway (with
>>no checkpointing) unless you get some really big UPSes. So in this
>>case, UPS won't help much. However, it would help if you were
>>only a few minutes away from completing a computation and
>>just needed to finish (if it's a long run, the odds are this scenario
>>won't happen often). If you could just touch a file and have your
>>code recognize this so it could quickly checkpoint, then a UPS
>>might be worth it (some of our codes do this).
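
The touch-a-file trick Jeff describes could look roughly like the Python
sketch below. This is not his code; it is a toy loop under stated
assumptions: the state is a small dict saved with pickle, the trigger file
path is whatever a (hypothetical) UPS monitoring script would touch on power
loss, and the job stops after the forced checkpoint so the node can be shut
down cleanly.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # a crash mid-write never clobbers the old checkpoint


def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)


def run(total_steps, checkpoint_path, trigger_path, every=1000):
    """Toy compute loop that resumes from a checkpoint if one exists."""
    if os.path.exists(checkpoint_path):
        state = load_checkpoint(checkpoint_path)
    else:
        state = {"step": 0, "value": 0.0}

    while state["step"] < total_steps:
        state["value"] += 1.0  # stand-in for one solver iteration
        state["step"] += 1

        # Someone (or a UPS script) ran `touch trigger_path`.
        requested = os.path.exists(trigger_path)
        if requested or state["step"] % every == 0:
            save_checkpoint(state, checkpoint_path)
        if requested:
            os.remove(trigger_path)
            return state  # stop here; rerun later to resume

    save_checkpoint(state, checkpoint_path)
    return state
```

Checking for the file once per iteration is cheap next to a real solver step,
and restarting the same command after power returns picks up from the saved
step rather than from zero.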
>
>
> Most power problems where I used to work were very brief. I don't know
> about what things are like here in Georgia, or whether or not you have
> backup generators, but a UPS that gives you 30 seconds will get you
> through a lot of tough spots and will save you from losing your
> computations because of a ten second power outage. If you want to ride
> over major blackouts, a small UPS and a generator will be more cost
> effective than a large UPS, but again, what's the point when your node
> MTBF is on the same order as the interval between power outages?
>
> bjorn
> _______________________________________________
> Ale mailing list
> Ale at ale.org
> http://www.ale.org/mailman/listinfo/ale
>
--
__________________________________________________________
Dow Hurst Office: 770-499-3428 *
Systems Support Specialist Fax: 770-423-6744 *
1000 Chastain Rd. Bldg. 12 *
Chemistry Department SC428 Email: dhurst at kennesaw.edu *
Kennesaw State University Dow.Hurst at mindspring.com *
Kennesaw, GA 30144 *
************************************************************
This message (including any attachments) contains *
confidential information intended for a specific individual*
and purpose, and is protected by law. If you are not the *
intended recipient, you should delete this message and are *
hereby notified that any disclosure, copying, distribution *
of this message, or the taking of any action based on it, *
is strictly prohibited. *
************************************************************