[ale] Linux HA
Michael H. Warfield
mhw at WittsEnd.com
Wed Oct 31 16:45:29 EDT 2007
On Wed, 2007-10-31 at 16:35 -0400, James P. Kinney III wrote:
> On Wed, 2007-10-31 at 15:28 -0400, Charles Shapiro wrote:
> > I believe that what you are looking for is what one ALE lecturer
> > called a "STONITH Interface" ("Shoot the Other Node in the Head").
> And the poor man's STONITH interface - the X10 appliance module.
And that is no joke, brother! Amen!
If you get some of the intelligent interfaces, you can even program in
some safety macros. I've got a site with 6 boxes on X-10 controller
modules and three serial X-10 controller interfaces, and if they see a
machine commanded off, they can order it back on after a period of time.
That can be a good thing or a bad thing, depending on your application,
but in this case, if you killed the master system and it comes back up
and recovers, it should naturally become the slave, swapping roles and
allowing you to access it and determine what went wrong.
The MAIN reason I put the "dead man switch" (or what I like to call the
"Mars Rover Reboot Recovery" logic) in there was on the off chance that
some stray X-10 command would result in an "all off", or maybe a
programming error shooting the wrong node in the head. :-/ Safety
first. That has never happened, but those X-10 controllers have saved my
BUTT on several occasions.
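For anyone who wants to see the shape of that "order it back on after a
while" logic, here is a minimal sketch in Python. It is not the macro
that runs on the real controllers; the class name, the hold-off value,
the house/unit codes, and the send_on callback are all made up for
illustration, and actually talking to an X-10 interface is left to
whatever you plug in as send_on.

```python
import time
from typing import Callable, Dict, List


class DeadManSwitch:
    """Turn a node back on once it has been off for `hold_off` seconds.

    Hypothetical sketch: `send_on` stands in for whatever actually
    transmits an X-10 ON command; it is not a real library call.
    """

    def __init__(self, hold_off: float, send_on: Callable[[str], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.hold_off = hold_off
        self.send_on = send_on
        self.clock = clock
        # node address -> time at which we first saw it commanded off
        self.off_since: Dict[str, float] = {}

    def saw_off(self, node: str) -> None:
        """Record an OFF command; repeated OFFs do not restart the timer."""
        self.off_since.setdefault(node, self.clock())

    def saw_on(self, node: str) -> None:
        """Someone already turned the node back on; stop tracking it."""
        self.off_since.pop(node, None)

    def tick(self) -> List[str]:
        """Send ON to every node whose hold-off has expired; return them."""
        now = self.clock()
        due = [n for n, t in self.off_since.items()
               if now - t >= self.hold_off]
        for node in due:
            self.send_on(node)
            del self.off_since[node]
        return due
```

Call saw_off()/saw_on() from whatever watches the power line traffic and
tick() from a periodic loop. A stray "all off" just shows up as one
saw_off() per node, and everything comes back after the hold-off.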
Redundancy is definitely your friend. Make sure each machine has an
interface to control all the controllers. I have needed the triple
redundancy on occasion.
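The failover side of that redundancy can be sketched as "try each
controller interface in turn until one accepts the command". This is
only an illustration in Python: the function name, the port labels, and
the idea that each interface is a callable returning True on success
are assumptions, not how any particular X-10 software works.

```python
from typing import Callable, Optional, Sequence, Tuple

# (label, function that sends OFF to a node and returns True on success)
Controller = Tuple[str, Callable[[str], bool]]


def power_off_via_any(node: str,
                      controllers: Sequence[Controller]) -> Optional[str]:
    """Try each controller interface in order; return the label of the
    first one that reports success, or None if every one failed."""
    for label, send_off in controllers:
        try:
            if send_off(node):
                return label
        except OSError:
            # Serial port missing or interface hung: move on to the next.
            continue
    return None
```

With three interfaces in the list, two of them can be dead and the
shot still gets through. If it returns None, the slave should not
assume the master is off.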
Mike
> > -- CHS
> >
> >
> > On 10/31/07, Christopher Fowler <cfowler at outpostsentinel.com> wrote:
> > I've been testing some stuff in regards to Linux HA today. Normally
> > we sell 2 servers. One is a "master" and the other is a "slave".
> > I've been testing today the capability to use a floating IP address
> > and allow the slave to take over for the master. I have a few issues
> > that do need to be resolved before I can roll this out. In my lab
> > and colo I experienced 2 issues that HA could not have saved me from.
> >
> > #1. Kernel not responding.
> >
> > In this case I can ping the server. All connect()'s from clients
> > seem to hang until they timeout. In this scenario my slave will take
> > the IP address but the master will still have it and still answer
> > pings. Also he will still answer arp requests. HA can't save me here.
> >
> > #2. Kernel and programs still respond but disks are off
> >
> > In this case I/O to drives was hosed. Apache would serve up pages
> > that were in memory but any request in a page on disk would result
> > in that connection hanging forever. No I/O possible. In this
> > scenario the heartbeat agent will probably still see a server that
> > is working but the reality would be a DoS condition. Also upon
> > seeing this issue I'm still left with a server who will not
> > relinquish his IP address.
> >
> > In both cases it seems my only recourse is to allow my slave to also
> > control the power of the master. If #1 and #2 exist the slave can
> > simply take the floating IP and make a determination if he needs to
> > kill power. If so he can kill power and then the master can be
> > repaired.
> >
> > Ideas?
> >
> > Chris
> >
> >
> >
> > _______________________________________________
> > Ale mailing list
> > Ale at ale.org
> > http://www.ale.org/mailman/listinfo/ale
--
Michael H. Warfield (AI4NB) | (770) 985-6132 | mhw at WittsEnd.com
/\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0xDF1DD471 | possible worlds. A pessimist is sure of it!