[ale] Linux HA

Jeremy T. Bouse jeremy.bouse at undergrid.net
Wed Oct 31 15:31:19 EDT 2007


Christopher Fowler wrote:
> I've been testing some stuff in regards to Linux HA today.  Normally we
> sell 2 servers.  One is a "master" and the other is a "slave".  I've
> been testing today the capability to use a floating IP address and allow
> the slave to take over for the master.  I have a few issues that do need
> to be resolved before I can roll this out.  In my lab and colo I
> experienced 2 issues that HA could not have saved me from.
>
> #1.  Kernel not responding.
>
> In this case I can ping the server.  All connect()'s from clients
> seem to hang until they timeout.  In this scenario my slave will take
> the IP address but the master will still have it and still answer pings.
> Also he will still answer arp requests.  HA can't save me here.
>   
    You can use a watchdog that will reboot the system if nothing writes
to the watchdog device within its timeout. That should keep a hung
kernel from holding up the HA failover.
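    A minimal sketch of the userspace side of that idea, in Python. The
function name, the healthy() callback and the 10-second interval are
assumptions for illustration only; /dev/watchdog is the conventional
Linux device node, but check what your driver and timeout actually are.

```python
import time


def feed_watchdog(dev, healthy, interval=10.0, sleep=time.sleep):
    """Write a keepalive to the watchdog device while healthy() is true.

    dev      -- an open binary file object for the watchdog device
    healthy  -- hypothetical callable; return False to let the timer expire
    interval -- seconds between keepalives; must be below the driver timeout
    Returns the number of keepalives written.
    """
    beats = 0
    while healthy():
        dev.write(b"\0")   # any write resets the hardware/software timer
        dev.flush()
        beats += 1
        sleep(interval)
    # We stop writing here. If the device stays open (or the driver keeps
    # the timer armed on close), the timer runs out and the box resets.
    return beats


# Example use on a real node (assumed path, needs root):
#   with open("/dev/watchdog", "wb", buffering=0) as dev:
#       feed_watchdog(dev, healthy=lambda: True)
```

If the kernel hangs, this loop stops running along with everything
else, no more writes reach the device, and the timer resets the machine.
What happens when the device is closed depends on the driver, so test
that on your hardware before relying on it.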
> #2.  Kernel and programs still respond but disks are off
>
> In this case I/O to drives was hosed.  Apache would serve up pages that
> were in memory but any request for a page on disk would result in that
> connection hanging forever.  No I/O possible.  In this scenario the
> heartbeat agent will probably still see a server that is working but the
> reality would be a DoS condition.  Also upon seeing this issue I'm still
> left with a server who will not relinquish his IP address.
>   
    What kind of I/O are you using? I believe that if you're using Fibre
Channel you can use the hbaping option to include your device as a ping
node for heartbeat to check.
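    For storage that is not on Fibre Channel, one alternative is a local
probe that does a small write plus fsync and treats "did not finish in
time" as a failure. This is not a heartbeat feature, just a sketch; the
function name, the probe path and the 5-second timeout are all
assumptions.

```python
import os
import threading


def disk_io_ok(path, timeout=5.0):
    """Return True if a small write+fsync to `path` finishes within `timeout`."""
    result = {"ok": False}

    def probe():
        try:
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
            try:
                os.write(fd, b"probe")
                os.fsync(fd)  # push the data to the device, not just the cache
            finally:
                os.close(fd)
            result["ok"] = True
        except OSError:
            pass  # an I/O error counts as a failed probe

    # Run the probe in a daemon thread so a hung write cannot block the caller.
    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout)
    return result["ok"] and not worker.is_alive()
```

The probe runs in its own thread because a write to a dead disk may
never return; the caller only waits `timeout` seconds and then reports
failure. A False result could be used to stop feeding the watchdog or to
tell heartbeat to give up its resources.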
> In both cases it seems my only recourse is to allow my slave to also
> control the power of the master.  If #1 and #2 exist the slave can
> simply take the floating IP and make a determination if he needs to kill
> power.  If so he can kill power and then the master can be repaired.
>
> Ideas?
>   
    Are you using a cross-connect cable and a serial cable as separate
links to monitor your cluster nodes? What about a PDU that can be
controlled through a STONITH module?
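    The ordering matters with STONITH: the peer is confirmed powered off
first, and only then does the survivor claim the floating IP, so two
nodes never answer ARP for it at the same time. A rough sketch of that
ordering; fence_peer and acquire_ip are hypothetical callables standing
in for whatever drives your PDU and brings up the address.

```python
def take_over(fence_peer, acquire_ip, max_attempts=3):
    """Fence the peer, and claim the floating IP only if fencing succeeded.

    fence_peer -- hypothetical callable; returns True once power is confirmed off
    acquire_ip -- hypothetical callable; brings up the floating address locally
    Returns True if the takeover happened, False if fencing never succeeded.
    """
    for _ in range(max_attempts):
        if fence_peer():
            acquire_ip()
            return True
    # Fencing failed every time: do not take the address, since the peer
    # may still be holding it.
    return False
```

Refusing to take the address when fencing fails looks unhelpful, but it
avoids the situation described above where both machines hold the IP.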
> Chris


