
The role of fencing in high availability
Fencing is an essential feature of clustered computer systems for maintaining stability and reliability. In such environments, multiple interconnected nodes work together to provide redundancy and high availability.
However, failures are inevitable, and nodes can malfunction or go offline. Fencing isolates the faulty node, preventing it from disrupting the cluster's normal operations. This ensures the malfunctioning node does not cause data corruption, erroneous resource contention, or other issues, allowing the remaining members of the cluster to continue functioning correctly.
STONITH
STONITH, an acronym for Shoot The Offending Node In The Head (or Shoot The Other Node In The Head), is a technique for implementing fencing. As the name indicates, it isolates the failed node by shutting it down or powering it off.
Depending on the use case and the degree of isolation required, fencing can be implemented at the software level as well as the hardware level.
Fencing with the IPMI protocol
For isolating nodes at the hardware level, most server hardware vendors support a widely adopted industry protocol: the Intelligent Platform Management Interface (IPMI).
IPMI is a message-based hardware management interface protocol. It is implemented by the server's Baseboard Management Controller (BMC), essentially an embedded computer module that sits alongside the server's hardware boards and is specifically designed to monitor and control the server hardware.
Beyond isolating a whole node, IPMI provides detailed hardware-level monitoring and control capabilities that can isolate individual subsystems of the server hardware. Most Linux distributions provide built-in or plugin support for IPMI through tools such as ipmitool or through IPMI libraries.
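For example, with the widely available ipmitool utility, a faulty node can be fenced over the network by talking to its BMC. The commands below are a minimal sketch; the BMC address and credentials are placeholders to be adjusted for your environment.

# Check the node's current power state via its BMC
# (192.0.2.10, admin, and secret are placeholder values)
ipmitool -I lanplus -H 192.0.2.10 -U admin -P secret chassis power status

# Fence the node by forcing an immediate power-off
ipmitool -I lanplus -H 192.0.2.10 -U admin -P secret chassis power off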
To implement fencing for our Fujitsu Enterprise Postgres High Availability cluster, IPMI can be used to isolate a node when the database or the node itself becomes unresponsive to the other nodes due to a fault. This approach has two requirements. First, the host machines in the cluster must be IPMI-compatible. Second, the host machines should be dedicated to hosting the database only, because isolating a host at the hardware level will also take down any co-hosted applications. IPMI is therefore a suitable fencing mechanism for a Fujitsu Enterprise Postgres High Availability cluster when the primary, standby, and arbiter are each hosted on a separate physical server, which makes it a good fit for many on-premises environments.
Fencing with Fujitsu Enterprise Postgres High Availability in the cloud
When we plan to implement a Fujitsu Enterprise Postgres High Availability cluster in the cloud, our database nodes are VMs. If we want to implement fencing, we need to isolate individual VMs, not the full host machine, which may be hosting other VMs.
IPMI would not work in this scenario, as it does not provide isolation for individual VMs. And even if we requested a dedicated host for the Fujitsu Enterprise Postgres cluster in the cloud, obtaining an IPMI-compatible host machine would be a considerably costly proposition.
So, the way to implement fencing is to use STONITH at the VM level and isolate the VM when the database becomes unresponsive or faulty due to some error. The VM can be isolated by calling the shutdown utility of the respective cloud provider's CLI from a Linux script and then performing the appropriate action based on the response of the call.
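In outline, such a fencing script follows a simple pattern, sketched below. Here stop_vm and vm_status are hypothetical placeholders for the shutdown and status-query commands of whichever cloud CLI is in use; the full AWS version is shown later in this article.

# Generic shape of a VM-level fencing script;
# stop_vm and vm_status are placeholders, not real commands
target_vm="$1"
stop_vm "${target_vm}" || exit 1   # request shutdown; abort on failure
while [ "$(vm_status "${target_vm}")" != "stopped" ]
do
    sleep 3   # poll until the VM has powered off
done
exit 0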
Mirroring Controller – fencing script for AWS
Fujitsu Enterprise Postgres ensures high availability through the use of Mirroring Controller and Server Assistant components. If the primary database server becomes unreachable or the application network between the primary and standby servers goes down, it's necessary to isolate the primary server. This is to prevent a split-brain scenario, where both database servers could temporarily function as primary servers, potentially leading to data corruption and inconsistencies.
To avoid this split-brain situation, the Mirroring Controller uses a script to fence off the primary server showing abnormal behaviour, before automatically failing over to the standby server. Fujitsu Enterprise Postgres offers a sample fencing script that is compatible with physical servers using an IPMI console.
As more customers show interest in deploying Fujitsu Enterprise Postgres in the cloud, our team has developed a shell script to fence the database server hosted on AWS cloud. The prerequisites for this are:
- AWS CLI
- An IAM user with the authority to start and stop the AWS EC2 instances (a minimal policy sketch follows this list)
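As a rough illustration, the IAM user's permissions can be restricted to just the calls the script makes. The command below is a minimal sketch: the user name and policy name are placeholders, and your security standards may call for tighter resource ARNs than the wildcard shown.

# Attach a minimal inline policy to the (placeholder) fencing user
aws iam put-user-policy --user-name fencing-user --policy-name fencing-ec2 --policy-document '{"Version": "2012-10-17", "Statement": [{"Effect": "Allow", "Action": ["ec2:DescribeInstances", "ec2:StopInstances", "ec2:StartInstances"], "Resource": "*"}]}'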
This shell script uses the IAM user credentials for authentication and checks the status of the reported server. It then stops the primary server and continues to monitor its status until it has completely stopped.
#!/bin/bash
# Notes
# This is a sample fencing command script for the Mirroring Controller
# arbitration server on Linux. It shuts down the unstable database
# server using the AWS CLI utility.
# For details of the AWS CLI utility specification,
# please see the online reference manual for the AWS CLI.
# Remarks
# If supplying plain-text credentials to the AWS CLI prevents your system
# from satisfying your security criteria, you should not use this sample.
printf "\n\t Fencing Command invoked at: $(date) \n\n" >> /mc/hb_chg.log
printf "Fencing Trigger: $1 \n" >> /mc/hb_chg.log
printf "Failover Action: $2 \n" >> /mc/hb_chg.log
printf "Server Identity: $3 \n\n" >> /mc/hb_chg.log
# Needs to be modified for your environment
srv1ident="server1" # Server identity of Mirroring Controller
srv2ident="server2" # Server identity of Mirroring Controller
ec2_db_inst1_id="i-003a33d235b4e1da3"
ec2_db_inst2_id="i-05a0861de448fa570"
check_interval=3 # Interval (in seconds) for checking the power-off status
# Leftover settings from the IPMI-based sample, kept for reference
#srv1addr="192.0.4.100" # Remote server address of srv1ident for IPMItool
#srv2addr="192.0.4.110" # Remote server address of srv2ident for IPMItool
#logdir="/var/tmp/work"
#logfile="${logdir}/fencing.$(date '+%Y%m%d%H%M%S').log"
# Immutable variables
#pw_off=0
#pw_on=1
#pw_unknown=2
#ipmi_address=""
#pwstat=${pw_off}
srvident=$3
printf "Sry Ident: ${srvident} \n\n" >> /mc/hb_chg.log
### Functions
# Get and log the current status of the target EC2 instance
ec2_db_inst_status()
{
    local inst_stat
    inst_stat=$(/usr/local/bin/aws ec2 describe-instances --instance-ids ${target_inst_id} --query "Reservations[*].Instances[*].[State.Name]" --output text)
    printf "\n\n DB instance status: ${inst_stat} \n" >> /mc/hb_chg.log
    #return ${inst_stat}
}
# Stop the target EC2 instance and wait until it is fully stopped
stop_db_inst()
{
    printf "\n\n\t Stopping the current Primary Server... \n\n" >> /mc/hb_chg.log
    /usr/local/bin/aws ec2 stop-instances --instance-ids ${target_inst_id} --output text >> /mc/hb_chg.log
    # Wait for the target server to power down before returning
    while true
    do
        stop_stat=$(/usr/local/bin/aws ec2 describe-instances --instance-ids ${target_inst_id} --query "Reservations[*].Instances[*].[State.Name]" --output text)
        printf "\n Primary server status at $(date) : ${stop_stat} \n" >> /mc/hb_chg.log
        if [ "${stop_stat}" = "stopped" ]; then
            printf "\n\n Current Primary Server - Stopped \n\n" >> /mc/hb_chg.log
            exit 3
        fi
        /bin/sleep ${check_interval}
    done
}
### Start
# trap 'putlog "received signal" ; exit 2' 1 2 3 6 7 11 13 15
# start $*
if [ "${srvident}" = "${srv1ident}" ];then
target_inst_id=$fec2_db_inst1_1(0.
elif [ "${srvident}" = "${srv2ident}" ];then
target_inst_id=${ec2_db_inst2_id}
else
printf "Unknown server identity 1{srvident}— >> /mc/hb_chg.log
exit 2
fi
# Get the current status of target server
ec2_db_inst_status
# Stop the primary DB server
stop_db_inst
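As the logging at the top of the script indicates, Mirroring Controller passes the fencing trigger, the failover action, and the server identity as the three positional arguments. For a manual dry run against the placeholder identities above, an invocation might look like the following, where the script name and the bracketed values are illustrative:

# Hypothetical manual test; "server1" must match srv1ident above
./fencing_aws.sh <fencing-trigger> <failover-action> server1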