
A customer keen on using Mirroring Controller enquired about its usage without an arbitration server. So the Center of Excellence team went to work and tested 10 different scenarios of a Fujitsu Enterprise Postgres environment running without an arbitration server, and then performed the test of the same 10 scenarios in an environment with an artibtation server.
The testing highlighted the effectiveness of the arbitration server in preventing split brain scenarios - more on this later.
But before proceeding, let me take a step back and explain some of the Fujitsu Enterprise Postgres components discussed here.
What is Mirroring Controller?
The Mirroring Controller is a module within Fujitsu Enterprise Postgres that continually monitors the system to switch online transaction processing to a standby server seamlessly in case an abnormality is detected.
This feature ensures business continuity and is a must-have for high-availability, mission-critical systems, which by definition require maximum uptime. It accomplishes this through system monitoring and automatic switch/disconnect.
What is the arbitration server?
Fujitsu Enterprise Postgres provides the Server Assistant component, which determines the status of database servers as a third party, and if necessary, isolates affected databases. The server running the Server Assistant - the arbitration server - evaluates the status of the primary server reported by the standby server (or vice versa), and takes the appropriate action, such as fencing the malfunctioning server.
How does the arbitration server prevent split brain scenarios?
The term split brain refers to a scenario when multiple nodes within a cluster simultaneously assume they are the primary node. This situation can arise when a backup server transitions to primary status while the original primary server recovers and reconnects.
If each server considers it is active and commits any transactions that the other does not, then you cannot easily resynchronize the databases and merge the information on each server.
The Mirroring Controller arbitration server plays a vital role in addressing this issue. Within a master-standby configuration, it monitors server health and intervenes when detecting anomalies, effectively preventing data inconsistency between nodes.
Test case scenarios
Here are the scenarios that we tested, first without an arbitration server, and then with an arbitration server.
# | Test case scenario | Result without arbitration server | Result with arbitration server |
1 | Primary is stopped; Linux OS is shut down |
Failed over, as expected. Arbitration_cmd and Fencing_cmd are executed. |
Failed over, as expected. Fencing_cmd is executed. |
2 | Primary database instance is brought down using pg_ctl stop |
Failed over, as expected. Arbitration_cmd and Fencing_cmd are not executed. |
Failed over, as expected. Fencing_cmd is not executed. |
3 | Postmaster process is killed on primary |
Failed over, as expected. Arbitration_cmd and Fencing_cmd are not executed. |
Failed over, as expected. Fencing_cmd is not executed. |
4 | Storage error: Data storage full on primary |
Failed over, as expected. Arbitration_cmd and Fencing_cmd are not executed. |
Failed over, as expected. Fencing_cmd is not executed. |
5 | Storage error: WAL directory full on primary |
Failed over, as expected. Arbitration_cmd and Fencing_cmd are not executed. |
Failed over, as expected. Fencing_cmd i not executed. |
6 | Mirroring Controller is stopped on primary using mc_ctl stop -M /mc |
No failover, as expected. Arbiter and fencing scripts are not executed. |
No failover, as expected. Fencing_cmd is not executed. |
7 | Mirroring Controller is forcibly stopped on primary using mc_ctl stop -M /mc -e |
No failover, as expected. Arbiter and fencing scripts are not executed. Database instance on primary is up and running. |
No failover, as expected. Fencing script is not executed. Database instance on primary is up and running without any event logged for crash. |
8 | Mirroring Controller processes (mc_agent and mc_keeper) are killed and Mirroring Controller is stopped, but database instance is kept running |
No failover, as expected. Arbiter and fencing scripts are not executed. Database instance on primary is up and running. |
No failover, as expected. Fencing script is not executed. Database instance on primary is up and running without any event logged for crash. |
9 | Network error: Network (streaming and admin) is disrupted between primary and standby. |
Failed over, as expected. Arbitration and fencing scripts are executed, but unable to fence due to network failure. Split brain scenario is observed. |
Failed over, as expected Fencing script executed, as it goes through arbitration network. Split-brain scenario is prevented. |
10 | Network error: Arbitration (eth1) network is disrupted; bit streaming replication and admin network is configured on other network (eth0). |
Not applicable, as only one network is configurable, when there is no arbitration server. |
No failover, as expected. Fencing script is not executed. |
Test results with Mirroring Controller but without arbitration server
We tested the following use case scenarios with Mirroring Controller without an arbitration server. Logs of the execution are provided for reference.
Scenario 1: Primary is stopped; Linux OS is shut down
Result - Server status: Failed over, as expected. Arbitration_cmd and Fencing_cmd are executed.
[fepuser@fep15-ha-server1: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
switched
server_id host_role host host_status db_proc_status disk_status
-----------------------------------------------------------------------------------
server1 primary 192.168.100.5 normal abnormal(wal_sender) normal
server2 unknown 192.168.100.6 abnormal unknown unknown
[fepuser@fep15-ha-server1: ~]$
[fepuser@fep15-ha-server1: ~]$
[fepuser@fep15-ha-server1: ~]$ cat /mc/hb_chg.log
Arbitration Command executed at: Thu Feb 1 16:58:58 AEDT 2024
Fencing Command executed at: Thu Feb 1 16:58:58 AEDT 2024
[fepuser@fep15-ha-server1: ~]$
Standby (new primary) Mirroring Controller log
Feb 1 16:58:55 fep15-ha-server1 MirroringControllerOpen[6225]: WARNING: detected an error on the monitored object "server(server2)": no response:ping timeout (MCA00019)
Feb 1 16:58:56 fep15-ha-server1 MirroringControllerOpen[6225]: WARNING: detected an error on the monitored object "server(server2)": no response:ping timeout (MCA00019)
Feb 1 16:58:58 fep15-ha-server1 MirroringControllerOpen[6225]: WARNING: detected an error on the monitored object "server(server2)": no response:ping timeout (MCA00019)
Feb 1 16:58:58 fep15-ha-server1 MirroringControllerOpen[6225]: LOG: executing arbitration command and inquiring the result of arbitration (MCA00191)
Feb 1 16:58:58 fep15-ha-server1 MirroringControllerOpen[6225]: LOG: starting degenerate because of result of arbitration command for database server "server2" result:"0" (MCA00193)
Feb 1 16:58:58 fep15-ha-server1 MirroringControllerOpen[6225]: LOG: executing fencing command (MCA00192)
Feb 1 16:58:58 fep15-ha-server1 MirroringControllerOpen[6225]: LOG: fencing command for database server "server2" succeeded: result:"0" (MCA00198)
Feb 1 16:59:00 fep15-ha-server1 MirroringControllerOpen[6225]: LOG: promotion processing completed (MCA00062)
Standby (new primary) database log
2024-02-01 16:58:58.719 AEDT [6256] LOG: received promote request
2024-02-01 16:58:58.720 AEDT [6259] FATAL: terminating walreceiver process due to administrator command
2024-02-01 16:58:58.720 AEDT [6256] LOG: invalid record length at 0/18000060: wanted 24, got 0
2024-02-01 16:58:58.720 AEDT [6256] LOG: redo done at 0/18000028 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 97.78 s
2024-02-01 16:58:58.758 AEDT [6256] LOG: selected new timeline ID: 5
2024-02-01 16:58:59.587 AEDT [6256] LOG: archive recovery complete
2024-02-01 16:59:00.178 AEDT [6254] LOG: checkpoint starting: force
2024-02-01 16:59:00.180 AEDT [6252] LOG: database system is ready to accept connections
2024-02-01 16:59:00.917 AEDT [6252] LOG: received SIGHUP, reloading configuration files
2024-02-01 16:59:03.274 AEDT [6254] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.194 s, total=3.097 s; sync files=2, longest=0.099 s, average=0.097 s; distance=16384 kB, estimate=16384 kB
Scenario 2: Primary database instance is brought down using pg_ctl stop
Result: Failed over, as expected. Arbitration_cmd and Fencing_cmd is not executed.
[fepuser@fep15-ha-server1: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
switched
server_id host_role host host_status db_proc_status disk_status
---------------------------------------------------------------------------------------------------
server1 primary 192.168.100.5 normal abnormal(wal_sender) normal
server2 none(inactivated primary) 192.168.100.6 normal abnormal(postmaster) normal
[fepuser@fep15-ha-server1: ~]$
Primary database log
2024-02-01 16:30:13.928 AEDT [356137] LOG: received fast shutdown request
2024-02-01 16:30:13.934 AEDT [356137] LOG: aborting any active transactions
2024-02-01 16:30:13.935 AEDT [356146] LOG: user profile status writer shutting down
2024-02-01 16:30:13.940 AEDT [356137] LOG: background worker "user profile status writer" (PID 356146) exited with exit code 1
2024-02-01 16:30:13.940 AEDT [356137] LOG: background worker "logical replication launcher" (PID 356147) exited with exit code 1
2024-02-01 16:30:13.941 AEDT [356139] LOG: shutting down
2024-02-01 16:30:13.986 AEDT [356139] LOG: checkpoint starting: shutdown immediate
2024-02-01 16:30:14.018 AEDT [356139] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.046 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16383 kB, estimate=16383 kB
2024-02-01 16:30:14.056 AEDT [356137] LOG: database system is shut down
Primary Mirroring Controller log
Feb 1 16:30:14 fep15-ha-server2 MirroringControllerOpen[356515]: WARNING: detected an error on the monitored object "database process(postmaster)": no response:connection to server at "::1", port 27500 failed: Connection refused#012#011Is the server running on that host and accepting TCP/IP connections?#012 (MCA00019)
Feb 1 16:30:14 fep15-ha-server2 MirroringControllerOpen[356515]: LOG: starting to fail over from server2 to server1 (MCA00021)
Feb 1 16:30:14 fep15-ha-server2 MirroringControllerOpen[356515]: LOG: fail over completed.switched from server2 to server1 (MCA00022)
Standby Mirroring Controller log
Feb 1 16:30:14 fep15-ha-server1 MirroringControllerOpen[373364]: LOG: promotion processing completed (MCA00062)
Scenario 3: Postmaster process is killed on primary
Result: Failed over, as expected. Arbitration_cmd and Fencing_cmd is not executed.
[fepuser@fep15-ha-server1: /mc]$ mc_ctl status -M /mc/
mirroring status
-----------------
switched
server_id host_role host host_status db_proc_status disk_status
---------------------------------------------------------------------------------------------------
server1 none(inactivated primary) 192.168.100.5 normal abnormal(postmaster) normal
server2 primary 192.168.100.6 normal abnormal(wal_sender) normal
[fepuser@fep15-ha-server1: /mc]$
Primary Mirroring Controller log
Feb 1 16:13:12 fep15-ha-server1 MirroringControllerOpen[371058]: WARNING: detected an error on the monitored object "database process(postmaster)": no response:connection to server at "::1", port 27500 failed: Connection refused#012#011Is the server running on that host and accepting TCP/IP connections?#012 (MCA00019)
Feb 1 16:13:12 fep15-ha-server1 MirroringControllerOpen[371058]: LOG: starting to fail over from server1 to server2 (MCA00021)
Feb 1 16:13:13 fep15-ha-server1 MirroringControllerOpen[371058]: LOG: fail over completed.switched from server1 to server2 (MCA00022)
Standby Mirroring Controller log
Feb 1 16:13:13 fep15-ha-server2 MirroringControllerOpen[352280]: LOG: promotion processing completed (MCA00062)
Scenario 4: Storage error - Data Storage full on primary
Result: Failed over, as expected. Arbitration_cmd and Fencing_cmd is not executed.
Test case scenario execution log
[fepuser@fep15-ha-server2: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
switched
server_id host_role host host_status db_proc_status disk_status
--------------------------------------------------------------------------------------------------------------------------
server1 none(inactivated primary) 192.168.100.5 normal abnormal(postmaster) abnormal(data,tran_log,tablespace)
server2 primary 192.168.100.6 normal abnormal(wal_sender) normal
[fepuser@fep15-ha-server2: ~]$
Primary log
Feb 1 11:29:11 fep15-ha-server1 MirroringControllerOpen[5955]: WARNING: detected an error on the monitored object "data storage destination directory(/database/inst1)": read/write error:exception=class java.io.IOException message=No space left on device (MCA00019)
Feb 1 11:29:11 fep15-ha-server1 MirroringControllerOpen[5955]: WARNING: detected an error on the monitored object "transaction log storage destination directory(/database/inst1/pg_wal)": read/write error:exception=class java.io.IOException message=No space left on device (MCA00019)
Feb 1 11:29:11 fep15-ha-server1 MirroringControllerOpen[5955]: WARNING: detected an error on the monitored object "tablespace directory(/pg_tblspc/enc_tblspc)": read/write error:exception=class java.io.IOException message=No space left on device (MCA00019)
Feb 1 11:29:11 fep15-ha-server1 rsyslogd[888]: imjournal: fopen() failed for path: '/var/lib/rsyslog/imjournal.state.tmp': No space left on device [v8.2102.0-5.el8 try https://www.rsyslog.com/e/2013 ]
Feb 1 11:29:12 fep15-ha-server1 MirroringControllerOpen[5955]: WARNING: detected an error on the monitored object "tablespace directory(/pg_tblspc/enc_tblspc)": read/write error:exception=class java.io.IOException message=No space left on device (MCA00019)
Feb 1 11:29:13 fep15-ha-server1 MirroringControllerOpen[5955]: WARNING: detected an error on the monitored object "tablespace directory(/pg_tblspc/enc_tblspc)": read/write error:exception=class java.io.IOException message=No space left on device (MCA00019)
Feb 1 11:29:13 fep15-ha-server1 MirroringControllerOpen[5955]: LOG: starting to fail over from server1 to server2 (MCA00021)
Feb 1 11:29:13 fep15-ha-server1 MirroringControllerOpen[5955]: LOG: stopped database instance forcibly (MCA00064)
Feb 1 11:29:13 fep15-ha-server1 MirroringControllerOpen[5955]: LOG: fail over completed.switched from server1 to server2 (MCA00022)
Standby (new primary) log
Feb 1 11:29:13 fep15-ha-server2 MirroringControllerOpen[376691]: LOG: promotion processing completed (MCA00062)
Primary database log
2024-02-01 11:27:30.462 AEDT [9876] ERROR: could not create file "base/5/16522_vm": No space left on device
2024-02-01 11:27:30.462 AEDT [9876] CONTEXT: COPY pgbench_accounts, line 1000
2024-02-01 11:27:30.462 AEDT [9876] STATEMENT: copy pgbench_accounts from stdin with (freeze on)
2024-02-01 11:28:10.854 AEDT [5984] LOG: checkpoint starting: time
2024-02-01 11:28:13.244 AEDT [5984] LOG: checkpoint complete: wrote 24 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=2.319 s, sync=0.032 s, total=2.390 s; sync files=24, longest=0.021 s, average=0.002 s; distance=202 kB, estimate=202 kB
2024-02-01 11:29:09.692 AEDT [10377] WARNING: could not extend file "base/5/16538", but retrying: No space left on device
2024-02-01 11:29:09.692 AEDT [10377] CONTEXT: COPY pgbench_accounts, line 2000
2024-02-01 11:29:13.121 AEDT [5982] LOG: received immediate shutdown request
2024-02-01 11:29:13.136 AEDT [5982] LOG: database system is shut down
~
Scenario 5: Storage error - WAL directory full on primary
[fepuser@fep15-ha-server2: /run/pg_wal]$ df -h .
Filesystem Size Used Avail Use% Mounted on
tmpfs 1.9G 1.9G 1.9M 100% /run
[fepuser@fep15-ha-server2: /run/pg_wal]$
[fepuser@fep15-ha-server1: /database/inst1/log]$ mc_ctl status -M /mc/
mirroring status
-----------------
switched
server_id host_role host host_status db_proc_status disk_status
---------------------------------------------------------------------------------------------------
server1 primary 192.168.100.5 normal abnormal(wal_sender) normal
server2 none(inactivated primary) 192.168.100.6 normal abnormal(postmaster) normal
[fepuser@fep15-ha-server1: /database/inst1/log]$
Primary database log
2024-02-01 12:32:49.594 AEDT [401613] PANIC: could not write to file "pg_wal/xlogtemp.401613": No space left on device
2024-02-01 12:32:49.594 AEDT [401613] CONTEXT: writing block 4846 of relation base/5/24714
COPY pgbench_accounts, line 421000
2024-02-01 12:32:49.594 AEDT [401613] STATEMENT: copy pgbench_accounts from stdin with (freeze on)
2024-02-01 12:32:50.099 AEDT [400627] LOG: server process (PID 401613) was terminated by signal 6: Aborted
2024-02-01 12:32:50.099 AEDT [400627] DETAIL: Failed process was running: copy pgbench_accounts from stdin with (freeze on)
2024-02-01 12:32:50.099 AEDT [400627] LOG: terminating any other active server processes
2024-02-01 12:32:50.102 AEDT [400627] LOG: all server processes terminated; reinitializing
2024-02-01 12:32:50.202 AEDT [400627] LOG: auto-open keystore has been opened
2024-02-01 12:32:50.203 AEDT [401627] LOG: database system was interrupted; last known up at 2024-02-01 12:29:44 AEDT
2024-02-01 12:32:50.206 AEDT [401630] FATAL: the database system is in recovery mode
2024-02-01 12:32:50.493 AEDT [401633] FATAL: the database system is in recovery mode
2024-02-01 12:32:50.500 AEDT [401634] FATAL: the database system is in recovery mode
2024-02-01 12:32:50.507 AEDT [401635] FATAL: the database system is in recovery mode
2024-02-01 12:32:50.514 AEDT [401636] FATAL: the database system is in recovery mode
2024-02-01 12:32:51.248 AEDT [401627] LOG: database system was not properly shut down; automatic recovery in progress
2024-02-01 12:32:51.254 AEDT [401627] LOG: redo starts at 0/140000A0
2024-02-01 12:32:51.319 AEDT [401627] LOG: redo done at 0/15FFEC50 system usage: CPU: user: 0.05 s, system: 0.01 s, elapsed: 0.06 s
2024-02-01 12:32:51.323 AEDT [401639] FATAL: the database system is not yet accepting connections
2024-02-01 12:32:51.323 AEDT [401639] DETAIL: Consistent recovery state has not been yet reached.
2024-02-01 12:32:51.331 AEDT [401627] FATAL: could not write to file "pg_wal/xlogtemp.401627": No space left on device
2024-02-01 12:32:51.332 AEDT [401640] FATAL: the database system is not yet accepting connections
2024-02-01 12:32:51.332 AEDT [401640] DETAIL: Consistent recovery state has not been yet reached.
2024-02-01 12:32:51.334 AEDT [400627] LOG: startup process (PID 401627) exited with exit code 1
2024-02-01 12:32:51.334 AEDT [400627] LOG: terminating any other active server processes
2024-02-01 12:32:51.339 AEDT [401641] FATAL: the database system is in recovery mode
2024-02-01 12:32:51.341 AEDT [400627] LOG: shutting down due to startup process failure
2024-02-01 12:32:51.348 AEDT [400627] LOG: database system is shut down
Primary Mirroring Controller log
Feb 1 12:32:50 fep15-ha-server2 MirroringControllerOpen[400600]: WARNING: detected an error on the monitored object "database process(postmaster)": no response:connection to server at "::1", port 27500 failed: FATAL: the database system is in recovery mode#012 (MCA00019)
Feb 1 12:32:51 fep15-ha-server2 MirroringControllerOpen[400600]: WARNING: detected an error on the monitored object "database process(postmaster)": no response:connection to server at "::1", port 27500 failed: FATAL: the database system is in recovery mode#012 (MCA00019)
Feb 1 12:32:51 fep15-ha-server2 MirroringControllerOpen[400600]: LOG: starting to fail over from server2 to server1 (MCA00021)
Feb 1 12:32:51 fep15-ha-server2 MirroringControllerOpen[400600]: LOG: fail over completed.switched from server2 to server1 (MCA00022)
Standby Mirroring Controller log
Feb 1 12:32:51 fep15-ha-server1 MirroringControllerOpen[13212]: LOG: promotion processing completed (MCA00062)
Scenario 6: Mirroring Controler is stopped on primary using mc_ctl stop -M /mc
Result: No Failover, as expected. Arbitration_cmd and Fencing_cmd is not executed.
Primary Mirroring Controller log
Feb 1 18:05:50 fep15-ha-server1 MirroringControllerOpen[11781]: LOG: stopping Mirroring Controller (MCA00041)
Feb 1 18:05:50 fep15-ha-server1 MirroringControllerOpen[11781]: LOG: Mirroring Controller stopped target server:"server1" (MCA00042)
Standby Mirroring Controller log
No event is recorded on the standby Mirroring Controller log.
Primary database log
2024-02-01 18:05:50.163 AEDT [11687] LOG: received fast shutdown request
2024-02-01 18:05:50.169 AEDT [11687] LOG: aborting any active transactions
2024-02-01 18:05:50.170 AEDT [11696] LOG: user profile status writer shutting down
2024-02-01 18:05:50.174 AEDT [11687] LOG: background worker "user profile status writer" (PID 11696) exited with exit code 1
2024-02-01 18:05:50.174 AEDT [11687] LOG: background worker "logical replication launcher" (PID 11697) exited with exit code 1
2024-02-01 18:05:50.175 AEDT [11689] LOG: shutting down
2024-02-01 18:05:50.275 AEDT [11689] LOG: checkpoint starting: shutdown immediate
2024-02-01 18:05:50.370 AEDT [11689] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.002 s, sync=0.015 s, total=0.118 s; sync files=2, longest=0.009 s, average=0.008 s; distance=16384 kB, estimate=16384 kB
2024-02-01 18:05:50.703 AEDT [11687] LOG: database system is shut down
Scenario 7: Mirroring Controller is forcibly stopped on primary using mc_ctl stop -M /mc -e
Result: No Failover, as expected. Arbitration_cmd and Fencing_cmd is not executed. Database instance on primary is up and running.
[fepuser@fep15-ha-server1: /mc]$ mc_ctl stop -M /mc/ -e
stopping Mirroring Controller forcibly (MCA00044)
stopped Mirroring Controller forcibly (MCA00045)
[fepuser@fep15-ha-server1: /mc]$ mc_ctl status -M /mc/
mirroring status
-----------------
unknown
server_id host_role host host_status db_proc_status disk_status
-----------------------------------------------------------------------------
server1 unknown 192.168.100.5 normal unknown unknown
server2 standby 192.168.100.6 normal normal normal
[fepuser@fep15-ha-server1: /mc]$
[fepuser@fep15-ha-server1: /mc]$ pg_ctl -D /database/inst1/ status
pg_ctl: server is running (PID: 6252)
/opt/fsepv15server64/bin/postgres "-D" "/database/inst1"
[fepuser@fep15-ha-server1: /mc]$
Primary database log
No related event in the database log
Primary Mirroring Controller log
Feb 1 17:11:12 fep15-ha-server1 MirroringControllerOpen[10086]: LOG: stopping Mirroring Controller forcibly (MCA00044)
Feb 1 17:11:12 fep15-ha-server1 MirroringControllerOpen[10086]: LOG: stopped Mirroring Controller forcibly (MCA00045)
Standby Mirroring Controller log
No related event in the Mirroring Controller log.
Scenario 8: Mirroring Controller processes (mc_agent and mc_keeper) are killed and Mirroring Controller is stopped, but database instance is kept running
Result: No Failover, as expected. Arbitration_cmd and Fencing_cmd is not executed. Database instance on primary is up and running.
[fepuser@fep15-ha-server1: ~]$ date; kill -9 10992 10979
Thu Feb 1 17:42:01 AEDT 2024
[fepuser@fep15-ha-server1: ~]$
[fepuser@fep15-ha-server1: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
unknown
server_id host_role host host_status db_proc_status disk_status
-----------------------------------------------------------------------------
server1 unknown 192.168.100.5 normal unknown unknown
server2 standby 192.168.100.6 normal normal normal
[fepuser@fep15-ha-server1: ~]$
Mirroring Controller log
No event is recorded in the Mirroring Controller log of primary or standby servers.
Scenario 9: Network error: Network is disrupted between primary and standby - IP address is brought down using ifconfig eth1 down)
Note: Mirroring Controller and replication are running on the same network
Result - Server status: Failed over, as expected. Arbitration_cmd and Fencing_cmd are executed. Split-brain scenario visible.
[fsepuser@fep15-rhel8-ha-server1: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
switchable
server_id host_role host host_status db_proc_status disk_status
---------------------------------------------------------------------------
server1 primary 10.1.254.21 normal normal normal
server2 standby 10.1.254.22 normal normal normal
[fsepuser@fep15-rhel8-ha-server1: ~]$
[fsepuser@fep15-rhel8-ha-server1: ~]$ ps -ef | grep fsepuser
root 445079 1518 0 12:12 pts/0 00:00:00 su - fsepuser
fsepuser 445080 445079 0 12:12 pts/0 00:00:00 -bash
fsepuser 445260 1 0 12:12 pts/0 00:00:00 mc_keeper start -M /mc
fsepuser 445273 445260 0 12:12 pts/0 00:00:02 mc_agent start -M /mc
fsepuser 445300 1 0 12:12 ? 00:00:00 /opt/fsepv15server64/bin/postgres -D /database/inst1
fsepuser 445301 445300 0 12:12 ? 00:00:00 postgres: logger
fsepuser 445302 445300 0 12:12 ? 00:00:00 postgres: checkpointer
fsepuser 445303 445300 0 12:12 ? 00:00:00 postgres: background writer
fsepuser 445305 445300 0 12:12 ? 00:00:00 postgres: walwriter
fsepuser 445306 445300 0 12:12 ? 00:00:00 postgres: autovacuum launcher
fsepuser 445307 445300 0 12:12 ? 00:00:00 postgres: archiver last was 000000090000000000000037.00000028.backup
fsepuser 445308 445300 0 12:12 ? 00:00:00 postgres: logical replication launcher
fsepuser 445997 445300 0 12:15 ? 00:00:00 postgres: walsender repluser 10.1.0.21(60824) streaming 0/38000328
fsepuser 446907 445080 0 12:18 pts/0 00:00:00 ps -ef
fsepuser 446908 445080 0 12:18 pts/0 00:00:00 grep --color=auto fsepuser
[fsepuser@fep15-rhel8-ha-server1: ~]$
------------------------------------------------------
[fsepuser@fep15-rhel8-ha-server2: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
switchable
server_id host_role host host_status db_proc_status disk_status
---------------------------------------------------------------------------
server1 primary 10.1.254.21 normal normal normal
server2 standby 10.1.254.22 normal normal normal
[fsepuser@fep15-rhel8-ha-server2: ~]$
[fsepuser@fep15-rhel8-ha-server2: ~]$ psql
psql (15.0)
Type "help" for help.
postgres=# SHOW primary_conninfo ;
primary_conninfo
-------------------------------------------------------------------------------
user=repluser password=Secret host=10.1.0.20 port=27500 application_name=stby
(1 row)
postgres=#
[fsepuser@fep15-rhel8-ha-server2: ~]$ ps -ef | grep fsepuser
root 1556 1532 0 Feb06 pts/0 00:00:00 su - fsepuser
fsepuser 1557 1556 0 Feb06 pts/0 00:00:00 -bash
fsepuser 557223 1 0 12:15 pts/0 00:00:00 mc_keeper start -M /mc
fsepuser 557236 557223 1 12:15 pts/0 00:00:02 mc_agent start -M /mc
fsepuser 557263 1 0 12:15 ? 00:00:00 /opt/fsepv15server64/bin/postgres -D /database/inst1
fsepuser 557264 557263 0 12:15 ? 00:00:00 postgres: logger
fsepuser 557265 557263 0 12:15 ? 00:00:00 postgres: checkpointer
fsepuser 557266 557263 0 12:15 ? 00:00:00 postgres: background writer
fsepuser 557267 557263 0 12:15 ? 00:00:00 postgres: startup recovering 000000090000000000000038
fsepuser 557268 557263 0 12:15 ? 00:00:00 postgres: walreceiver streaming 0/38000328
fsepuser 558101 1557 0 12:18 pts/0 00:00:00 ps -ef
fsepuser 558102 1557 0 12:18 pts/0 00:00:00 grep --color=auto fsepuser
[fsepuser@fep15-rhel8-ha-server2: ~]$
===================================================================
[root@fep15-rhel8-ha-server1 ~]# ifconfig eth1 down
[root@fep15-rhel8-ha-server1 ~]# date
Thu Feb 8 12:19:33 AEDT 2024
[root@fep15-rhel8-ha-server1 ~]#
[fsepuser@fep15-rhel8-ha-server1: /mc]$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.1.0.20 netmask 255.255.255.0 broadcast 10.1.0.255
inet6 fe80::222:48ff:fe7e:cac7 prefixlen 64 scopeid 0x20<link&>
ether 00:22:48:7e:ca:c7 txqueuelen 1000 (Ethernet)
RX packets 465126 bytes 134442811 (128.2 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 570268 bytes 327462701 (312.2 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNINGgt; mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 1549836 bytes 161595079 (154.1 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1549836 bytes 161595079 (154.1 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
[fsepuser@fep15-rhel8-ha-server1: /mc]$
===================================================================
[fsepuser@fep15-rhel8-ha-server1: /mc]$ mc_ctl status -M /mc/
mc_ctl: The IP address or host name of the server where the command was executed in is not found on the definition file "/mc/network.conf" (MCA00051)
[fsepuser@fep15-rhel8-ha-server1: /mc]$
[fsepuser@fep15-rhel8-ha-server1: /mc]$ cat hb_chg.log
Arbitration Command executed at: Thu Feb 8 12:19:32 AEDT 2024
Fencing Command executed at: Thu Feb 8 12:19:32 AEDT 2024
[fsepuser@fep15-rhel8-ha-server1: /mc]$
------------------------------------------------
[fsepuser@fep15-rhel8-ha-server2: /mc]$ mc_ctl status -M /mc/
mirroring status
-----------------
switched
server_id host_role host host_status db_proc_status disk_status
---------------------------------------------------------------------------------
server1 unknown 10.1.254.21 abnormal unknown unknown
server2 primary 10.1.254.22 normal abnormal(wal_sender) normal
[fsepuser@fep15-rhel8-ha-server2: /mc]$ cat hb_chg.log
Arbitration Command executed at: Thu Feb 8 12:19:32 AEDT 2024
Fencing Command executed at: Thu Feb 8 12:19:32 AEDT 2024
[fsepuser@fep15-rhel8-ha-server2: /mc]$
Server1 Mirroring Controller log
Feb 8 12:19:28 fep15-rhel8-ha-server1 MirroringControllerOpen[445273]: WARNING: detected an error on the monitored object "server(server2)": no response:ping timeout (MCA00019)
Feb 8 12:19:30 fep15-rhel8-ha-server1 MirroringControllerOpen[445273]: WARNING: detected an error on the monitored object "server(server2)": no response:ping timeout (MCA00019)
Feb 8 12:19:32 fep15-rhel8-ha-server1 MirroringControllerOpen[445273]: WARNING: detected an error on the monitored object "server(server2)": no response:ping timeout (MCA00019)
Feb 8 12:19:32 fep15-rhel8-ha-server1 MirroringControllerOpen[445273]: LOG: executing arbitration command and inquiring the result of arbitration (MCA00191)
Feb 8 12:19:32 fep15-rhel8-ha-server1 MirroringControllerOpen[445273]: LOG: starting degenerate because of result of arbitration command for database server "server2" result:"0" (MCA00193)
Feb 8 12:19:32 fep15-rhel8-ha-server1 MirroringControllerOpen[445273]: LOG: starting to detach standby server "server2" automatically (MCA00024)
Feb 8 12:19:32 fep15-rhel8-ha-server1 MirroringControllerOpen[445273]: LOG: executing fencing command (MCA00192)
Feb 8 12:19:32 fep15-rhel8-ha-server1 MirroringControllerOpen[445273]: LOG: fencing command for database server "server2" succeeded: result:"0" (MCA00198)
Feb 8 12:19:32 fep15-rhel8-ha-server1 MirroringControllerOpen[445273]: LOG: detach standby server "server2" completed automatically (MCA00025)
Feb 8 12:20:47 fep15-rhel8-ha-server1 MirroringControllerOpen[445273]: WARNING: detected an error on the monitored object "database process(postmaster)": no response:connection to server at "::1", port 27500 failed: fe_sendauth: no password supplied#012 (MCA00019)
Feb 8 12:21:09 fep15-rhel8-ha-server1 MirroringControllerOpen[447660]: ERROR: mc_ctl: The IP address or host name of the server where the command was executed in is not found on the definition file "/mc/network.conf" (MCA00051)
Server2 Mirroring Controller log
Feb 8 12:19:28 fep15-rhel8-ha-server2 MirroringControllerOpen[557236]: WARNING: detected an error on the monitored object "server(server1)": no response:ping timeout (MCA00019)
Feb 8 12:19:30 fep15-rhel8-ha-server2 MirroringControllerOpen[557236]: WARNING: detected an error on the monitored object "server(server1)": no response:ping timeout (MCA00019)
Feb 8 12:19:32 fep15-rhel8-ha-server2 MirroringControllerOpen[557236]: WARNING: detected an error on the monitored object "server(server1)": no response:ping timeout (MCA00019)
Feb 8 12:19:32 fep15-rhel8-ha-server2 MirroringControllerOpen[557236]: LOG: executing arbitration command and inquiring the result of arbitration (MCA00191)
Feb 8 12:19:32 fep15-rhel8-ha-server2 MirroringControllerOpen[557236]: LOG: starting degenerate because of result of arbitration command for database server "server1" result:"0" (MCA00193)
Feb 8 12:19:32 fep15-rhel8-ha-server2 MirroringControllerOpen[557236]: LOG: executing fencing command (MCA00192)
Feb 8 12:19:32 fep15-rhel8-ha-server2 MirroringControllerOpen[557236]: LOG: fencing command for database server "server1" succeeded: result:"0" (MCA00198)
Feb 8 12:19:32 fep15-rhel8-ha-server2 MirroringControllerOpen[557236]: LOG: promotion processing completed (MCA00062)
Primary database log
2024-02-08 12:19:32.252 AEDT [445997] FATAL: terminating connection due to administrator command
2024-02-08 12:19:32.252 AEDT [445997] STATEMENT: START_REPLICATION SLOT "repl_slot1" 0/38000000 TIMELINE 9 COMPRESS 0
2024-02-08 12:19:32.264 AEDT [445300] LOG: received SIGHUP, reloading configuration files
2024-02-08 12:19:32.264 AEDT [445300] LOG: parameter "synchronous_standby_names" removed from configuration file, reset to default
Standby database log
2024-02-08 12:19:32.252 AEDT [557268] FATAL: could not receive data from WAL stream: FATAL: terminating connection due to administrator command
2024-02-08 12:19:32.252 AEDT [557267] LOG: invalid record length at 0/38000410: wanted 24, got 0
2024-02-08 12:19:32.279 AEDT [558371] LOG: started streaming WAL from primary at 0/38000000 on timeline 9
2024-02-08 12:19:32.495 AEDT [557267] LOG: received promote request
2024-02-08 12:19:32.495 AEDT [558371] FATAL: terminating walreceiver process due to administrator command
2024-02-08 12:19:32.495 AEDT [557267] LOG: waiting for WAL to become available at 0/38000428
2024-02-08 12:19:32.495 AEDT [557267] LOG: redo done at 0/380003D8 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 247.21 s
2024-02-08 12:19:32.495 AEDT [557267] LOG: last completed transaction was at log time 2024-02-08 12:16:30.382893+11
2024-02-08 12:19:32.502 AEDT [557267] LOG: selected new timeline ID: 10
2024-02-08 12:19:32.578 AEDT [557267] LOG: archive recovery complete
2024-02-08 12:19:32.594 AEDT [557265] LOG: checkpoint starting: force
2024-02-08 12:19:32.597 AEDT [557263] LOG: database system is ready to accept connections
2024-02-08 12:19:32.606 AEDT [557263] LOG: received SIGHUP, reloading configuration files
2024-02-08 12:19:32.723 AEDT [557265] LOG: checkpoint complete: wrote 5 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.105 s, sync=0.007 s, total=0.130 s; sync files=4, longest=0.003 s, average=0.002 s; distance=16385 kB, estimate=16385 kB
===================================================================
[root@fep15-rhel8-ha-server1 ~]# ifconfig eth1 up
[root@fep15-rhel8-ha-server1 ~]# date
Thu Feb 8 12:34:06 AEDT 2024
[root@fep15-rhel8-ha-server1 ~]#
[root@fep15-rhel8-ha-server1 ~]# su - fsepuser
Last login: Thu Feb 8 12:20:36 AEDT 2024 on pts/0
[fsepuser@fep15-rhel8-ha-server1: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
not-switchable
server_id host_role host host_status db_proc_status disk_status
---------------------------------------------------------------------------------
server1 primary 10.1.254.21 normal abnormal(wal_sender) normal
server2 primary 10.1.254.22 normal abnormal(wal_sender) normal
[fsepuser@fep15-rhel8-ha-server1: ~]$
Scenario 10: This scenario is not applicable
This scenario is not applicable, as only one network is configurable, when there is no arbitration server.
Test results with Mirroring Controller and arbitration server
Scenario 1: Primary is stopped; Linux OS is shut down
Result - Server status: Failed over, as expected. Fencing_cmd is executed.
[fsepuser@fep15-rhel8-ha-server2: /mc]$ mc_ctl status -M /mc/
mirroring status
-----------------
switched
server_id host_role host host_status db_proc_status disk_status
-------------------------------------------------------------------------------
server1 unknown 10.1.0.20 abnormal unknown unknown
server2 primary 10.1.0.21 normal abnormal(wal_sender) normal
[fsepuser@fep15-rhel8-ha-server2: /mc]$
[fsepuser@fep15-rhel8-ha-arbiter: ~]$ cat /mc/hb_chg.log
Fencing Command executed at: Mon Feb 5 01:21:13 UTC 2024
[fsepuser@fep15-rhel8-ha-arbiter: ~]$
Standby (new primary) Mirroring Controller log
Feb 5 01:21:08 fep15-rhel8-ha-server2 MirroringControllerOpen[1293691]: WARNING: detected an error on the monitored object "server(server1)": no response:ping timeout (MCA00019)
Feb 5 01:21:10 fep15-rhel8-ha-server2 MirroringControllerOpen[1293691]: WARNING: detected an error on the monitored object "server(server1)": no response:ping timeout (MCA00019)
Feb 5 01:21:12 fep15-rhel8-ha-server2 MirroringControllerOpen[1293691]: WARNING: detected an error on the monitored object "server(server1)": no response:ping timeout (MCA00019)
Feb 5 01:21:12 fep15-rhel8-ha-server2 MirroringControllerOpen[1293691]: LOG: requesting arbitration server "arbiter" to arbitrate for database server "server1" (MCA00159)
Feb 5 01:21:13 fep15-rhel8-ha-server2 MirroringControllerOpen[1293691]: LOG: the arbitration server "arbiter" didn't arbitrate status of database server "server1" as sanity (MCA00197)
Feb 5 01:21:13 fep15-rhel8-ha-server2 MirroringControllerOpen[1293691]: LOG: promotion processing completed (MCA00062)
Arbiter Mirroring Controller log
Feb 5 01:21:04 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: LOG: disconnected from the database server "server1" (MCR00076)
Feb 5 01:21:05 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: LOG: stop monitoring of database server "server1" (MCR00098)
Feb 5 01:21:09 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: WARNING: detected an error on database server "server1" (MCR00099)
Feb 5 01:21:11 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: WARNING: detected an error on database server "server1" (MCR00099)
Feb 5 01:21:12 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: LOG: database server "server2" requested to arbitrate for database server "server1" status (MCR00056)
Feb 5 01:21:12 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: LOG: arbitrating the status of database server "server1" (MCR00057)
Feb 5 01:21:13 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: WARNING: detected an error on database server "server1" (MCR00099)
Feb 5 01:21:13 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: LOG: could not receive response from database server server1 (MCR00059)
Feb 5 01:21:13 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: LOG: executing fencing command (MCR00062)
Feb 5 01:21:13 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: LOG: fencing command for database server "server1" succeeded: result:"0" (MCR00063)
Standby (new primary) database log
2024-02-05 12:21:03.896 AEDT [1293726] FATAL: could not receive data from WAL stream: FATAL: terminating connection due to administrator command
2024-02-05 12:21:03.897 AEDT [1293725] LOG: invalid record length at 0/1A0003D8: wanted 24, got 0
2024-02-05 12:21:03.977 AEDT [1295256] FATAL: could not connect to the primary server: connection to server at "10.1.254.21", port 27500 failed: FATAL: the database system is shutting down
2024-02-05 12:21:03.977 AEDT [1293725] LOG: waiting for WAL to become available at 0/1A0003F0
2024-02-05 12:21:13.078 AEDT [1293725] LOG: received promote request
2024-02-05 12:21:13.078 AEDT [1295284] FATAL: terminating walreceiver process due to administrator command
2024-02-05 12:21:13.078 AEDT [1293725] LOG: waiting for WAL to become available at 0/1A0003F0
2024-02-05 12:21:13.078 AEDT [1293725] LOG: redo done at 0/1A0003A0 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 327.01 s
2024-02-05 12:21:13.078 AEDT [1293725] LOG: last completed transaction was at log time 2024-02-05 12:17:26.334037+11
2024-02-05 12:21:13.084 AEDT [1293725] LOG: selected new timeline ID: 4
2024-02-05 12:21:13.181 AEDT [1293725] LOG: archive recovery complete
2024-02-05 12:21:13.220 AEDT [1293723] LOG: checkpoint starting: force
2024-02-05 12:21:13.224 AEDT [1293721] LOG: database system is ready to accept connections
2024-02-05 12:21:13.247 AEDT [1293723] LOG: checkpoint complete: wrote 2 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.006 s, total=0.028 s; sync files=2, longest=0.004 s, average=0.003 s; distance=0 kB, estimate=14745 kB
2024-02-05 12:21:13.285 AEDT [1293721] LOG: received SIGHUP, reloading configuration files
Scenario 2: Primary database instance is brought down using pg_ctl stop
Result - Server status: Failed over, as expected.
[fsepuser@fep15-rhel8-ha-server2: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
switched
server_id host_role host host_status db_proc_status disk_status
-----------------------------------------------------------------------------------------------
server1 primary 10.1.0.20 normal abnormal(wal_sender) normal
server2 none(inactivated primary) 10.1.0.21 normal abnormal(postmaster) normal
[fsepuser@fep15-rhel8-ha-server2: ~]$
Primary database log
2024-02-05 12:44:10.521 AEDT [1293721] LOG: received fast shutdown request
2024-02-05 12:44:10.528 AEDT [1293721] LOG: aborting any active transactions
2024-02-05 12:44:10.533 AEDT [1293721] LOG: background worker "logical replication launcher" (PID 1295332) exited with exit code 1
2024-02-05 12:44:10.534 AEDT [1293723] LOG: shutting down
2024-02-05 12:44:10.667 AEDT [1293723] LOG: checkpoint starting: shutdown immediate
2024-02-05 12:44:10.718 AEDT [1293723] LOG: checkpoint complete: wrote 5 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.018 s, total=0.055 s; sync files=5, longest=0.016 s, average=0.004 s; distance=16383 kB, estimate=16384 kB
2024-02-05 12:44:10.805 AEDT [1301779] FATAL: the database system is shutting down
2024-02-05 12:44:10.811 AEDT [1301780] FATAL: the database system is shutting down
2024-02-05 12:44:10.816 AEDT [1301781] FATAL: the database system is shutting down
2024-02-05 12:44:10.822 AEDT [1301782] FATAL: the database system is shutting down
2024-02-05 12:44:10.935 AEDT [1293721] LOG: database system is shut down
Standby (New Primary) log
2024-02-05 12:44:10.931 AEDT [6723] LOG: replication terminated by primary server
2024-02-05 12:44:10.931 AEDT [6723] DETAIL: End of WAL reached on timeline 4 at 0/1D0000A0.
2024-02-05 12:44:10.931 AEDT [6723] FATAL: could not send end-of-streaming message to primary: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
no COPY in progress
2024-02-05 12:44:10.932 AEDT [6722] LOG: invalid record length at 0/1D0000A0: wanted 24, got 0
2024-02-05 12:44:10.940 AEDT [7336] FATAL: could not connect to the primary server: connection to server at "10.1.0.21", port 27500 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
2024-02-05 12:44:10.940 AEDT [6722] LOG: waiting for WAL to become available at 0/1D0000B8
2024-02-05 12:44:11.651 AEDT [6722] LOG: received promote request
2024-02-05 12:44:11.651 AEDT [6722] LOG: redo done at 0/1D000028 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 138.53 s
2024-02-05 12:44:11.651 AEDT [6722] LOG: last completed transaction was at log time 2024-02-05 12:42:56.459462+11
2024-02-05 12:44:11.657 AEDT [6722] LOG: selected new timeline ID: 5
2024-02-05 12:44:11.789 AEDT [6722] LOG: archive recovery complete
2024-02-05 12:44:11.825 AEDT [6720] LOG: checkpoint starting: force
2024-02-05 12:44:11.830 AEDT [6718] LOG: database system is ready to accept connections
2024-02-05 12:44:11.857 AEDT [6718] LOG: received SIGHUP, reloading configuration files
2024-02-05 12:44:11.980 AEDT [6720] LOG: checkpoint complete: wrote 5 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.125 s, sync=0.013 s, total=0.155 s; sync files=4, longest=0.006 s, average=0.004 s; distance=16384 kB, estimate=16384 kB
Primary Mirroring Controller log
Feb 5 01:44:10 fep15-rhel8-ha-server2 MirroringControllerOpen[1293691]: WARNING: detected an error on the monitored object "database process(postmaster)": no response:connection to server at "::1", port 27500 failed: FATAL: the database system is shutting down#012 (MCA00019)
Feb 5 01:44:11 fep15-rhel8-ha-server2 MirroringControllerOpen[1293691]: WARNING: detected an error on the monitored object "database process(postmaster)": no response:connection to server at "::1", port 27500 failed: Connection refused#012#011Is the server running on that host and accepting TCP/IP connections?#012 (MCA00019)
Feb 5 01:44:11 fep15-rhel8-ha-server2 MirroringControllerOpen[1293691]: LOG: starting to fail over from server2 to server1 (MCA00021)
Feb 5 01:44:11 fep15-rhel8-ha-server2 MirroringControllerOpen[1293691]: LOG: fail over completed.switched from server2 to server1 (MCA00022)
Standby Mirroring Controller log
Feb 5 01:44:11 fep15-rhel8-ha-server1 MirroringControllerOpen[6688]: LOG: promotion processing completed (MCA00062)
Arbiter log
No event is recorded in the arbiter Mirroring Controller log
Scenario 3: Postmaster process is killed on primary
Result - Server status: Failed over, as expected
[fsepuser@fep15-rhel8-ha-server2: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
switched
server_id host_role host host_status db_proc_status disk_status
-----------------------------------------------------------------------------------------------
server1 none(inactivated primary) 10.1.0.20 normal abnormal(postmaster) normal
server2 primary 10.1.0.21 normal abnormal(wal_sender) normal
[fsepuser@fep15-rhel8-ha-server2: ~]$
Primary Mirroring Controller log
Feb 5 01:58:32 fep15-rhel8-ha-server1 MirroringControllerOpen[6688]: WARNING: detected an error on the monitored object "database process(postmaster)": no response:connection to server at "::1", port 27500 failed: Connection refused#012#011Is the server running on that host and accepting TCP/IP connections?#012 (MCA00019)
Feb 5 01:58:32 fep15-rhel8-ha-server1 MirroringControllerOpen[6688]: LOG: starting to fail over from server1 to server2 (MCA00021)
Feb 5 01:58:32 fep15-rhel8-ha-server1 MirroringControllerOpen[6688]: LOG: fail over completed.switched from server1 to server2 (MCA00022)
Standby Mirroring Controller log
Feb 5 01:58:32 fep15-rhel8-ha-server2 MirroringControllerOpen[1302592]: LOG: promotion processing completed (MCA00062)
Arbiter log
No event is recorded in the arbiter Mirroring Controller log
Scenario 4: Storage error - Data Storage full on primary
Result - Server status: Failed over, as expected.
Test case scenario execution log
[fsepuser@fep15-rhel8-ha-server1: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
switched
server_id host_role host host_status db_proc_status disk_status
-----------------------------------------------------------------------------------------------------------
server1 primary 10.1.0.20 normal abnormal(wal_sender) normal
server2 none(inactivated primary) 10.1.0.21 normal abnormal(postmaster) abnormal(data,tran_log)
[fsepuser@fep15-rhel8-ha-server1: ~]$
Primary Mirroring Controller log
Feb 5 02:22:54 fep15-rhel8-ha-server2 MirroringControllerOpen[1302592]: WARNING: detected an error on the monitored object "data storage destination directory(/database/inst1)": read/write error:exception=class java.io.IOException message=No space left on device (MCA00019)
Feb 5 02:22:54 fep15-rhel8-ha-server2 MirroringControllerOpen[1302592]: WARNING: detected an error on the monitored object "transaction log storage destination directory(/database/inst1/pg_wal)": read/write error:exception=class java.io.IOException message=No space left on device (MCA00019)
Feb 5 02:22:55 fep15-rhel8-ha-server2 MirroringControllerOpen[1302592]: WARNING: detected an error on the monitored object "data storage destination directory(/database/inst1)": read/write error:exception=class java.io.IOException message=No space left on device (MCA00019)
Feb 5 02:22:55 fep15-rhel8-ha-server2 MirroringControllerOpen[1302592]: WARNING: detected an error on the monitored object "transaction log storage destination directory(/database/inst1/pg_wal)": read/write error:exception=class java.io.IOException message=No space left on device (MCA00019)
Feb 5 02:22:55 fep15-rhel8-ha-server2 MirroringControllerOpen[1302592]: LOG: starting to fail over from server2 to server1 (MCA00021)
Feb 5 02:22:55 fep15-rhel8-ha-server2 MirroringControllerOpen[1302592]: LOG: stopped database instance forcibly (MCA00064)
Feb 5 02:22:55 fep15-rhel8-ha-server2 rsyslogd[1051]: imjournal: fopen() failed for path: '/var/lib/rsyslog/imjournal.state.tmp': No space left on device [v8.2102.0-5.el8 try https://www.rsyslog.com/e/2013 ]
Feb 5 02:22:55 fep15-rhel8-ha-server2 MirroringControllerOpen[1302592]: LOG: fail over completed.switched from server2 to server1 (MCA00022)
Standby (new primary) Mirroring Controller log
Feb 5 02:22:55 fep15-rhel8-ha-server1 MirroringControllerOpen[11773]: LOG: promotion processing completed (MCA00062)
Arbiter log
No event is recorded in the arbiter Mirroring Controller log
Primary database log
2024-02-05 13:22:55.591 AEDT [1302622] LOG: received immediate shutdown request
2024-02-05 13:22:55.637 AEDT [1302622] LOG: database system is shut down
Standby (New primary) log
2024-02-05 13:22:55.628 AEDT [11808] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2024-02-05 13:22:55.629 AEDT [11807] LOG: invalid record length at 0/21000148: wanted 24, got 0
2024-02-05 13:22:55.643 AEDT [15753] FATAL: could not connect to the primary server: connection to server at "10.1.0.21", port 27500 failed: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2024-02-05 13:22:55.643 AEDT [11807] LOG: waiting for WAL to become available at 0/21000160
2024-02-05 13:22:55.700 AEDT [11807] LOG: received promote request
2024-02-05 13:22:55.700 AEDT [11807] LOG: redo done at 0/21000110 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 853.66 s
2024-02-05 13:22:55.703 AEDT [11807] LOG: selected new timeline ID: 7
2024-02-05 13:22:55.795 AEDT [11807] LOG: archive recovery complete
2024-02-05 13:22:55.811 AEDT [11805] LOG: checkpoint starting: force
2024-02-05 13:22:55.814 AEDT [11803] LOG: database system is ready to accept connections
2024-02-05 13:22:55.845 AEDT [11805] LOG: checkpoint complete: wrote 2 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.009 s, total=0.035 s; sync files=2, longest=0.005 s, average=0.005 s; distance=0 kB, estimate=14745 kB
2024-02-05 13:22:55.909 AEDT [11803] LOG: received SIGHUP, reloading configuration files
Scenario 5: Storage error - WAL directory full on primary
Result - Server status: Failed over, as expected.
[fsepuser@fep15-rhel8-ha-server1: ~]$ df -h /mnt/resource/pg_wal/
Filesystem Size Used Avail Use% Mounted on
/dev/sdb1 7.9G 7.4G 0 100% /mnt/resource
[fsepuser@fep15-rhel8-ha-server1: ~]$
[fsepuser@fep15-rhel8-ha-server1: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
switched
server_id host_role host host_status db_proc_status disk_status
-----------------------------------------------------------------------------------------------
server1 none(inactivated primary) 10.1.0.20 normal abnormal(postmaster) normal
server2 primary 10.1.0.21 normal abnormal(wal_sender) normal
[fsepuser@fep15-rhel8-ha-server1: ~]$
Primary database log
2024-02-05 15:13:30.093 AEDT [21638] PANIC: could not write to file "pg_wal/xlogtemp.21638": No space left on device
2024-02-05 15:13:30.093 AEDT [21638] CONTEXT: writing block 6186 of relation base/5/16409
COPY pgbench_accounts, line 503000
2024-02-05 15:13:30.093 AEDT [21638] STATEMENT: copy pgbench_accounts from stdin with (freeze on)
2024-02-05 15:13:32.410 AEDT [19756] LOG: server process (PID 21638) was terminated by signal 6: Aborted
2024-02-05 15:13:32.410 AEDT [19756] DETAIL: Failed process was running: copy pgbench_accounts from stdin with (freeze on)
2024-02-05 15:13:32.410 AEDT [19756] LOG: terminating any other active server processes
2024-02-05 15:13:32.414 AEDT [19756] LOG: all server processes terminated; reinitializing
2024-02-05 15:13:33.285 AEDT [21674] LOG: database system was interrupted; last known up at 2024-02-05 15:13:26 AEDT
2024-02-05 15:13:33.288 AEDT [21678] FATAL: the database system is in recovery mode
2024-02-05 15:13:33.295 AEDT [21677] FATAL: the database system is in recovery mode
2024-02-05 15:13:33.298 AEDT [21679] FATAL: the database system is in recovery mode
2024-02-05 15:13:33.302 AEDT [21680] FATAL: the database system is in recovery mode
2024-02-05 15:13:33.308 AEDT [21681] FATAL: the database system is in recovery mode
2024-02-05 15:13:33.634 AEDT [21674] LOG: database system was not properly shut down; automatic recovery in progress
2024-02-05 15:13:33.640 AEDT [21674] LOG: redo starts at 0/25000338
2024-02-05 15:13:33.770 AEDT [21674] LOG: missing contrecord at 0/27FFEDF0
2024-02-05 15:13:33.770 AEDT [21674] LOG: redo done at 0/27FFEDB0 system usage: CPU: user: 0.06 s, system: 0.06 s, elapsed: 0.13 s
2024-02-05 15:13:33.772 AEDT [21674] FATAL: could not write to file "pg_wal/xlogtemp.21674": No space left on device
2024-02-05 15:13:33.778 AEDT [19756] LOG: startup process (PID 21674) exited with exit code 1
2024-02-05 15:13:33.778 AEDT [19756] LOG: terminating any other active server processes
2024-02-05 15:13:33.779 AEDT [19756] LOG: shutting down due to startup process failure 2024-02-05 15:13:33.788 AEDT [19756] LOG: database system is shut down
Primary Mirroring Controller log
Feb 5 04:13:33 fep15-rhel8-ha-server1 MirroringControllerOpen[19726]: WARNING: detected an error on the monitored object "database process(postmaster)": no response:connection to server at "::1", port 27500 failed: FATAL: the database system is in recovery mode#012 (MCA00019)
Feb 5 04:13:34 fep15-rhel8-ha-server1 MirroringControllerOpen[19726]: WARNING: detected an error on the monitored object "database process(postmaster)": no response:connection to server at "::1", port 27500 failed: Connection refused#012#011Is the server running on that host and accepting TCP/IP connections?#012 (MCA00019)
Feb 5 04:13:34 fep15-rhel8-ha-server1 MirroringControllerOpen[19726]: LOG: starting to fail over from server1 to server2 (MCA00021)
Feb 5 04:13:34 fep15-rhel8-ha-server1 MirroringControllerOpen[19726]: LOG: fail over completed.switched from server1 to server2 (MCA00022)
Standby Mirroring Controller log
Feb 5 04:13:34 fep15-rhel8-ha-server2 MirroringControllerOpen[1310904]: LOG: promotion processing completed (MCA00062)
Arbiter log
No event is recorded in the arbiter Mirroring Controller log
Scenario 6: Mirroring Controler is stopped on primary using mc_ctl stop -M /mc
Result - Server status: No Failover
[fsepuser@fep15-rhel8-ha-server2: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
unknown
server_id host_role host host_status db_proc_status disk_status
---------------------------------------------------------------------------------
server1 standby 10.1.0.20 normal abnormal(wal_receiver) normal
server2 unknown 10.1.0.21 normal unknown unknown
[fsepuser@fep15-rhel8-ha-server2: ~]$
Primary Mirroring Controller log
Feb 5 04:52:38 fep15-rhel8-ha-server2 MirroringControllerOpen[1346327]: LOG: stopping Mirroring Controller (MCA00041)
Feb 5 04:52:38 fep15-rhel8-ha-server2 MirroringControllerOpen[1310904]: LOG: disconnected from arbitration server "arbiter" (MCA00153)
Feb 5 04:52:38 fep15-rhel8-ha-server2 MirroringControllerOpen[1346327]: LOG: Mirroring Controller stopped target server:"server2" (MCA00042)
Standby Mirroring Controller log
No event is recorded on the standby Mirroring Controller log.
Arbiter log
Feb 5 04:52:38 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: LOG: disconnected from the database server "server2" (MCR00076)
Feb 5 04:52:38 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: LOG: stop monitoring of database server "server2" (MCR00098)
Primary database log
2024-02-05 15:52:38.299 AEDT [1310934] LOG: received fast shutdown request
2024-02-05 15:52:38.303 AEDT [1310934] LOG: aborting any active transactions
2024-02-05 15:52:38.307 AEDT [1310934] LOG: background worker "logical replication launcher" (PID 1335548) exited with exit code 1
2024-02-05 15:52:38.309 AEDT [1310936] LOG: shutting down
2024-02-05 15:52:38.414 AEDT [1310936] LOG: checkpoint starting: shutdown immediate
2024-02-05 15:52:38.443 AEDT [1310936] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.001 s, total=0.034 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16383 kB, estimate=33059 kB
2024-02-05 15:52:38.706 AEDT [1310934] LOG: database system is shut down
Scenario 7: Mirroring Controller is forcibly stopped on primary using mc_ctl stop -M /mc -e
Result - Database instance on primary is up and running without any event logged for Mirroring Controller crash. Server status: No failover
[fsepuser@fep15-rhel8-ha-server2: ~]$ mc_ctl stop -M /mc/ -e
stopping Mirroring Controller forcibly (MCA00044)
stopped Mirroring Controller forcibly (MCA00045)
[fsepuser@fep15-rhel8-ha-server2: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
unknown
server_id host_role host host_status db_proc_status disk_status
----------------------------------------------------------------------
server1 standby 10.1.0.20 normal normal normal
server2 unknown 10.1.0.21 normal unknown unknown
[fsepuser@fep15-rhel8-ha-server2: ~]$
[fsepuser@fep15-rhel8-ha-server2: ~]$ pg_ctl -D /database/inst1/ status
pg_ctl: server is running (PID: 1346548)
/opt/fsepv15server64/bin/postgres "-D" "/database/inst1"
[fsepuser@fep15-rhel8-ha-server2: ~]$
Primary database log
No related event in the database log
Primary Mirroring Controller log
Feb 5 05:00:23 fep15-rhel8-ha-server2 MirroringControllerOpen[1346890]: LOG: stopping Mirroring Controller forcibly (MCA00044)
Feb 5 05:00:23 fep15-rhel8-ha-server2 MirroringControllerOpen[1346890]: LOG: stopped Mirroring Controller forcibly (MCA00045)
Standby Mirroring Controller log
No related event in the Mirroring Controller log.
Arbiter log
Feb 5 05:00:23 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: LOG: disconnected from the database server "server2" (MCR00076)
Scenario 8: Mirroring Controller processes (mc_agent and mc_keeper) are killed and Mirroring Controller is stopped, but database instance is kept running
Result - Database instance on primary is up and running without any event logged for Mirroring Controller crash. Server status: No Failover
[fsepuser@fep15-rhel8-ha-server2: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
unknown
server_id host_role host host_status db_proc_status disk_status
-------------------------------------------------------------------------
server1 standby 10.1.0.20 normal normal normal
server2 unknown 10.1.0.21 normal unknown unknown
[fsepuser@fep15-rhel8-ha-server2: ~]$
[fsepuser@fep15-rhel8-ha-server1: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
unknown
server_id host_role host host_status db_proc_status disk_status
-------------------------------------------------------------------------
server1 standby 10.1.0.20 normal normal normal
server2 unknown 10.1.0.21 normal unknown unknown
[fsepuser@fep15-rhel8-ha-server1: ~]$
Mirroring Controller log
No event is recorded in the Mirroring Controller log of primary or standby servers.
Arbiter log
Feb 5 05:37:40 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: LOG: disconnected from the database server "server2" (MCR00076)
Scenario 9: Network error: Network is disrupted between primary and standby - IP address is brought down using ifconfig eth1 down)
Note: Admin and Arbitration are running on separate networks
Result - Network error: Network disrupted (IP address is taken down using ‘ifconfig eth1 down’). Server status: Failed over. Fencing_cmd is executed. Split-brain scenario is prevented.
[root@fep15-rhel8-ha-server1 ~]# ifconfig eth1 down
[root@fep15-rhel8-ha-server1 ~]#
[fsepuser@fep15-rhel8-ha-server2: /mc]$ mc_ctl status -M /mc/
mirroring status
-----------------
switched
server_id host_role host host_status db_proc_status disk_status
---------------------------------------------------------------------------------
server1 unknown 10.1.254.21 abnormal unknown unknown
server2 primary 10.1.254.22 normal abnormal(wal_sender) normal
[fsepuser@fep15-rhel8-ha-server2: /mc]$
[fsepuser@fep15-rhel8-ha-server1: ~]$ mc_ctl status -M /mc/ --arbiter
arbiter_id host status
-------------------------------------
arbiter 10.1.254.23 online
[fsepuser@fep15-rhel8-ha-server1: ~]$
[fsepuser@fep15-rhel8-ha-server1: ~]$ mc_ctl status -M /mc/
mc_ctl: The IP address or host name of the server where the command was executed in is not found on the definition file "/mc/network.conf" (MCA00051)
[fsepuser@fep15-rhel8-ha-server1: ~]$
Primary Mirroring Controller log
Feb 8 11:29:41 fep15-rhel8-ha-server1 MirroringControllerOpen[432697]: WARNING: detected an error on the monitored object "server(server2)": no response:ping timeout (MCA00019)
Feb 8 11:29:43 fep15-rhel8-ha-server1 MirroringControllerOpen[432697]: WARNING: detected an error on the monitored object "server(server2)": no response:ping timeout (MCA00019)
Feb 8 11:29:45 fep15-rhel8-ha-server1 MirroringControllerOpen[432697]: WARNING: detected an error on the monitored object "server(server2)": no response:ping timeout (MCA00019)
Feb 8 11:29:45 fep15-rhel8-ha-server1 MirroringControllerOpen[432697]: LOG: executing arbitration command and inquiring the result of arbitration (MCA00191)
Feb 8 11:29:45 fep15-rhel8-ha-server1 MirroringControllerOpen[432697]: LOG: starting degenerate because of result of arbitration command for database server "server2" result:"0" (MCA00193)
Feb 8 11:29:45 fep15-rhel8-ha-server1 MirroringControllerOpen[432697]: LOG: starting to detach standby server "server2" automatically (MCA00024)
Feb 8 11:29:45 fep15-rhel8-ha-server1 MirroringControllerOpen[432697]: LOG: executing fencing command (MCA00192)
Feb 8 11:29:45 fep15-rhel8-ha-server1 MirroringControllerOpen[432697]: LOG: fencing command for database server "server2" succeeded: result:"0" (MCA00198)
Feb 8 11:29:45 fep15-rhel8-ha-server1 MirroringControllerOpen[432697]: LOG: detach standby server "server2" completed automatically (MCA00025)
Feb 8 11:30:27 fep15-rhel8-ha-server1 MirroringControllerOpen[433813]: ERROR: mc_ctl: The IP address or host name of the server where the command was executed in is not found on the definition file "/mc/network.conf" (MCA00051)
Standby Mirroring Controller log
Feb 8 11:29:42 fep15-rhel8-ha-server2 MirroringControllerOpen[544659]: WARNING: detected an error on the monitored object "server(server1)": no response:ping timeout (MCA00019)
Feb 8 11:29:44 fep15-rhel8-ha-server2 MirroringControllerOpen[544659]: WARNING: detected an error on the monitored object "server(server1)": no response:ping timeout (MCA00019)
Feb 8 11:29:46 fep15-rhel8-ha-server2 MirroringControllerOpen[544659]: WARNING: detected an error on the monitored object "server(server1)": no response:ping timeout (MCA00019)
Feb 8 11:29:46 fep15-rhel8-ha-server2 MirroringControllerOpen[544659]: LOG: executing arbitration command and inquiring the result of arbitration (MCA00191)
Feb 8 11:29:46 fep15-rhel8-ha-server2 MirroringControllerOpen[544659]: LOG: starting degenerate because of result of arbitration command for database server "server1" result:"0" (MCA00193)
Feb 8 11:29:46 fep15-rhel8-ha-server2 MirroringControllerOpen[544659]: LOG: executing fencing command (MCA00192)
Feb 8 11:29:46 fep15-rhel8-ha-server2 MirroringControllerOpen[544659]: LOG: fencing command for database server "server1" succeeded: result:"0" (MCA00198)
Feb 8 11:29:46 fep15-rhel8-ha-server2 MirroringControllerOpen[544659]: LOG: promotion processing completed (MCA00062)
Scenario 10: Network error: Streaming network is disrupted; bit streaming replication is configured on second network and is disrupted
Note: Mirroring Controller and replication are running on different networks.
Result - Network error: Arbitration network is disrupted. Arbitration network is configured on second network and is disrupted. No failover. Fencing script is not executed. Server status: No Failover
[root@fep15-rhel8-ha-server2 ~]# date; ifconfig eth1 down
Mon Feb 5 17:17:14 AEDT 2024
[root@fep15-rhel8-ha-server2 ~]#
[fsepuser@fep15-rhel8-ha-server2: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
switchable
server_id host_role host host_status db_proc_status disk_status
-------------------------------------------------------------------------
server1 standby 10.1.0.20 normal normal normal
server2 primary 10.1.0.21 normal normal normal
[fsepuser@fep15-rhel8-ha-server2: ~]$
[fsepuser@fep15-rhel8-ha-server1: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
switchable
server_id host_role host host_status db_proc_status disk_status
-------------------------------------------------------------------------
server1 standby 10.1.0.20 normal normal normal
server2 primary 10.1.0.21 normal normal normal
[fsepuser@fep15-rhel8-ha-server1: ~]$
[fsepuser@fep15-rhel8-ha-arbiter: ~]$ mc_arb status -M /mc
server_id host status
-------------------------------------
server1 10.1.254.21 online
server2 10.1.254.22 offline
[fsepuser@fep15-rhel8-ha-arbiter: ~]$
Primary Mirroring Controller log
Feb 5 06:17:39 fep15-rhel8-ha-server2 MirroringControllerOpen[1354449]: ERROR: timeout waiting for communication with Mirroring Controller Arbitration process server:"arbiter" (MCA00154) Feb 5 06:17:39 fep15-rhel8-ha-server2 MirroringControllerOpen[1354449]: LOG: trying to connect to arbitration server "arbiter" (MCA00150)
Standby database log
No event recorded in the standby Mirroring Controller log.
Arbiter log
Feb 5 06:17:16 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: WARNING: detected an error on database server "server2" (MCR00099) Feb 5 06:17:18 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: WARNING: detected an error on database server "server2" (MCR00099) Feb 5 06:17:20 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: WARNING: detected an error on database server "server2" (MCR00099) Feb 5 06:17:39 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]:
disconnected from the database server "server2" (MCR00076) Feb 5 06:17:39 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: ERROR: unexpected error occurred: class java.lang.RuntimeException: java.net.SocketTimeoutException#012#011at com.fujitsu.fepmc.net.rpc.RPCSocketProcessor.run(Unknown Source)#012#011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)#012#011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)#012#011at java.lang.Thread.run(Thread.java:748)#012 (MCR00030)