Ensuring business continuity: The critical role of Mirroring Controller in high-availability systems

img-blog-curtain-author-rajesh-kandasamy-blue-to-cyan

Mirroring Controller ensures business continuity by seamlessly switching to a standby server in case of abnormalities detected on the primary server.

A customer keen on using Mirroring Controller enquired about its usage without an arbitration server. So the Center of Excellence team went to work and tested 10 different scenarios of a Fujitsu Enterprise Postgres environment running without an arbitration server, and then performed the test of the same 10 scenarios in an environment with an artibtation server.

The testing highlighted the effectiveness of the arbitration server in preventing split brain scenarios - more on this later.

But before proceeding, let me take a step back and explain some of the Fujitsu Enterprise Postgres components discussed here.

What is Mirroring Controller?

The Mirroring Controller is a module within Fujitsu Enterprise Postgres that continually monitors the system to switch online transaction processing to a standby server seamlessly in case an abnormality is detected.

This feature ensures business continuity and is a must-have for high-availability, mission-critical systems, which by definition require maximum uptime. It accomplishes this through system monitoring and automatic switch/disconnect.

img-dgm-mirroring-controller-02

Read more in our blog post
Mirroring Controller: a key to business continuity

What is the arbitration server?

Fujitsu Enterprise Postgres provides the Server Assistant component, which determines the status of database servers as a third party, and if necessary, isolates affected databases. The server running the Server Assistant - the arbitration server - evaluates the status of the primary server reported by the standby server (or vice versa), and takes the appropriate action, such as fencing the malfunctioning server.

img-dgm-mirroring-controller-03

To learn more about the arbitration server, check the General Description > High reliability with database multiplexing

How does the arbitration server prevent split brain scenarios?

The term split brain refers to a scenario when multiple nodes within a cluster simultaneously assume they are the primary node. This situation can arise when a backup server transitions to primary status while the original primary server recovers and reconnects.

If each server considers it is active and commits any transactions that the other does not, then you cannot easily resynchronize the databases and merge the information on each server.

The Mirroring Controller arbitration server plays a vital role in addressing this issue. Within a master-standby configuration, it monitors server health and intervenes when detecting anomalies, effectively preventing data inconsistency between nodes.

Test case scenarios

Here are the scenarios that we tested, first without an arbitration server, and then with an arbitration server.

#	Test case scenario	Result without arbitration server	Result with arbitration server
1	Primary is stopped; Linux OS is shut down	Failed over, as expected. Arbitration_cmd and Fencing_cmd are executed.	Failed over, as expected. Fencing_cmd is executed.
2	Primary database instance is brought down using pg_ctl stop	Failed over, as expected. Arbitration_cmd and Fencing_cmd are not executed.	Failed over, as expected. Fencing_cmd is not executed.
3	Postmaster process is killed on primary	Failed over, as expected. Arbitration_cmd and Fencing_cmd are not executed.	Failed over, as expected. Fencing_cmd is not executed.
4	Storage error: Data storage full on primary	Failed over, as expected. Arbitration_cmd and Fencing_cmd are not executed.	Failed over, as expected. Fencing_cmd is not executed.
5	Storage error: WAL directory full on primary	Failed over, as expected. Arbitration_cmd and Fencing_cmd are not executed.	Failed over, as expected. Fencing_cmd i not executed.
6	Mirroring Controller is stopped on primary using mc_ctl stop -M /mc	No failover, as expected. Arbiter and fencing scripts are not executed.	No failover, as expected. Fencing_cmd is not executed.
7	Mirroring Controller is forcibly stopped on primary using mc_ctl stop -M /mc -e	No failover, as expected. Arbiter and fencing scripts are not executed. Database instance on primary is up and running.	No failover, as expected. Fencing script is not executed. Database instance on primary is up and running without any event logged for crash.
8	Mirroring Controller processes (mc_agent and mc_keeper) are killed and Mirroring Controller is stopped, but database instance is kept running	No failover, as expected. Arbiter and fencing scripts are not executed. Database instance on primary is up and running.	No failover, as expected. Fencing script is not executed. Database instance on primary is up and running without any event logged for crash.
9	Network error: Network (streaming and admin) is disrupted between primary and standby.	Failed over, as expected. Arbitration and fencing scripts are executed, but unable to fence due to network failure. Split brain scenario is observed.	Failed over, as expected Fencing script executed, as it goes through arbitration network. Split-brain scenario is prevented.
10	Network error: Arbitration (eth1) network is disrupted; bit streaming replication and admin network is configured on other network (eth0).	Not applicable, as only one network is configurable, when there is no arbitration server.	No failover, as expected. Fencing script is not executed.

Test results with Mirroring Controller but without arbitration server

We tested the following use case scenarios with Mirroring Controller without an arbitration server. Logs of the execution are provided for reference.

Scenario 1: Primary is stopped; Linux OS is shut down

Result - Server status: Failed over, as expected. Arbitration_cmd and Fencing_cmd are executed.

[fepuser@fep15-ha-server1: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
switched

server_id  host_role  host           host_status  db_proc_status        disk_status
-----------------------------------------------------------------------------------
server1    primary    192.168.100.5  normal       abnormal(wal_sender)  normal
server2    unknown    192.168.100.6  abnormal     unknown               unknown
[fepuser@fep15-ha-server1: ~]$
[fepuser@fep15-ha-server1: ~]$
[fepuser@fep15-ha-server1: ~]$ cat /mc/hb_chg.log
         Arbitration Command executed at: Thu Feb  1 16:58:58 AEDT 2024

         Fencing Command executed at: Thu Feb  1 16:58:58 AEDT 2024
[fepuser@fep15-ha-server1: ~]$

Standby (new primary) Mirroring Controller log

Feb  1 16:58:55 fep15-ha-server1 MirroringControllerOpen[6225]: WARNING: detected an error on the monitored object "server(server2)": no response:ping timeout (MCA00019)
Feb  1 16:58:56 fep15-ha-server1 MirroringControllerOpen[6225]: WARNING: detected an error on the monitored object "server(server2)": no response:ping timeout (MCA00019)
Feb  1 16:58:58 fep15-ha-server1 MirroringControllerOpen[6225]: WARNING: detected an error on the monitored object "server(server2)": no response:ping timeout (MCA00019)
Feb  1 16:58:58 fep15-ha-server1 MirroringControllerOpen[6225]: LOG: executing arbitration command and inquiring the result of arbitration (MCA00191)
Feb  1 16:58:58 fep15-ha-server1 MirroringControllerOpen[6225]: LOG: starting degenerate because of result of arbitration command for database server "server2"  result:"0" (MCA00193)
Feb  1 16:58:58 fep15-ha-server1 MirroringControllerOpen[6225]: LOG: executing fencing command (MCA00192)
Feb  1 16:58:58 fep15-ha-server1 MirroringControllerOpen[6225]: LOG: fencing command for database server "server2" succeeded: result:"0" (MCA00198)
Feb  1 16:59:00 fep15-ha-server1 MirroringControllerOpen[6225]: LOG: promotion processing completed (MCA00062)

Standby (new primary) database log

2024-02-01 16:58:58.719 AEDT [6256] LOG:  received promote request
2024-02-01 16:58:58.720 AEDT [6259] FATAL:  terminating walreceiver process due to administrator command
2024-02-01 16:58:58.720 AEDT [6256] LOG:  invalid record length at 0/18000060: wanted 24, got 0
2024-02-01 16:58:58.720 AEDT [6256] LOG:  redo done at 0/18000028 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 97.78 s
2024-02-01 16:58:58.758 AEDT [6256] LOG:  selected new timeline ID: 5
2024-02-01 16:58:59.587 AEDT [6256] LOG:  archive recovery complete
2024-02-01 16:59:00.178 AEDT [6254] LOG:  checkpoint starting: force
2024-02-01 16:59:00.180 AEDT [6252] LOG:  database system is ready to accept connections
2024-02-01 16:59:00.917 AEDT [6252] LOG:  received SIGHUP, reloading configuration files
2024-02-01 16:59:03.274 AEDT [6254] LOG:  checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.194 s, total=3.097 s; sync files=2, longest=0.099 s, average=0.097 s; distance=16384 kB, estimate=16384 kB

Scenario 2: Primary database instance is brought down using pg_ctl stop

Result: Failed over, as expected. Arbitration_cmd and Fencing_cmd is not executed.

[fepuser@fep15-ha-server1: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
switched

server_id  host_role                  host           host_status  db_proc_status        disk_status
---------------------------------------------------------------------------------------------------
server1    primary                    192.168.100.5  normal       abnormal(wal_sender)  normal
server2    none(inactivated primary)  192.168.100.6  normal       abnormal(postmaster)  normal
[fepuser@fep15-ha-server1: ~]$

Primary database log

2024-02-01 16:30:13.928 AEDT [356137] LOG:  received fast shutdown request
2024-02-01 16:30:13.934 AEDT [356137] LOG:  aborting any active transactions
2024-02-01 16:30:13.935 AEDT [356146] LOG:  user profile status writer shutting down
2024-02-01 16:30:13.940 AEDT [356137] LOG:  background worker "user profile status writer" (PID 356146) exited with exit code 1
2024-02-01 16:30:13.940 AEDT [356137] LOG:  background worker "logical replication launcher" (PID 356147) exited with exit code 1
2024-02-01 16:30:13.941 AEDT [356139] LOG:  shutting down
2024-02-01 16:30:13.986 AEDT [356139] LOG:  checkpoint starting: shutdown immediate
2024-02-01 16:30:14.018 AEDT [356139] LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.046 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16383 kB, estimate=16383 kB
2024-02-01 16:30:14.056 AEDT [356137] LOG:  database system is shut down

Primary Mirroring Controller log

Feb  1 16:30:14 fep15-ha-server2 MirroringControllerOpen[356515]: WARNING: detected an error on the monitored object "database process(postmaster)": no response:connection to server at "::1", port 27500 failed: Connection refused#012#011Is the server running on that host and accepting TCP/IP connections?#012 (MCA00019)
Feb  1 16:30:14 fep15-ha-server2 MirroringControllerOpen[356515]: LOG: starting to fail over from server2 to server1 (MCA00021)
Feb  1 16:30:14 fep15-ha-server2 MirroringControllerOpen[356515]: LOG: fail over completed.switched from server2 to server1 (MCA00022)

Standby Mirroring Controller log

Feb 1 16:30:14 fep15-ha-server1 MirroringControllerOpen[373364]: LOG: promotion processing completed (MCA00062)

Scenario 3: Postmaster process is killed on primary

Result: Failed over, as expected. Arbitration_cmd and Fencing_cmd is not executed.

[fepuser@fep15-ha-server1: /mc]$ mc_ctl status -M /mc/
mirroring status
-----------------
switched

server_id  host_role                  host           host_status  db_proc_status        disk_status
---------------------------------------------------------------------------------------------------
server1    none(inactivated primary)  192.168.100.5  normal       abnormal(postmaster)  normal
server2    primary                    192.168.100.6  normal       abnormal(wal_sender)  normal
[fepuser@fep15-ha-server1: /mc]$

Primary Mirroring Controller log

Feb  1 16:13:12 fep15-ha-server1 MirroringControllerOpen[371058]: WARNING: detected an error on the monitored object "database process(postmaster)": no response:connection to server at "::1", port 27500 failed: Connection refused#012#011Is the server running on that host and accepting TCP/IP connections?#012 (MCA00019)
Feb  1 16:13:12 fep15-ha-server1 MirroringControllerOpen[371058]: LOG: starting to fail over from server1 to server2 (MCA00021)
Feb  1 16:13:13 fep15-ha-server1 MirroringControllerOpen[371058]: LOG: fail over completed.switched from server1 to server2 (MCA00022)

Standby Mirroring Controller log

Feb 1 16:13:13 fep15-ha-server2 MirroringControllerOpen[352280]: LOG: promotion processing completed (MCA00062)

Scenario 4: Storage error - Data Storage full on primary

Result: Failed over, as expected. Arbitration_cmd and Fencing_cmd is not executed.

Test case scenario execution log

[fepuser@fep15-ha-server2: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
switched

server_id  host_role                  host           host_status  db_proc_status        disk_status
--------------------------------------------------------------------------------------------------------------------------
server1    none(inactivated primary)  192.168.100.5  normal       abnormal(postmaster)  abnormal(data,tran_log,tablespace)
server2    primary                    192.168.100.6  normal       abnormal(wal_sender)  normal
[fepuser@fep15-ha-server2: ~]$

Primary log

Feb  1 11:29:11 fep15-ha-server1 MirroringControllerOpen[5955]: WARNING: detected an error on the monitored object "data storage destination directory(/database/inst1)": read/write error:exception=class java.io.IOException message=No space left on device (MCA00019)
Feb  1 11:29:11 fep15-ha-server1 MirroringControllerOpen[5955]: WARNING: detected an error on the monitored object "transaction log storage destination directory(/database/inst1/pg_wal)": read/write error:exception=class java.io.IOException message=No space left on device (MCA00019)
Feb  1 11:29:11 fep15-ha-server1 MirroringControllerOpen[5955]: WARNING: detected an error on the monitored object "tablespace directory(/pg_tblspc/enc_tblspc)": read/write error:exception=class java.io.IOException message=No space left on device (MCA00019)
Feb  1 11:29:11 fep15-ha-server1 rsyslogd[888]: imjournal: fopen() failed for path: '/var/lib/rsyslog/imjournal.state.tmp': No space left on device [v8.2102.0-5.el8 try https://www.rsyslog.com/e/2013 ]
Feb  1 11:29:12 fep15-ha-server1 MirroringControllerOpen[5955]: WARNING: detected an error on the monitored object "tablespace directory(/pg_tblspc/enc_tblspc)": read/write error:exception=class java.io.IOException message=No space left on device (MCA00019)
Feb  1 11:29:13 fep15-ha-server1 MirroringControllerOpen[5955]: WARNING: detected an error on the monitored object "tablespace directory(/pg_tblspc/enc_tblspc)": read/write error:exception=class java.io.IOException message=No space left on device (MCA00019)
Feb  1 11:29:13 fep15-ha-server1 MirroringControllerOpen[5955]: LOG: starting to fail over from server1 to server2 (MCA00021)
Feb  1 11:29:13 fep15-ha-server1 MirroringControllerOpen[5955]: LOG: stopped database instance forcibly (MCA00064)
Feb  1 11:29:13 fep15-ha-server1 MirroringControllerOpen[5955]: LOG: fail over completed.switched from server1 to server2 (MCA00022)

Standby (new primary) log

Feb 1 11:29:13 fep15-ha-server2 MirroringControllerOpen[376691]: LOG: promotion processing completed (MCA00062)

Primary database log

2024-02-01 11:27:30.462 AEDT [9876] ERROR:  could not create file "base/5/16522_vm": No space left on device
2024-02-01 11:27:30.462 AEDT [9876] CONTEXT:  COPY pgbench_accounts, line 1000
2024-02-01 11:27:30.462 AEDT [9876] STATEMENT:  copy pgbench_accounts from stdin with (freeze on)
2024-02-01 11:28:10.854 AEDT [5984] LOG:  checkpoint starting: time
2024-02-01 11:28:13.244 AEDT [5984] LOG:  checkpoint complete: wrote 24 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=2.319 s, sync=0.032 s, total=2.390 s; sync files=24, longest=0.021 s, average=0.002 s; distance=202 kB, estimate=202 kB
2024-02-01 11:29:09.692 AEDT [10377] WARNING:  could not extend file "base/5/16538", but retrying: No space left on device
2024-02-01 11:29:09.692 AEDT [10377] CONTEXT:  COPY pgbench_accounts, line 2000
2024-02-01 11:29:13.121 AEDT [5982] LOG:  received immediate shutdown request
2024-02-01 11:29:13.136 AEDT [5982] LOG:  database system is shut down
~

Scenario 5: Storage error - WAL directory full on primary

Result: Failed over, as expected. Arbitration_cmd and Fencing_cmd is not executed.

[fepuser@fep15-ha-server2: /run/pg_wal]$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           1.9G  1.9G  1.9M 100% /run
[fepuser@fep15-ha-server2: /run/pg_wal]$

[fepuser@fep15-ha-server1: /database/inst1/log]$ mc_ctl status -M /mc/
mirroring status
-----------------
switched

server_id  host_role                  host           host_status  db_proc_status        disk_status
---------------------------------------------------------------------------------------------------
server1    primary                    192.168.100.5  normal       abnormal(wal_sender)  normal
server2    none(inactivated primary)  192.168.100.6  normal       abnormal(postmaster)  normal
[fepuser@fep15-ha-server1: /database/inst1/log]$

Primary database log

2024-02-01 12:32:49.594 AEDT [401613] PANIC:  could not write to file "pg_wal/xlogtemp.401613": No space left on device
2024-02-01 12:32:49.594 AEDT [401613] CONTEXT:  writing block 4846 of relation base/5/24714
        COPY pgbench_accounts, line 421000
2024-02-01 12:32:49.594 AEDT [401613] STATEMENT:  copy pgbench_accounts from stdin with (freeze on)
2024-02-01 12:32:50.099 AEDT [400627] LOG:  server process (PID 401613) was terminated by signal 6: Aborted
2024-02-01 12:32:50.099 AEDT [400627] DETAIL:  Failed process was running: copy pgbench_accounts from stdin with (freeze on)
2024-02-01 12:32:50.099 AEDT [400627] LOG:  terminating any other active server processes
2024-02-01 12:32:50.102 AEDT [400627] LOG:  all server processes terminated; reinitializing
2024-02-01 12:32:50.202 AEDT [400627] LOG:  auto-open keystore has been opened
2024-02-01 12:32:50.203 AEDT [401627] LOG:  database system was interrupted; last known up at 2024-02-01 12:29:44 AEDT
2024-02-01 12:32:50.206 AEDT [401630] FATAL:  the database system is in recovery mode
2024-02-01 12:32:50.493 AEDT [401633] FATAL:  the database system is in recovery mode
2024-02-01 12:32:50.500 AEDT [401634] FATAL:  the database system is in recovery mode
2024-02-01 12:32:50.507 AEDT [401635] FATAL:  the database system is in recovery mode
2024-02-01 12:32:50.514 AEDT [401636] FATAL:  the database system is in recovery mode
2024-02-01 12:32:51.248 AEDT [401627] LOG:  database system was not properly shut down; automatic recovery in progress
2024-02-01 12:32:51.254 AEDT [401627] LOG:  redo starts at 0/140000A0
2024-02-01 12:32:51.319 AEDT [401627] LOG:  redo done at 0/15FFEC50 system usage: CPU: user: 0.05 s, system: 0.01 s, elapsed: 0.06 s
2024-02-01 12:32:51.323 AEDT [401639] FATAL:  the database system is not yet accepting connections
2024-02-01 12:32:51.323 AEDT [401639] DETAIL:  Consistent recovery state has not been yet reached.
2024-02-01 12:32:51.331 AEDT [401627] FATAL:  could not write to file "pg_wal/xlogtemp.401627": No space left on device
2024-02-01 12:32:51.332 AEDT [401640] FATAL:  the database system is not yet accepting connections
2024-02-01 12:32:51.332 AEDT [401640] DETAIL:  Consistent recovery state has not been yet reached.
2024-02-01 12:32:51.334 AEDT [400627] LOG:  startup process (PID 401627) exited with exit code 1
2024-02-01 12:32:51.334 AEDT [400627] LOG:  terminating any other active server processes
2024-02-01 12:32:51.339 AEDT [401641] FATAL:  the database system is in recovery mode
2024-02-01 12:32:51.341 AEDT [400627] LOG:  shutting down due to startup process failure
2024-02-01 12:32:51.348 AEDT [400627] LOG:  database system is shut down

Primary Mirroring Controller log

Feb  1 12:32:50 fep15-ha-server2 MirroringControllerOpen[400600]: WARNING: detected an error on the monitored object "database process(postmaster)": no response:connection to server at "::1", port 27500 failed: FATAL:  the database system is in recovery mode#012 (MCA00019)
Feb  1 12:32:51 fep15-ha-server2 MirroringControllerOpen[400600]: WARNING: detected an error on the monitored object "database process(postmaster)": no response:connection to server at "::1", port 27500 failed: FATAL:  the database system is in recovery mode#012 (MCA00019)
Feb  1 12:32:51 fep15-ha-server2 MirroringControllerOpen[400600]: LOG: starting to fail over from server2 to server1 (MCA00021)
Feb  1 12:32:51 fep15-ha-server2 MirroringControllerOpen[400600]: LOG: fail over completed.switched from server2 to server1 (MCA00022)

Standby Mirroring Controller log

Feb 1 12:32:51 fep15-ha-server1 MirroringControllerOpen[13212]: LOG: promotion processing completed (MCA00062)

Scenario 6: Mirroring Controler is stopped on primary using mc_ctl stop -M /mc

Result: No Failover, as expected. Arbitration_cmd and Fencing_cmd is not executed.

Primary Mirroring Controller log

Feb 1 18:05:50 fep15-ha-server1 MirroringControllerOpen[11781]: LOG: stopping Mirroring Controller (MCA00041)
Feb 1 18:05:50 fep15-ha-server1 MirroringControllerOpen[11781]: LOG: Mirroring Controller stopped target server:"server1" (MCA00042)

Standby Mirroring Controller log

No event is recorded on the standby Mirroring Controller log.

Primary database log

2024-02-01 18:05:50.163 AEDT [11687] LOG:  received fast shutdown request
2024-02-01 18:05:50.169 AEDT [11687] LOG:  aborting any active transactions
2024-02-01 18:05:50.170 AEDT [11696] LOG:  user profile status writer shutting down
2024-02-01 18:05:50.174 AEDT [11687] LOG:  background worker "user profile status writer" (PID 11696) exited with exit code 1
2024-02-01 18:05:50.174 AEDT [11687] LOG:  background worker "logical replication launcher" (PID 11697) exited with exit code 1
2024-02-01 18:05:50.175 AEDT [11689] LOG:  shutting down
2024-02-01 18:05:50.275 AEDT [11689] LOG:  checkpoint starting: shutdown immediate
2024-02-01 18:05:50.370 AEDT [11689] LOG:  checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.002 s, sync=0.015 s, total=0.118 s; sync files=2, longest=0.009 s, average=0.008 s; distance=16384 kB, estimate=16384 kB
2024-02-01 18:05:50.703 AEDT [11687] LOG:  database system is shut down

Scenario 7: Mirroring Controller is forcibly stopped on primary using mc_ctl stop -M /mc -e

Result: No Failover, as expected. Arbitration_cmd and Fencing_cmd is not executed. Database instance on primary is up and running.

[fepuser@fep15-ha-server1: /mc]$ mc_ctl stop -M /mc/ -e
stopping Mirroring Controller forcibly (MCA00044)
stopped Mirroring Controller forcibly (MCA00045)
[fepuser@fep15-ha-server1: /mc]$ mc_ctl status -M /mc/
mirroring status
-----------------
unknown

server_id  host_role  host           host_status  db_proc_status  disk_status
-----------------------------------------------------------------------------
server1    unknown    192.168.100.5  normal       unknown         unknown
server2    standby    192.168.100.6  normal       normal          normal
[fepuser@fep15-ha-server1: /mc]$
[fepuser@fep15-ha-server1: /mc]$ pg_ctl -D /database/inst1/ status
pg_ctl: server is running (PID: 6252)
/opt/fsepv15server64/bin/postgres "-D" "/database/inst1"
[fepuser@fep15-ha-server1: /mc]$

Primary database log

No related event in the database log

Primary Mirroring Controller log

Feb 1 17:11:12 fep15-ha-server1 MirroringControllerOpen[10086]: LOG: stopping Mirroring Controller forcibly (MCA00044)
Feb 1 17:11:12 fep15-ha-server1 MirroringControllerOpen[10086]: LOG: stopped Mirroring Controller forcibly (MCA00045)

Standby Mirroring Controller log

No related event in the Mirroring Controller log.

Scenario 8: Mirroring Controller processes (mc_agent and mc_keeper) are killed and Mirroring Controller is stopped, but database instance is kept running

Result: No Failover, as expected. Arbitration_cmd and Fencing_cmd is not executed. Database instance on primary is up and running.

[fepuser@fep15-ha-server1: ~]$ date; kill -9 10992 10979
Thu Feb  1 17:42:01 AEDT 2024
[fepuser@fep15-ha-server1: ~]$
[fepuser@fep15-ha-server1: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
unknown

server_id  host_role  host           host_status  db_proc_status  disk_status
-----------------------------------------------------------------------------
server1    unknown    192.168.100.5  normal       unknown         unknown
server2    standby    192.168.100.6  normal       normal          normal
[fepuser@fep15-ha-server1: ~]$

Mirroring Controller log

No event is recorded in the Mirroring Controller log of primary or standby servers.

Scenario 9: Network error: Network is disrupted between primary and standby - IP address is brought down using ifconfig eth1 down)

Note: Mirroring Controller and replication are running on the same network

Result - Server status: Failed over, as expected. Arbitration_cmd and Fencing_cmd are executed. Split-brain scenario visible.

[fsepuser@fep15-rhel8-ha-server1: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
switchable

server_id  host_role  host         host_status  db_proc_status  disk_status
---------------------------------------------------------------------------
server1    primary    10.1.254.21  normal       normal          normal
server2    standby    10.1.254.22  normal       normal          normal
[fsepuser@fep15-rhel8-ha-server1: ~]$

[fsepuser@fep15-rhel8-ha-server1: ~]$ ps -ef | grep fsepuser
root      445079    1518  0 12:12 pts/0    00:00:00 su - fsepuser
fsepuser  445080  445079  0 12:12 pts/0    00:00:00 -bash
fsepuser  445260       1  0 12:12 pts/0    00:00:00 mc_keeper start -M /mc
fsepuser  445273  445260  0 12:12 pts/0    00:00:02 mc_agent start -M /mc
fsepuser  445300       1  0 12:12 ?        00:00:00 /opt/fsepv15server64/bin/postgres -D /database/inst1
fsepuser  445301  445300  0 12:12 ?        00:00:00 postgres: logger
fsepuser  445302  445300  0 12:12 ?        00:00:00 postgres: checkpointer
fsepuser  445303  445300  0 12:12 ?        00:00:00 postgres: background writer
fsepuser  445305  445300  0 12:12 ?        00:00:00 postgres: walwriter
fsepuser  445306  445300  0 12:12 ?        00:00:00 postgres: autovacuum launcher
fsepuser  445307  445300  0 12:12 ?        00:00:00 postgres: archiver last was 000000090000000000000037.00000028.backup
fsepuser  445308  445300  0 12:12 ?        00:00:00 postgres: logical replication launcher
fsepuser  445997  445300  0 12:15 ?        00:00:00 postgres: walsender repluser 10.1.0.21(60824) streaming 0/38000328
fsepuser  446907  445080  0 12:18 pts/0    00:00:00 ps -ef
fsepuser  446908  445080  0 12:18 pts/0    00:00:00 grep --color=auto fsepuser
[fsepuser@fep15-rhel8-ha-server1: ~]$

------------------------------------------------------

[fsepuser@fep15-rhel8-ha-server2: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
switchable

server_id  host_role  host         host_status  db_proc_status  disk_status
---------------------------------------------------------------------------
server1    primary    10.1.254.21  normal       normal          normal
server2    standby    10.1.254.22  normal       normal          normal
[fsepuser@fep15-rhel8-ha-server2: ~]$
[fsepuser@fep15-rhel8-ha-server2: ~]$ psql
psql (15.0)
Type "help" for help.

postgres=# SHOW primary_conninfo ;
                               primary_conninfo
-------------------------------------------------------------------------------
user=repluser password=Secret host=10.1.0.20 port=27500 application_name=stby
(1 row)

postgres=#

[fsepuser@fep15-rhel8-ha-server2: ~]$ ps  -ef | grep fsepuser
root        1556    1532  0 Feb06 pts/0    00:00:00 su - fsepuser
fsepuser    1557    1556  0 Feb06 pts/0    00:00:00 -bash
fsepuser  557223       1  0 12:15 pts/0    00:00:00 mc_keeper start -M /mc
fsepuser  557236  557223  1 12:15 pts/0    00:00:02 mc_agent start -M /mc
fsepuser  557263       1  0 12:15 ?        00:00:00 /opt/fsepv15server64/bin/postgres -D /database/inst1
fsepuser  557264  557263  0 12:15 ?        00:00:00 postgres: logger
fsepuser  557265  557263  0 12:15 ?        00:00:00 postgres: checkpointer
fsepuser  557266  557263  0 12:15 ?        00:00:00 postgres: background writer
fsepuser  557267  557263  0 12:15 ?        00:00:00 postgres: startup recovering 000000090000000000000038
fsepuser  557268  557263  0 12:15 ?        00:00:00 postgres: walreceiver streaming 0/38000328
fsepuser  558101    1557  0 12:18 pts/0    00:00:00 ps -ef
fsepuser  558102    1557  0 12:18 pts/0    00:00:00 grep --color=auto fsepuser
[fsepuser@fep15-rhel8-ha-server2: ~]$

===================================================================

[root@fep15-rhel8-ha-server1 ~]# ifconfig eth1 down
[root@fep15-rhel8-ha-server1 ~]# date
Thu Feb  8 12:19:33 AEDT 2024
[root@fep15-rhel8-ha-server1 ~]#

[fsepuser@fep15-rhel8-ha-server1: /mc]$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.1.0.20  netmask 255.255.255.0  broadcast 10.1.0.255
        inet6 fe80::222:48ff:fe7e:cac7  prefixlen 64  scopeid 0x20<link&>
        ether 00:22:48:7e:ca:c7  txqueuelen 1000  (Ethernet)
        RX packets 465126  bytes 134442811 (128.2 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 570268  bytes 327462701 (312.2 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNINGgt;  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 1549836  bytes 161595079 (154.1 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1549836  bytes 161595079 (154.1 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[fsepuser@fep15-rhel8-ha-server1: /mc]$

===================================================================

[fsepuser@fep15-rhel8-ha-server1: /mc]$ mc_ctl status -M /mc/
mc_ctl: The IP address or host name of the server where the command was executed in is not found on the definition file "/mc/network.conf" (MCA00051)
[fsepuser@fep15-rhel8-ha-server1: /mc]$
[fsepuser@fep15-rhel8-ha-server1: /mc]$ cat hb_chg.log
Arbitration Command executed at: Thu Feb  8 12:19:32 AEDT 2024

Fencing Command executed at: Thu Feb  8 12:19:32 AEDT 2024
[fsepuser@fep15-rhel8-ha-server1: /mc]$

------------------------------------------------

[fsepuser@fep15-rhel8-ha-server2: /mc]$ mc_ctl status -M /mc/
mirroring status
-----------------
switched

server_id  host_role  host         host_status  db_proc_status        disk_status
---------------------------------------------------------------------------------
server1    unknown    10.1.254.21  abnormal     unknown               unknown
server2    primary    10.1.254.22  normal       abnormal(wal_sender)  normal
[fsepuser@fep15-rhel8-ha-server2: /mc]$ cat hb_chg.log
Arbitration Command executed at: Thu Feb  8 12:19:32 AEDT 2024

Fencing Command executed at: Thu Feb  8 12:19:32 AEDT 2024
[fsepuser@fep15-rhel8-ha-server2: /mc]$

Server1 Mirroring Controller log

Feb  8 12:19:28 fep15-rhel8-ha-server1 MirroringControllerOpen[445273]: WARNING: detected an error on the monitored object "server(server2)": no response:ping timeout (MCA00019)
Feb  8 12:19:30 fep15-rhel8-ha-server1 MirroringControllerOpen[445273]: WARNING: detected an error on the monitored object "server(server2)": no response:ping timeout (MCA00019)
Feb  8 12:19:32 fep15-rhel8-ha-server1 MirroringControllerOpen[445273]: WARNING: detected an error on the monitored object "server(server2)": no response:ping timeout (MCA00019)
Feb  8 12:19:32 fep15-rhel8-ha-server1 MirroringControllerOpen[445273]: LOG: executing arbitration command and inquiring the result of arbitration (MCA00191)
Feb  8 12:19:32 fep15-rhel8-ha-server1 MirroringControllerOpen[445273]: LOG: starting degenerate because of result of arbitration command for database server "server2"  result:"0" (MCA00193)
Feb  8 12:19:32 fep15-rhel8-ha-server1 MirroringControllerOpen[445273]: LOG: starting to detach standby server "server2" automatically (MCA00024)
Feb  8 12:19:32 fep15-rhel8-ha-server1 MirroringControllerOpen[445273]: LOG: executing fencing command (MCA00192)
Feb  8 12:19:32 fep15-rhel8-ha-server1 MirroringControllerOpen[445273]: LOG: fencing command for database server "server2" succeeded: result:"0" (MCA00198)
Feb  8 12:19:32 fep15-rhel8-ha-server1 MirroringControllerOpen[445273]: LOG: detach standby server "server2" completed automatically (MCA00025)

Feb  8 12:20:47 fep15-rhel8-ha-server1 MirroringControllerOpen[445273]: WARNING: detected an error on the monitored object "database process(postmaster)": no response:connection to server at "::1", port 27500 failed: fe_sendauth: no password supplied#012 (MCA00019)
Feb  8 12:21:09 fep15-rhel8-ha-server1 MirroringControllerOpen[447660]: ERROR: mc_ctl: The IP address or host name of the server where the command was executed in is not found on the definition file "/mc/network.conf" (MCA00051)

Server2 Mirroring Controller log

Feb  8 12:19:28 fep15-rhel8-ha-server2 MirroringControllerOpen[557236]: WARNING: detected an error on the monitored object "server(server1)": no response:ping timeout (MCA00019)
Feb  8 12:19:30 fep15-rhel8-ha-server2 MirroringControllerOpen[557236]: WARNING: detected an error on the monitored object "server(server1)": no response:ping timeout (MCA00019)
Feb  8 12:19:32 fep15-rhel8-ha-server2 MirroringControllerOpen[557236]: WARNING: detected an error on the monitored object "server(server1)": no response:ping timeout (MCA00019)
Feb  8 12:19:32 fep15-rhel8-ha-server2 MirroringControllerOpen[557236]: LOG: executing arbitration command and inquiring the result of arbitration (MCA00191)
Feb  8 12:19:32 fep15-rhel8-ha-server2 MirroringControllerOpen[557236]: LOG: starting degenerate because of result of arbitration command for database server "server1"  result:"0" (MCA00193)
Feb  8 12:19:32 fep15-rhel8-ha-server2 MirroringControllerOpen[557236]: LOG: executing fencing command (MCA00192)
Feb  8 12:19:32 fep15-rhel8-ha-server2 MirroringControllerOpen[557236]: LOG: fencing command for database server "server1" succeeded: result:"0" (MCA00198)
Feb  8 12:19:32 fep15-rhel8-ha-server2 MirroringControllerOpen[557236]: LOG: promotion processing completed (MCA00062)

Primary database log

2024-02-08 12:19:32.252 AEDT [445997] FATAL:  terminating connection due to administrator command
2024-02-08 12:19:32.252 AEDT [445997] STATEMENT:  START_REPLICATION SLOT "repl_slot1" 0/38000000 TIMELINE 9 COMPRESS 0
2024-02-08 12:19:32.264 AEDT [445300] LOG:  received SIGHUP, reloading configuration files
2024-02-08 12:19:32.264 AEDT [445300] LOG:  parameter "synchronous_standby_names" removed from configuration file, reset to default

Standby database log

2024-02-08 12:19:32.252 AEDT [557268] FATAL:  could not receive data from WAL stream: FATAL:  terminating connection due to administrator command
2024-02-08 12:19:32.252 AEDT [557267] LOG:  invalid record length at 0/38000410: wanted 24, got 0
2024-02-08 12:19:32.279 AEDT [558371] LOG:  started streaming WAL from primary at 0/38000000 on timeline 9
2024-02-08 12:19:32.495 AEDT [557267] LOG:  received promote request
2024-02-08 12:19:32.495 AEDT [558371] FATAL:  terminating walreceiver process due to administrator command
2024-02-08 12:19:32.495 AEDT [557267] LOG:  waiting for WAL to become available at 0/38000428
2024-02-08 12:19:32.495 AEDT [557267] LOG:  redo done at 0/380003D8 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 247.21 s
2024-02-08 12:19:32.495 AEDT [557267] LOG:  last completed transaction was at log time 2024-02-08 12:16:30.382893+11
2024-02-08 12:19:32.502 AEDT [557267] LOG:  selected new timeline ID: 10
2024-02-08 12:19:32.578 AEDT [557267] LOG:  archive recovery complete
2024-02-08 12:19:32.594 AEDT [557265] LOG:  checkpoint starting: force
2024-02-08 12:19:32.597 AEDT [557263] LOG:  database system is ready to accept connections
2024-02-08 12:19:32.606 AEDT [557263] LOG:  received SIGHUP, reloading configuration files
2024-02-08 12:19:32.723 AEDT [557265] LOG:  checkpoint complete: wrote 5 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.105 s, sync=0.007 s, total=0.130 s; sync files=4, longest=0.003 s, average=0.002 s; distance=16385 kB, estimate=16385 kB

===================================================================

[root@fep15-rhel8-ha-server1 ~]# ifconfig eth1 up
[root@fep15-rhel8-ha-server1 ~]# date
Thu Feb  8 12:34:06 AEDT 2024
[root@fep15-rhel8-ha-server1 ~]#
[root@fep15-rhel8-ha-server1 ~]# su - fsepuser
Last login: Thu Feb  8 12:20:36 AEDT 2024 on pts/0
[fsepuser@fep15-rhel8-ha-server1: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
not-switchable

server_id  host_role  host         host_status  db_proc_status        disk_status
---------------------------------------------------------------------------------
server1    primary    10.1.254.21  normal       abnormal(wal_sender)  normal
server2    primary    10.1.254.22  normal       abnormal(wal_sender)  normal
[fsepuser@fep15-rhel8-ha-server1: ~]$

Scenario 10: This scenario is not applicable

This scenario is not applicable, as only one network is configurable, when there is no arbitration server.

Test results with Mirroring Controller and arbitration server

Scenario 1: Primary is stopped; Linux OS is shut down

Result - Server status: Failed over, as expected. Fencing_cmd is executed.

[fsepuser@fep15-rhel8-ha-server2: /mc]$ mc_ctl status -M /mc/
mirroring status
-----------------
switched

server_id  host_role  host       host_status  db_proc_status        disk_status
-------------------------------------------------------------------------------
server1    unknown    10.1.0.20  abnormal     unknown               unknown
server2    primary    10.1.0.21  normal       abnormal(wal_sender)  normal
[fsepuser@fep15-rhel8-ha-server2: /mc]$
[fsepuser@fep15-rhel8-ha-arbiter: ~]$ cat /mc/hb_chg.log

Fencing Command executed at: Mon Feb  5 01:21:13 UTC 2024
[fsepuser@fep15-rhel8-ha-arbiter: ~]$

Standby (new primary) Mirroring Controller log

Feb  5 01:21:08 fep15-rhel8-ha-server2 MirroringControllerOpen[1293691]: WARNING: detected an error on the monitored object "server(server1)": no response:ping timeout (MCA00019)
Feb  5 01:21:10 fep15-rhel8-ha-server2 MirroringControllerOpen[1293691]: WARNING: detected an error on the monitored object "server(server1)": no response:ping timeout (MCA00019)
Feb  5 01:21:12 fep15-rhel8-ha-server2 MirroringControllerOpen[1293691]: WARNING: detected an error on the monitored object "server(server1)": no response:ping timeout (MCA00019)
Feb  5 01:21:12 fep15-rhel8-ha-server2 MirroringControllerOpen[1293691]: LOG: requesting arbitration server "arbiter" to arbitrate for database server "server1" (MCA00159)
Feb  5 01:21:13 fep15-rhel8-ha-server2 MirroringControllerOpen[1293691]: LOG: the arbitration server "arbiter" didn't arbitrate status of database server "server1" as sanity (MCA00197)
Feb  5 01:21:13 fep15-rhel8-ha-server2 MirroringControllerOpen[1293691]: LOG: promotion processing completed (MCA00062)

Arbiter Mirroring Controller log

Feb  5 01:21:04 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: LOG: disconnected from the database server "server1" (MCR00076)
Feb  5 01:21:05 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: LOG: stop monitoring of database server "server1" (MCR00098)
Feb  5 01:21:09 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: WARNING: detected an error on database server "server1" (MCR00099)
Feb  5 01:21:11 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: WARNING: detected an error on database server "server1" (MCR00099)
Feb  5 01:21:12 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: LOG: database server "server2" requested to arbitrate for database server "server1" status (MCR00056)
Feb  5 01:21:12 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: LOG: arbitrating the status of database server "server1" (MCR00057)
Feb  5 01:21:13 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: WARNING: detected an error on database server "server1" (MCR00099)
Feb  5 01:21:13 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: LOG: could not receive response from database server server1 (MCR00059)
Feb  5 01:21:13 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: LOG: executing fencing command (MCR00062)
Feb  5 01:21:13 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: LOG: fencing command for database server "server1" succeeded: result:"0" (MCR00063)

Standby (new primary) database log

2024-02-05 12:21:03.896 AEDT [1293726] FATAL:  could not receive data from WAL stream: FATAL:  terminating connection due to administrator command
2024-02-05 12:21:03.897 AEDT [1293725] LOG:  invalid record length at 0/1A0003D8: wanted 24, got 0
2024-02-05 12:21:03.977 AEDT [1295256] FATAL:  could not connect to the primary server: connection to server at "10.1.254.21", port 27500 failed: FATAL:  the database system is shutting down
2024-02-05 12:21:03.977 AEDT [1293725] LOG:  waiting for WAL to become available at 0/1A0003F0
2024-02-05 12:21:13.078 AEDT [1293725] LOG:  received promote request
2024-02-05 12:21:13.078 AEDT [1295284] FATAL:  terminating walreceiver process due to administrator command
2024-02-05 12:21:13.078 AEDT [1293725] LOG:  waiting for WAL to become available at 0/1A0003F0
2024-02-05 12:21:13.078 AEDT [1293725] LOG:  redo done at 0/1A0003A0 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 327.01 s
2024-02-05 12:21:13.078 AEDT [1293725] LOG:  last completed transaction was at log time 2024-02-05 12:17:26.334037+11
2024-02-05 12:21:13.084 AEDT [1293725] LOG:  selected new timeline ID: 4
2024-02-05 12:21:13.181 AEDT [1293725] LOG:  archive recovery complete
2024-02-05 12:21:13.220 AEDT [1293723] LOG:  checkpoint starting: force
2024-02-05 12:21:13.224 AEDT [1293721] LOG:  database system is ready to accept connections
2024-02-05 12:21:13.247 AEDT [1293723] LOG:  checkpoint complete: wrote 2 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.006 s, total=0.028 s; sync files=2, longest=0.004 s, average=0.003 s; distance=0 kB, estimate=14745 kB
2024-02-05 12:21:13.285 AEDT [1293721] LOG:  received SIGHUP, reloading configuration files

Scenario 2: Primary database instance is brought down using pg_ctl stop

Result - Server status: Failed over, as expected.

[fsepuser@fep15-rhel8-ha-server2: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
switched

server_id  host_role                  host       host_status  db_proc_status        disk_status
-----------------------------------------------------------------------------------------------
server1    primary                    10.1.0.20  normal       abnormal(wal_sender)  normal
server2    none(inactivated primary)  10.1.0.21  normal       abnormal(postmaster)  normal
[fsepuser@fep15-rhel8-ha-server2: ~]$

Primary database log

2024-02-05 12:44:10.521 AEDT [1293721] LOG:  received fast shutdown request
2024-02-05 12:44:10.528 AEDT [1293721] LOG:  aborting any active transactions
2024-02-05 12:44:10.533 AEDT [1293721] LOG:  background worker "logical replication launcher" (PID 1295332) exited with exit code 1
2024-02-05 12:44:10.534 AEDT [1293723] LOG:  shutting down
2024-02-05 12:44:10.667 AEDT [1293723] LOG:  checkpoint starting: shutdown immediate
2024-02-05 12:44:10.718 AEDT [1293723] LOG:  checkpoint complete: wrote 5 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.018 s, total=0.055 s; sync files=5, longest=0.016 s, average=0.004 s; distance=16383 kB, estimate=16384 kB
2024-02-05 12:44:10.805 AEDT [1301779] FATAL:  the database system is shutting down
2024-02-05 12:44:10.811 AEDT [1301780] FATAL:  the database system is shutting down
2024-02-05 12:44:10.816 AEDT [1301781] FATAL:  the database system is shutting down
2024-02-05 12:44:10.822 AEDT [1301782] FATAL:  the database system is shutting down
2024-02-05 12:44:10.935 AEDT [1293721] LOG:  database system is shut down

Standby (New Primary) log

2024-02-05 12:44:10.931 AEDT [6723] LOG:  replication terminated by primary server
2024-02-05 12:44:10.931 AEDT [6723] DETAIL:  End of WAL reached on timeline 4 at 0/1D0000A0.
2024-02-05 12:44:10.931 AEDT [6723] FATAL:  could not send end-of-streaming message to primary: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
        no COPY in progress
2024-02-05 12:44:10.932 AEDT [6722] LOG:  invalid record length at 0/1D0000A0: wanted 24, got 0
2024-02-05 12:44:10.940 AEDT [7336] FATAL:  could not connect to the primary server: connection to server at "10.1.0.21", port 27500 failed: Connection refused
                Is the server running on that host and accepting TCP/IP connections?
2024-02-05 12:44:10.940 AEDT [6722] LOG:  waiting for WAL to become available at 0/1D0000B8
2024-02-05 12:44:11.651 AEDT [6722] LOG:  received promote request
2024-02-05 12:44:11.651 AEDT [6722] LOG:  redo done at 0/1D000028 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 138.53 s
2024-02-05 12:44:11.651 AEDT [6722] LOG:  last completed transaction was at log time 2024-02-05 12:42:56.459462+11
2024-02-05 12:44:11.657 AEDT [6722] LOG:  selected new timeline ID: 5
2024-02-05 12:44:11.789 AEDT [6722] LOG:  archive recovery complete
2024-02-05 12:44:11.825 AEDT [6720] LOG:  checkpoint starting: force
2024-02-05 12:44:11.830 AEDT [6718] LOG:  database system is ready to accept connections
2024-02-05 12:44:11.857 AEDT [6718] LOG:  received SIGHUP, reloading configuration files
2024-02-05 12:44:11.980 AEDT [6720] LOG:  checkpoint complete: wrote 5 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.125 s, sync=0.013 s, total=0.155 s; sync files=4, longest=0.006 s, average=0.004 s; distance=16384 kB, estimate=16384 kB

Primary Mirroring Controller log

Feb  5 01:44:10 fep15-rhel8-ha-server2 MirroringControllerOpen[1293691]: WARNING: detected an error on the monitored object "database process(postmaster)": no response:connection to server at "::1", port 27500 failed: FATAL:  the database system is shutting down#012 (MCA00019)
Feb  5 01:44:11 fep15-rhel8-ha-server2 MirroringControllerOpen[1293691]: WARNING: detected an error on the monitored object "database process(postmaster)": no response:connection to server at "::1", port 27500 failed: Connection refused#012#011Is the server running on that host and accepting TCP/IP connections?#012 (MCA00019)
Feb  5 01:44:11 fep15-rhel8-ha-server2 MirroringControllerOpen[1293691]: LOG: starting to fail over from server2 to server1 (MCA00021)
Feb  5 01:44:11 fep15-rhel8-ha-server2 MirroringControllerOpen[1293691]: LOG: fail over completed.switched from server2 to server1 (MCA00022)

Standby Mirroring Controller log

Feb 5 01:44:11 fep15-rhel8-ha-server1 MirroringControllerOpen[6688]: LOG: promotion processing completed (MCA00062)

Arbiter log

No event is recorded in the arbiter Mirroring Controller log

Scenario 3: Postmaster process is killed on primary

Result - Server status: Failed over, as expected

[fsepuser@fep15-rhel8-ha-server2: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
switched

server_id  host_role                  host       host_status  db_proc_status        disk_status
-----------------------------------------------------------------------------------------------
server1    none(inactivated primary)  10.1.0.20  normal       abnormal(postmaster)  normal
server2    primary                    10.1.0.21  normal       abnormal(wal_sender)  normal
[fsepuser@fep15-rhel8-ha-server2: ~]$

Primary Mirroring Controller log

Feb  5 01:58:32 fep15-rhel8-ha-server1 MirroringControllerOpen[6688]: WARNING: detected an error on the monitored object "database process(postmaster)": no response:connection to server at "::1", port 27500 failed: Connection refused#012#011Is the server running on that host and accepting TCP/IP connections?#012 (MCA00019)
Feb  5 01:58:32 fep15-rhel8-ha-server1 MirroringControllerOpen[6688]: LOG: starting to fail over from server1 to server2 (MCA00021)
Feb  5 01:58:32 fep15-rhel8-ha-server1 MirroringControllerOpen[6688]: LOG: fail over completed.switched from server1 to server2 (MCA00022)

Standby Mirroring Controller log

Feb 5 01:58:32 fep15-rhel8-ha-server2 MirroringControllerOpen[1302592]: LOG: promotion processing completed (MCA00062)

Arbiter log

No event is recorded in the arbiter Mirroring Controller log

Scenario 4: Storage error - Data Storage full on primary

Result - Server status: Failed over, as expected.

Test case scenario execution log

[fsepuser@fep15-rhel8-ha-server1: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
switched

server_id  host_role                  host       host_status  db_proc_status        disk_status
-----------------------------------------------------------------------------------------------------------
server1    primary                    10.1.0.20  normal       abnormal(wal_sender)  normal
server2    none(inactivated primary)  10.1.0.21  normal       abnormal(postmaster)  abnormal(data,tran_log)
[fsepuser@fep15-rhel8-ha-server1: ~]$

Primary Mirroring Controller log

Feb  5 02:22:54 fep15-rhel8-ha-server2 MirroringControllerOpen[1302592]: WARNING: detected an error on the monitored object "data storage destination directory(/database/inst1)": read/write error:exception=class java.io.IOException message=No space left on device (MCA00019)
Feb  5 02:22:54 fep15-rhel8-ha-server2 MirroringControllerOpen[1302592]: WARNING: detected an error on the monitored object "transaction log storage destination directory(/database/inst1/pg_wal)": read/write error:exception=class java.io.IOException message=No space left on device (MCA00019)
Feb  5 02:22:55 fep15-rhel8-ha-server2 MirroringControllerOpen[1302592]: WARNING: detected an error on the monitored object "data storage destination directory(/database/inst1)": read/write error:exception=class java.io.IOException message=No space left on device (MCA00019)
Feb  5 02:22:55 fep15-rhel8-ha-server2 MirroringControllerOpen[1302592]: WARNING: detected an error on the monitored object "transaction log storage destination directory(/database/inst1/pg_wal)": read/write error:exception=class java.io.IOException message=No space left on device (MCA00019)
Feb  5 02:22:55 fep15-rhel8-ha-server2 MirroringControllerOpen[1302592]: LOG: starting to fail over from server2 to server1 (MCA00021)
Feb  5 02:22:55 fep15-rhel8-ha-server2 MirroringControllerOpen[1302592]: LOG: stopped database instance forcibly (MCA00064)
Feb  5 02:22:55 fep15-rhel8-ha-server2 rsyslogd[1051]: imjournal: fopen() failed for path: '/var/lib/rsyslog/imjournal.state.tmp': No space left on device [v8.2102.0-5.el8 try https://www.rsyslog.com/e/2013 ]
Feb  5 02:22:55 fep15-rhel8-ha-server2 MirroringControllerOpen[1302592]: LOG: fail over completed.switched from server2 to server1 (MCA00022)

Standby (new primary) Mirroring Controller log

Feb 5 02:22:55 fep15-rhel8-ha-server1 MirroringControllerOpen[11773]: LOG: promotion processing completed (MCA00062)

Arbiter log

No event is recorded in the arbiter Mirroring Controller log

Primary database log

2024-02-05 13:22:55.591 AEDT [1302622] LOG: received immediate shutdown request
2024-02-05 13:22:55.637 AEDT [1302622] LOG: database system is shut down

Standby (New primary) log

2024-02-05 13:22:55.628 AEDT [11808] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2024-02-05 13:22:55.629 AEDT [11807] LOG:  invalid record length at 0/21000148: wanted 24, got 0
2024-02-05 13:22:55.643 AEDT [15753] FATAL:  could not connect to the primary server: connection to server at "10.1.0.21", port 27500 failed: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2024-02-05 13:22:55.643 AEDT [11807] LOG:  waiting for WAL to become available at 0/21000160
2024-02-05 13:22:55.700 AEDT [11807] LOG:  received promote request
2024-02-05 13:22:55.700 AEDT [11807] LOG:  redo done at 0/21000110 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 853.66 s
2024-02-05 13:22:55.703 AEDT [11807] LOG:  selected new timeline ID: 7
2024-02-05 13:22:55.795 AEDT [11807] LOG:  archive recovery complete
2024-02-05 13:22:55.811 AEDT [11805] LOG:  checkpoint starting: force
2024-02-05 13:22:55.814 AEDT [11803] LOG:  database system is ready to accept connections
2024-02-05 13:22:55.845 AEDT [11805] LOG:  checkpoint complete: wrote 2 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.009 s, total=0.035 s; sync files=2, longest=0.005 s, average=0.005 s; distance=0 kB, estimate=14745 kB
2024-02-05 13:22:55.909 AEDT [11803] LOG:  received SIGHUP, reloading configuration files

Scenario 5: Storage error - WAL directory full on primary

Result - Server status: Failed over, as expected.

[fsepuser@fep15-rhel8-ha-server1: ~]$ df -h /mnt/resource/pg_wal/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       7.9G  7.4G     0 100% /mnt/resource
[fsepuser@fep15-rhel8-ha-server1: ~]$

[fsepuser@fep15-rhel8-ha-server1: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
switched

server_id  host_role                  host       host_status  db_proc_status        disk_status
-----------------------------------------------------------------------------------------------
server1    none(inactivated primary)  10.1.0.20  normal       abnormal(postmaster)  normal
server2    primary                    10.1.0.21  normal       abnormal(wal_sender)  normal
[fsepuser@fep15-rhel8-ha-server1: ~]$

Primary database log

2024-02-05 15:13:30.093 AEDT [21638] PANIC:  could not write to file "pg_wal/xlogtemp.21638": No space left on device
2024-02-05 15:13:30.093 AEDT [21638] CONTEXT:  writing block 6186 of relation base/5/16409
        COPY pgbench_accounts, line 503000
2024-02-05 15:13:30.093 AEDT [21638] STATEMENT:  copy pgbench_accounts from stdin with (freeze on)
2024-02-05 15:13:32.410 AEDT [19756] LOG:  server process (PID 21638) was terminated by signal 6: Aborted
2024-02-05 15:13:32.410 AEDT [19756] DETAIL:  Failed process was running: copy pgbench_accounts from stdin with (freeze on)
2024-02-05 15:13:32.410 AEDT [19756] LOG:  terminating any other active server processes
2024-02-05 15:13:32.414 AEDT [19756] LOG:  all server processes terminated; reinitializing
2024-02-05 15:13:33.285 AEDT [21674] LOG:  database system was interrupted; last known up at 2024-02-05 15:13:26 AEDT
2024-02-05 15:13:33.288 AEDT [21678] FATAL:  the database system is in recovery mode
2024-02-05 15:13:33.295 AEDT [21677] FATAL:  the database system is in recovery mode
2024-02-05 15:13:33.298 AEDT [21679] FATAL:  the database system is in recovery mode
2024-02-05 15:13:33.302 AEDT [21680] FATAL:  the database system is in recovery mode
2024-02-05 15:13:33.308 AEDT [21681] FATAL:  the database system is in recovery mode
2024-02-05 15:13:33.634 AEDT [21674] LOG:  database system was not properly shut down; automatic recovery in progress
2024-02-05 15:13:33.640 AEDT [21674] LOG:  redo starts at 0/25000338
2024-02-05 15:13:33.770 AEDT [21674] LOG:  missing contrecord at 0/27FFEDF0
2024-02-05 15:13:33.770 AEDT [21674] LOG:  redo done at 0/27FFEDB0 system usage: CPU: user: 0.06 s, system: 0.06 s, elapsed: 0.13 s
2024-02-05 15:13:33.772 AEDT [21674] FATAL:  could not write to file "pg_wal/xlogtemp.21674": No space left on device
2024-02-05 15:13:33.778 AEDT [19756] LOG:  startup process (PID 21674) exited with exit code 1
2024-02-05 15:13:33.778 AEDT [19756] LOG:  terminating any other active server processes
2024-02-05 15:13:33.779 AEDT [19756] LOG:  shutting down due to startup process failure 2024-02-05 15:13:33.788 AEDT [19756] LOG:  database system is shut down

Primary Mirroring Controller log

Feb  5 04:13:33 fep15-rhel8-ha-server1 MirroringControllerOpen[19726]: WARNING: detected an error on the monitored object "database process(postmaster)": no response:connection to server at "::1", port 27500 failed: FATAL:  the database system is in recovery mode#012 (MCA00019)
Feb  5 04:13:34 fep15-rhel8-ha-server1 MirroringControllerOpen[19726]: WARNING: detected an error on the monitored object "database process(postmaster)": no response:connection to server at "::1", port 27500 failed: Connection refused#012#011Is the server running on that host and accepting TCP/IP connections?#012 (MCA00019)
Feb  5 04:13:34 fep15-rhel8-ha-server1 MirroringControllerOpen[19726]: LOG: starting to fail over from server1 to server2 (MCA00021)
Feb  5 04:13:34 fep15-rhel8-ha-server1 MirroringControllerOpen[19726]: LOG: fail over completed.switched from server1 to server2 (MCA00022)

Standby Mirroring Controller log

Feb 5 04:13:34 fep15-rhel8-ha-server2 MirroringControllerOpen[1310904]: LOG: promotion processing completed (MCA00062)

Arbiter log

No event is recorded in the arbiter Mirroring Controller log

Scenario 6: Mirroring Controler is stopped on primary using mc_ctl stop -M /mc

Result - Server status: No Failover

Primary Mirroring Controller log

Feb  5 04:52:38 fep15-rhel8-ha-server2 MirroringControllerOpen[1346327]: LOG: stopping Mirroring Controller (MCA00041)
Feb  5 04:52:38 fep15-rhel8-ha-server2 MirroringControllerOpen[1310904]: LOG: disconnected from arbitration server "arbiter" (MCA00153)
Feb  5 04:52:38 fep15-rhel8-ha-server2 MirroringControllerOpen[1346327]: LOG: Mirroring Controller stopped  target server:"server2" (MCA00042)

Standby Mirroring Controller log

No event is recorded on the standby Mirroring Controller log.

Arbiter log

Feb 5 04:52:38 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: LOG: disconnected from the database server "server2" (MCR00076)
Feb 5 04:52:38 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: LOG: stop monitoring of database server "server2" (MCR00098)

Primary database log

2024-02-05 15:52:38.299 AEDT [1310934] LOG:  received fast shutdown request
2024-02-05 15:52:38.303 AEDT [1310934] LOG:  aborting any active transactions
2024-02-05 15:52:38.307 AEDT [1310934] LOG:  background worker "logical replication launcher" (PID 1335548) exited with exit code 1
2024-02-05 15:52:38.309 AEDT [1310936] LOG:  shutting down
2024-02-05 15:52:38.414 AEDT [1310936] LOG:  checkpoint starting: shutdown immediate
2024-02-05 15:52:38.443 AEDT [1310936] LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.001 s, total=0.034 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16383 kB, estimate=33059 kB
2024-02-05 15:52:38.706 AEDT [1310934] LOG:  database system is shut down

Scenario 7: Mirroring Controller is forcibly stopped on primary using mc_ctl stop -M /mc -e

Result - Database instance on primary is up and running without any event logged for Mirroring Controller crash. Server status: No failover

[fsepuser@fep15-rhel8-ha-server2: ~]$ mc_ctl stop -M /mc/ -e
stopping Mirroring Controller forcibly (MCA00044)
stopped Mirroring Controller forcibly (MCA00045)
[fsepuser@fep15-rhel8-ha-server2: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
unknown

server_id  host_role  host       host_status  db_proc_status  disk_status
----------------------------------------------------------------------
server1    standby    10.1.0.20  normal       normal          normal
server2    unknown    10.1.0.21  normal       unknown         unknown
[fsepuser@fep15-rhel8-ha-server2: ~]$
[fsepuser@fep15-rhel8-ha-server2: ~]$ pg_ctl -D /database/inst1/ status
pg_ctl: server is running (PID: 1346548)
/opt/fsepv15server64/bin/postgres "-D" "/database/inst1"
[fsepuser@fep15-rhel8-ha-server2: ~]$

Primary database log

No related event in the database log

Primary Mirroring Controller log

Feb 5 05:00:23 fep15-rhel8-ha-server2 MirroringControllerOpen[1346890]: LOG: stopping Mirroring Controller forcibly (MCA00044)
Feb 5 05:00:23 fep15-rhel8-ha-server2 MirroringControllerOpen[1346890]: LOG: stopped Mirroring Controller forcibly (MCA00045)

Standby Mirroring Controller log

No related event in the Mirroring Controller log.

Arbiter log

Feb 5 05:00:23 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: LOG: disconnected from the database server "server2" (MCR00076)

Scenario 8: Mirroring Controller processes (mc_agent and mc_keeper) are killed and Mirroring Controller is stopped, but database instance is kept running

Result - Database instance on primary is up and running without any event logged for Mirroring Controller crash. Server status: No Failover

[fsepuser@fep15-rhel8-ha-server2: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
unknown

server_id  host_role  host       host_status  db_proc_status  disk_status
-------------------------------------------------------------------------
server1    standby    10.1.0.20  normal       normal          normal
server2    unknown    10.1.0.21  normal       unknown         unknown
[fsepuser@fep15-rhel8-ha-server2: ~]$

[fsepuser@fep15-rhel8-ha-server1: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
unknown

server_id  host_role  host       host_status  db_proc_status  disk_status
-------------------------------------------------------------------------
server1    standby    10.1.0.20  normal       normal          normal
server2    unknown    10.1.0.21  normal       unknown         unknown
[fsepuser@fep15-rhel8-ha-server1: ~]$

Mirroring Controller log

No event is recorded in the Mirroring Controller log of primary or standby servers.

Arbiter log

Feb 5 05:37:40 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: LOG: disconnected from the database server "server2" (MCR00076)

Scenario 9: Network error: Network is disrupted between primary and standby - IP address is brought down using ifconfig eth1 down)

Note: Admin and Arbitration are running on separate networks

Result - Network error: Network disrupted (IP address is taken down using ‘ifconfig eth1 down’). Server status: Failed over. Fencing_cmd is executed. Split-brain scenario is prevented.

[root@fep15-rhel8-ha-server1 ~]# ifconfig eth1 down
[root@fep15-rhel8-ha-server1 ~]#

[fsepuser@fep15-rhel8-ha-server2: /mc]$ mc_ctl status -M /mc/
mirroring status
-----------------
switched

server_id  host_role  host         host_status  db_proc_status        disk_status
---------------------------------------------------------------------------------
server1    unknown    10.1.254.21  abnormal     unknown               unknown
server2    primary    10.1.254.22  normal       abnormal(wal_sender)  normal
[fsepuser@fep15-rhel8-ha-server2: /mc]$
[fsepuser@fep15-rhel8-ha-server1: ~]$ mc_ctl status -M /mc/ --arbiter
arbiter_id  host         status
-------------------------------------
arbiter     10.1.254.23  online
[fsepuser@fep15-rhel8-ha-server1: ~]$

[fsepuser@fep15-rhel8-ha-server1: ~]$ mc_ctl status -M /mc/
mc_ctl: The IP address or host name of the server where the command was executed in is not found on the definition file "/mc/network.conf" (MCA00051)
[fsepuser@fep15-rhel8-ha-server1: ~]$

Primary Mirroring Controller log

Feb  8 11:29:41 fep15-rhel8-ha-server1 MirroringControllerOpen[432697]: WARNING: detected an error on the monitored object "server(server2)": no response:ping timeout (MCA00019)
Feb  8 11:29:43 fep15-rhel8-ha-server1 MirroringControllerOpen[432697]: WARNING: detected an error on the monitored object "server(server2)": no response:ping timeout (MCA00019)
Feb  8 11:29:45 fep15-rhel8-ha-server1 MirroringControllerOpen[432697]: WARNING: detected an error on the monitored object "server(server2)": no response:ping timeout (MCA00019)
Feb  8 11:29:45 fep15-rhel8-ha-server1 MirroringControllerOpen[432697]: LOG: executing arbitration command and inquiring the result of arbitration (MCA00191)
Feb  8 11:29:45 fep15-rhel8-ha-server1 MirroringControllerOpen[432697]: LOG: starting degenerate because of result of arbitration command for database server "server2"  result:"0" (MCA00193)
Feb  8 11:29:45 fep15-rhel8-ha-server1 MirroringControllerOpen[432697]: LOG: starting to detach standby server "server2" automatically (MCA00024)
Feb  8 11:29:45 fep15-rhel8-ha-server1 MirroringControllerOpen[432697]: LOG: executing fencing command (MCA00192)
Feb  8 11:29:45 fep15-rhel8-ha-server1 MirroringControllerOpen[432697]: LOG: fencing command for database server "server2" succeeded: result:"0" (MCA00198)
Feb  8 11:29:45 fep15-rhel8-ha-server1 MirroringControllerOpen[432697]: LOG: detach standby server "server2" completed automatically (MCA00025)

Feb  8 11:30:27 fep15-rhel8-ha-server1 MirroringControllerOpen[433813]: ERROR: mc_ctl: The IP address or host name of the server where the command was executed in is not found on the definition file "/mc/network.conf" (MCA00051)

Standby Mirroring Controller log

Feb  8 11:29:42 fep15-rhel8-ha-server2 MirroringControllerOpen[544659]: WARNING: detected an error on the monitored object "server(server1)": no response:ping timeout (MCA00019)
Feb  8 11:29:44 fep15-rhel8-ha-server2 MirroringControllerOpen[544659]: WARNING: detected an error on the monitored object "server(server1)": no response:ping timeout (MCA00019)
Feb  8 11:29:46 fep15-rhel8-ha-server2 MirroringControllerOpen[544659]: WARNING: detected an error on the monitored object "server(server1)": no response:ping timeout (MCA00019)
Feb  8 11:29:46 fep15-rhel8-ha-server2 MirroringControllerOpen[544659]: LOG: executing arbitration command and inquiring the result of arbitration (MCA00191)
Feb  8 11:29:46 fep15-rhel8-ha-server2 MirroringControllerOpen[544659]: LOG: starting degenerate because of result of arbitration command for database server "server1"  result:"0" (MCA00193)
Feb  8 11:29:46 fep15-rhel8-ha-server2 MirroringControllerOpen[544659]: LOG: executing fencing command (MCA00192)
Feb  8 11:29:46 fep15-rhel8-ha-server2 MirroringControllerOpen[544659]: LOG: fencing command for database server "server1" succeeded: result:"0" (MCA00198)
Feb  8 11:29:46 fep15-rhel8-ha-server2 MirroringControllerOpen[544659]: LOG: promotion processing completed (MCA00062)

Scenario 10: Network error: Streaming network is disrupted; bit streaming replication is configured on second network and is disrupted

Note: Mirroring Controller and replication are running on different networks.

Result - Network error: Arbitration network is disrupted. Arbitration network is configured on second network and is disrupted. No failover. Fencing script is not executed. Server status: No Failover

[root@fep15-rhel8-ha-server2 ~]# date; ifconfig eth1 down
Mon Feb  5 17:17:14 AEDT 2024
[root@fep15-rhel8-ha-server2 ~]#

[fsepuser@fep15-rhel8-ha-server2: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
switchable

server_id  host_role  host       host_status  db_proc_status  disk_status
-------------------------------------------------------------------------
server1    standby    10.1.0.20  normal       normal          normal
server2    primary    10.1.0.21  normal       normal          normal
[fsepuser@fep15-rhel8-ha-server2: ~]$

[fsepuser@fep15-rhel8-ha-server1: ~]$ mc_ctl status -M /mc/
mirroring status
-----------------
switchable

server_id  host_role  host       host_status  db_proc_status  disk_status
-------------------------------------------------------------------------
server1    standby    10.1.0.20  normal       normal          normal
server2    primary    10.1.0.21  normal       normal          normal
[fsepuser@fep15-rhel8-ha-server1: ~]$

[fsepuser@fep15-rhel8-ha-arbiter: ~]$ mc_arb status -M /mc
server_id  host         status
-------------------------------------
server1    10.1.254.21  online
server2    10.1.254.22  offline
[fsepuser@fep15-rhel8-ha-arbiter: ~]$

Primary Mirroring Controller log

Feb 5 06:17:39 fep15-rhel8-ha-server2 MirroringControllerOpen[1354449]: ERROR: timeout waiting for communication with Mirroring Controller Arbitration process server:"arbiter" (MCA00154) Feb 5 06:17:39 fep15-rhel8-ha-server2 MirroringControllerOpen[1354449]: LOG: trying to connect to arbitration server "arbiter" (MCA00150)

Standby database log

No event recorded in the standby Mirroring Controller log.

Arbiter log

Feb 5 06:17:16 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: WARNING: detected an error on database server "server2" (MCR00099) Feb 5 06:17:18 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: WARNING: detected an error on database server "server2" (MCR00099) Feb 5 06:17:20 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: WARNING: detected an error on database server "server2" (MCR00099) Feb 5 06:17:39 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]:
disconnected from the database server "server2" (MCR00076) Feb 5 06:17:39 fep15-rhel8-ha-arbiter MirroringControllerArbiter[2468428]: ERROR: unexpected error occurred: class java.lang.RuntimeException: java.net.SocketTimeoutException#012#011at com.fujitsu.fepmc.net.rpc.RPCSocketProcessor.run(Unknown Source)#012#011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)#012#011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)#012#011at java.lang.Thread.run(Thread.java:748)#012 (MCR00030)

Fujitsu Trunkline

< Back to blog home PostgreSQL Zone Ensuring business continuity: The critical role of Mirroring Controller in high-availability systems Rajesh Kandasamy | February 10, 2024

What is Mirroring Controller?

What is the arbitration server?

How does the arbitration server prevent split brain scenarios?

Test case scenarios

Test results with Mirroring Controller but without arbitration server

Scenario 1: Primary is stopped; Linux OS is shut down

Standby (new primary) Mirroring Controller log

Standby (new primary) database log

Scenario 2: Primary database instance is brought down using pg_ctl stop

Primary database log

Primary Mirroring Controller log

Standby Mirroring Controller log

Scenario 3: Postmaster process is killed on primary

Primary Mirroring Controller log

Standby Mirroring Controller log

Scenario 4: Storage error - Data Storage full on primary

Test case scenario execution log

Primary log

Standby (new primary) log

Primary database log

Scenario 5: Storage error - WAL directory full on primary

Primary database log

Primary Mirroring Controller log

Standby Mirroring Controller log

Scenario 6: Mirroring Controler is stopped on primary using mc_ctl stop -M /mc

Primary Mirroring Controller log

Standby Mirroring Controller log

Primary database log

Scenario 7: Mirroring Controller is forcibly stopped on primary using mc_ctl stop -M /mc -e

Primary database log

Primary Mirroring Controller log

Standby Mirroring Controller log

Scenario 8: Mirroring Controller processes (mc_agent and mc_keeper) are killed and Mirroring Controller is stopped, but database instance is kept running

Mirroring Controller log

Scenario 9: Network error: Network is disrupted between primary and standby - IP address is brought down using ifconfig eth1 down)

Server1 Mirroring Controller log

Server2 Mirroring Controller log

Primary database log

Standby database log

Scenario 10: This scenario is not applicable

Test results with Mirroring Controller and arbitration server

Scenario 1: Primary is stopped; Linux OS is shut down

Standby (new primary) Mirroring Controller log

Arbiter Mirroring Controller log

Standby (new primary) database log

Scenario 2: Primary database instance is brought down using pg_ctl stop

Primary database log

Standby (New Primary) log

Primary Mirroring Controller log

Standby Mirroring Controller log

Arbiter log

Scenario 3: Postmaster process is killed on primary

Primary Mirroring Controller log

Standby Mirroring Controller log

Arbiter log

Scenario 4: Storage error - Data Storage full on primary

Test case scenario execution log

Primary Mirroring Controller log

Standby (new primary) Mirroring Controller log

Arbiter log

Primary database log

Standby (New primary) log

Scenario 5: Storage error - WAL directory full on primary

Primary database log

Primary Mirroring Controller log

Standby Mirroring Controller log

Arbiter log

Scenario 6: Mirroring Controler is stopped on primary using mc_ctl stop -M /mc

Primary Mirroring Controller log

Standby Mirroring Controller log

Arbiter log

Primary database log

Scenario 7: Mirroring Controller is forcibly stopped on primary using mc_ctl stop -M /mc -e

Primary database log

Primary Mirroring Controller log

Standby Mirroring Controller log

Arbiter log

Scenario 8: Mirroring Controller processes (mc_agent and mc_keeper) are killed and Mirroring Controller is stopped, but database instance is kept running

< Back to blog home PostgreSQL Zone

Ensuring business continuity: The critical role of Mirroring Controller in high-availability systems
Rajesh Kandasamy | February 10, 2024