With this setup, business can continue even if the primary server fails, by switching to the standby server. PostgreSQL does not have features for error detection and automatic switching. A mechanism for detecting an abnormality and a script for promoting the standby server ('pg_ctl promote') needs to be incorporated. Alternatively, cluster software such as pgpool-II can be used, with its error detection and automatic switching features.
The recovery actions to take differ depending on which server, primary or standby, is down. The following describes measures for each scenario assuming that synchronous replication is used.
If the primary server fails, you need to rebuild it as a standby server. There are various methods to recover the failed server, but two typical methods are described below.
The first method uses database backup (pg_basebackup command). The flow can be summarized as follows:
By doing that, the failed server can be set up as a new standby server. If the database is large, it will take some time to rebuild.
Recovery overview
Recovery procedure
The second method uses cluster files for recovery. This method is possible only if the database on the failed server has not failed due to hardware issues and you can restart the database. The flow can be summarized as below:
By doing the above, the failed server can be set up as a new standby server. Compared to the pg_basebackup command, the pg_rewind command can rebuild the setup in a relatively short time. However, note that the following conditions need to be met when running pg_rewind:
Recovery overview
Recovery procedure
The actions to take when the standby server fails differ depending on whether the standby server was replicated synchronously or asynchronously. The following describes each measure.
If a standby server fails while replicating synchronously, the primary server will keep waiting for a response and will not be able to send a COMMIT response to the client. Therefore, it will be necessary to disconnect the failed standby server, in order to switch from synchronous replication to asynchronous replication.
Delete the failed server name from synchronous_standby_names in postgresql.conf on the primary server, and reload the file ('pg_ctl reload'). If all standby servers have stopped, empty the synchronous_standby_names parameter and then reload.
To recover the failed server, follow the same recovery procedure for primary server described above.
With asynchronous replication, there is no need to edit postgresql.conf or reload the file if a standby server fails. To recover a failed standby server asynchronously replicated, follow the same recovery procedure for primary server described above.
This article provided an overview of failover and recovery measures available when using PostgreSQL streaming replication, which is very useful to achieve database high availability. FUJITSU Enterprise Postgres offers database multiplexing with attractive features such as error detection and automatic failover.
FUJITSU Enterprise Postgres is designed for developers, data architects, and DBAs seeking the efficiency of a leading open source based database technology with the enhanced security and performance features plus support required in a serious business environment.
Read our most recent articles regarding all aspects of PostgreSQL and FUJITSU Enterprise Postgres.