I shared insights on WAL's components, its role in crash recovery, and its significance in both streaming and logical replication. The session was a deep dive into the mechanics of WAL, offering attendees a comprehensive understanding of how it underpins the reliability and efficiency of PostgreSQL databases.

Unlocking the secrets of WAL: My PGConf India 2025 talk
Below, you'll find the slides from my presentation at PGConf India 2025.

Understanding Write-Ahead Logging (WAL)
Vigneshwaran C
Software Lead Developer
Fujitsu

Agenda
- What is WAL
- DML prior to WAL
- WAL was introduced
- WAL components
- WAL segment file
- WAL record types
- Checkpoint processing
- Recovery after crash
- WAL segment management
- WAL configurations
- WAL in streaming replication
- WAL in logical replication
- WAL Summarizer process
- Archiving
- WAL modes
- WAL level
- Diagnosing using pg_waldump
- Diagnosing using pg_walinspect

What is WAL
- The WAL (Write-ahead logs) is a sequential record of all changes made to a database
- WAL files are stored in the directory pg_wal under the data directory
- Whenever a transaction modifies data in PostgreSQL, the changes are first written to the WAL file before they are applied to the actual on-disk table data files
- This process is known as write-ahead logging
- The WAL in PostgreSQL is stored in a set of log files called WAL segments (16MB by default)
- Once a segment is filled, it is archived, depending on the database configuration
- WAL segments preceding the one containing the redo record are no longer needed and can be recycled or removed
- The WAL segments can be replayed during crash recovery or used for replication purposes

DML prior to WAL (<Postgres 7.0)
- User performs some Insert into table t1 and commits it.
postgres=# begin; BEGIN postgres=*# INSERT INTO t1 VALUES(1); INSERT 0 1 postgres=*# COMMIT; COMMIT
- Server loads the table t1 to shared buffer pool and modified the page creating dirty pages.
- Write to the disk files. If there are changes in 100 tables, 100 files had to be written.
- The backend responds to the user, confirming that the transaction has been committed.

WAL was introduced
- User performs some Insert into table t1 and commits it.
postgres=# begin; BEGIN postgres=*# INSERT INTO t1 VALUES(1); INSERT 0 1 postgres=*# COMMIT; COMMIT
- Server loads the table t1 to shared buffer pool and modified the page creating dirty pages.
Note: This modified page is not written to the data files on disk immediately.
- WAL records for insert operation are written to WAL buffer.
- The WAL buffers for the transaction are written to the WAL files during commit operation by the backend in case of sync commit or by walwriter in case of async commit.
- The backend responds to the user, confirming that the transaction has been committed.
- Checkpointer or background writer will write the dirty buffer to the physical file of the table.

WAL Writer process
- It periodically checks the WAL buffer and writes all unwritten XLOG records to the WAL segments.
- The WAL writer is enabled by default and cannot be disabled.
- The check interval is set to the configuration parameter wal_writer_delay, which defaults to 200 milliseconds.
- The backends first block the space in WAL buffer and will then write later.
- In the case the backends have not finished writing to WAL buffer, WAL writer process will wait till the WAL buffer is written before flushing to disk.

WAL components
- WAL Logs
WAL logs are the binary files in which various transactions are stored. These logs are written to disk before the changes are made to the database. PostgreSQL maintains multiple WAL (Write-Ahead Log) files. Instead of a single WAL file, it uses a sequence of WAL segments to continuously log database changes. When one WAL segment fills up, PostgreSQL creates a new WAL file. This allows for efficient storage and recycling of WAL files.
- Checkpoints
Checkpoints are points in the sequence of transactions within a log file where all the data is guaranteed to have been updated with the information before the checkpoints. These are particularly important in crashes since the WAL file recovers from the most recent checkpoint. This is then where the system performs the REDO operation.
- WAL buffers
In PostgreSQL the working units that read and write data to your tables are called "buffers”. The buffers of write-ahead logs are typically called transaction log buffers. These determine the amount of memory allocated to storing WAL data (See 'wal_buffers' GUC later).
- Log Sequence Numbers
Log sequence numbers are 64-bit integers assigned to individual WAL records within a WAL, indicating the INSERT position. They are byte offsets unique within a single log stream that increase with each new transaction and are of data type pg_lsn. LSNs can help determine the amount of data in bytes between different transactions.

WAL segment file
- A WAL segment is 16 MB by default.
- A WAL file name consists of three parts:
- Timeline ID: Tracks different versions of the database’s history. Think of it like saving multiple versions of a database.
- Logical file ID: Represents the WAL file’s sequence number within the timeline.
- Segment ID: Identifies a specific segment within the log file. Each segment is typically 16 MB in size.

WAL segment file
- Each WAL segment is divided into 8 KB pages
- The first page contains a special header - XLogLongPageHeaderData
- All other pages have standard headers - XLogPageHeaderData
- WAL records are written to WAL page

WAL records
Each WAL record consists of multiple components:
- WAL record header
- LSN (Log Sequence Number): Unique identifier for the record
- Transaction ID (XID): Transaction that generated the record
- Record type: Type of operation (e.g., INSERT, UPDATE, DELETE, COMMIT)
- CRC checksum: Ensures WAL integrity
- WAL record data
- Block references: Identifies which pages (relations) are affected
- Tuple data (optional): Stores new or old row values (for UPDATE/DELETE)
- Redo information: Required details to redo the change
- WAL record tail*
- Full-page Image (FPI): Snapshot of the entire page (used after checkpoints).
- Commit status: Indicates if the transaction is fully committed.
*: Optional

WAL record types
Type | Details |
CREATE | Represents the creation of a database object such as a table, index, or other schema-related entity. |
INSERT | Logs the insertion of a new row into a table. |
INSERT+INIT | Similar to an INSERT record but occurs when inserting into a newly initialized (empty) page. |
UPDATE | Logs the modification of an existing row in a table. |
DELETE | Logs the deletion of a row from a table. |
TRUNCATE | Logs the truncation of a table, which removes all rows efficiently without logging individual row deletions. |
COMMIT | Logs a transaction commit, ensuring that all its changes are permanently recorded. |
LOCK | Represents a lock operation, typically used for locking database objects such as tables. |
INVALIDATION | Logs cache invalidation events to ensure changes to shared data (e.g., catalogs) are recognized by all backends. |
RUNNING_XACTS | Logs running transactions at a particular moment, used in replication and recovery to determine transaction visibility. |
HOT_UPDATE | Special form of UPDATE that allows heap-only tuple updates (HOT), optimizing performance by avoiding unnecessary index updates. |
HOT_UPDATE+INIT | Similar to HOT_UPDATE, but involves an update into an initialized (empty) page. |
NEW_CID | Logs the assignment of a new command ID, used to track command execution order within a transaction. |

Checkpoint processing
- A WAL record of this checkpoint is recorded in the log
- Shared memory contents like clog are written to disk
- All dirty pages of data are written to disk
- The checkpoint information like checkpoint LSN, timeline, oldest transaction ID are written to the pg_control file
- A WAL record saying checkpoint done is recorded in the log. WAL files that are not required (prior to last checkpoint’s location) are removed

Recovery after crash
... pg_control version number: 1700 ... Latest checkpoint location: 0/449B5E0 Latest checkpoint's REDO location: 0/449B588 Latest checkpoint's REDO WAL file: 000000010000000000000004 ...
- PostgreSQL read pg_control to identify the redo location from which the restart must be started.
- It then reads the WAL from the redo point and replays the WAL record. It will skip applying the WAL if it is already applied.
- By this process all committed transactions are restored.
- Any partially executed transactions that were not committed will be rolled back.
- Now recovery is complete and database is ready to use.

WAL segment management
A WAL segment switch occurs when:
- The segment is filled up
- The function pg_switch_wal() is called
- archive_mode is enabled and archive_timeout expires

WAL configurations
Config | Details |
wal_level | replica (default). |
wal_buffers | Sets the number of disk-page buffers in shared memory for WAL. |
wal_keep_segments | Minimum number of file segments to keep in the pg_wal directory. If you do not want to archive, set a slightly increased value. |
checkpoint_flush_after | Whenever more than this amount of data has been written while performing a checkpoint, attempt to force the OS to issue these writes to the underlying storage. |
archive_mode | When archive_mode is enabled, completed WAL segments are sent to archive storage by setting archive_command or archive_library. |
archive_command | Command for archiving the WAL. Example: cp %p /mnt/serv/arch_dir %f |
synchronous_commit | Synchronization level of the standby servers. Set remote_apply to optimize data freshness and on to maintain performance and reliability. |
max_slot_wal_keep_size | Specify the maximum size of WAL files that replication slots are allowed to retain in the pg_wal directory at checkpoint time. |
max_wal_senders | Number of standby servers +1. This value cannot exceed the value set in max_connections. |
wal_sender_timeout | Time to wait before determining that the WAL receiver process is in an abnormal state. |
max_connections | Maximum number of simultaneous connections to the database server. Set a value greater than max_wal_senders. |
synchronous_standby_names | Standby servers for synchronous replication. No setting is required for asynchronous replication. Example: s1 |

WAL in streaming replication
- Streaming replication, a standard feature of PostgreSQL, allows the updated information on the primary server to be transferred to the standby server in real time, so that the databases of the primary server and standby server can be kept in sync.
- Streaming replication works by transferring, or shipping, the WAL to the standby server in real time, and applying it on the standby server.

WAL in streaming replication
- User performs transaction commit on primary
- The backend process handles the transaction and writes changes to the primary data pages.
- Changes made by the transaction are logged in the Write-Ahead Log (WAL).
- The WAL sender process reads the WAL records from the primary node.
- WAL records are transmitted over a network connection from the primary to the standby.
- The WAL receiver process on the standby node writes incoming WAL records to its local WAL storage.
- The standby server reads the WAL records that were received.
- The startup process applies WAL changes to the standby’s data files.
- The recovery process applies WAL changes to the standby’s data files.

WAL in streaming replication
Logical replication is a method of replicating selective data changes based on the definition of the publication from the publisher to subscriber.
- User initiates a transaction commit on the publisher node
- The backend process applies changes to the publisher's data files
- Changes are logged in WAL (Write-Ahead Log) on the publisher
- The WAL sender process checks if the change should be published based on the publication rules
- The WAL sender transmits only relevant committed transactions to the subscriber node
- The apply worker receives changes and applies transactions to the subscriber's data
- The subscriber node can accept read and write queries independently

WAL Summarizer process
- The WAL Summarizer process was introduced in PostgreSQL 17. It supports incremental backups, tracks changes to all database blocks (including relations and visibility maps) and writes these modifications to WAL summary files located in the pg_wal/summaries directory
- The WAL Summarizer process operates as follows:
- During each checkpoint, it reads WAL segment files from the previous REDO point to the current REDO point
- It tracks changes to all blocks of all relations (inc. visibility maps) using the WAL segment files
- It writes the summary to WAL summary files located in the pg_wal/summaries/ directory
- In the context of the WAL Summarizer process, previous REDO point and current REDO point are referred to as start_lsn and end_lsn, respectively
- Wal summarizer process will be started if summarize_wal guc is set.

Archiving
- Continuous archiving is a feature that copies WAL segment files to an archival area at the time when a WAL segment switch occurs
It is performed by the archiver (background) process. The copied file is called an archive log. This feature is typically used for hot physical backup and PITR (Point-in-Time Recovery)
- The path to the archival area is set by the archive_command configuration parameter
For example, the following parameter would copy WAL segment files to the directory /home/db/archives/ every time a segment switch occurs:
archive_command = cp %p /home/db/archives/%f
- The %p placeholder represents the copied WAL segment, and the %f placeholder represents the archive log
- The archive_command parameter can be set to any Unix command or tool.
This enables the use of scp or other file backup tools to transfer archive logs to remote hosts, instead of relying on simple copy commands.

WAL modes
synchronous_commit
Specifies how much WAL processing must complete before the database server returns a success indication to the client.
Valid values are remote_apply, on (default), remote_write, local, and off
synchronous_commit setting |
Local durable commot |
Standby durable commit after Postgres crash |
Standby durable commit after OS crash |
Standby query consistency |
remote_apply | ||||
on | ||||
remote_write | ||||
local | ||||
off |

WAL level
- minimal
- This level generates the least WAL volume. It logs no row information for permanent relations in transactions that create or rewrite them. This can make operations much faster. But minimal WAL does not contain sufficient information for point-in-time recovery, so replica or higher must be used to enable continuous archiving (archive_mode) and streaming binary replication.
- replica
- It means that the WAL contains enough information for streaming replication to read-only standby servers.
- logical
- In logical level, the same information is logged as with replica, plus information needed to extract logical change sets from the WAL. Using a level of logical will increase the WAL volume, particularly if many tables are configured for REPLICA IDENTITY FULL and many UPDATE and DELETE statements are executed.

Diagnosing using pg_waldump
pg_waldump is a PostgreSQL server application to display a human-readable rendering of the WAL of a PostgreSQL database cluster.
pg_waldump [option...] [startseg [endseg]]
Let’s say you want to see the WAL’s generated for table test_wal whose tablespace ID is 1663, database ID is 5, and relation ID is 16384
./pg_waldump node1/pg_wal/000000010000000000000001 -R 1663/5/16384 rmgr: Heap len (rec/tot): 59/59, tx: 744, lsn: 0/014AE238, prev 0/014AE200, desc: INSERT+INIT off: 1, flags: 0x08, blkref #0: rel 1663/5/16384 blk 0 rmgr: Heap len (rec/tot): 70/70, tx: 755, lsn: 0/014F0C28, prev 0/014F0BF0, desc: HOT_UPDATE old_xmax: 755, old_off: 1, old_infobits: [], flags: 0x10, new_xmax: 0, new_off: 2, blkref #0: rel 1663/5/16384 blk 0 rmgr: Heap len (rec/tot): 59/59, tx: 756, lsn: 0/014F0CD8, prev 0/014F0CA0, desc: INSERT off: 3, flags: 0x08, blkref #0: rel 1663/5/16384 blk 0 rmgr: Heap len (rec/tot): 54/54, tx: 757, lsn: 0/014F0D48, prev 0/014F0D18, desc: DELETE xmax: 757, off: 3, infobits: [KEYS_UPDATED], flags: 0x00, blkref #0: rel 1663/5/16384 blk 0

Diagnosing using pg_walinspect
- pg_walinspect is an extension which provides SQL functions that allow you to inspect WAL content
- This is very useful in cloud environment for people having only database access
- Similar to pg_waldump, but accessible through SQL rather than separate utility
- Gets WAL record info about a record located at or after the in_lsn argument
postgres=# SELECT * FROM pg_get_wal_records_info('0/014AEC80', '0/014FE6B0') LIMIT 1; -[ RECORD 1 ]----+--------------------------------------------------------- start_lsn | 0/14AEC80 end_lsn | 0/14AECB8 prev_lsn | 0/14AEC50 xid | 752 resource_manager | Heap record_type | DELETE record_length | 54 main_data_length | 8 fpi_length | 0 description | xmax: 752, off: 2, infobits: [KEYS_UPDATED], flags: 0x00 block_ref | blkref #0: rel 1663/5/16384 fork main blk 0

Diagnosing using pg_walinspect
Gets information about each block reference from all the valid WAL records between start_lsn and end_lsn with one or more block references.
postgres=# SELECT * FROM pg_get_wal_block_info('0/014AEC80', '0/014FE6B0') limit 1; -[ RECORD 1 ]-----+--------------------------------------------------------- start_lsn | 0/14AEC80 end_lsn | 0/14AECB8 prev_lsn | 0/14AEC50 block_id | 0 reltablespace | 1663 reldatabase | 5 relfilenode | 16384 relforknumber | 0 relblocknumber | 0 xid | 752 resource_manager | Heap record_type | DELETE record_length | 54 main_data_length | 8 block_data_length | 0 block_fpi_length | 0 block_fpi_info | description | xmax: 752, off: 2, infobits: [KEYS_UPDATED], flags: 0x00 block_data | block_fpi_data |

Summary
- What is WAL
- Why WAL is required
- How WAL helps in recovery
- How WAL helps in logical & streaming replication
- Various WAL configurations
- Diagnosis of WAL

References
- The Internals of PostgreSQL
- PostgreSQL 14 Internals
- PostgreSQL Documentation

Conclusion
Presenting at PGConf India 2025 was an enriching experience, allowing me to share my insights on WAL with a vibrant community of PostgreSQL enthusiasts. I tried to encapsulate the essence of WAL, from its fundamental concepts to its critical role in ensuring data integrity and performance. By exploring the various components, configurations, and diagnostic tools associated with WAL, I aimed to provide a comprehensive understanding that attendees could apply in their own database environments.
I hope that sharing the slides here serve as a valuable resource for anyone looking to deepen their knowledge of PostgreSQL's WAL mechanism. Thank you for joining me on this journey through the intricacies of WAL!