<img height="1" width="1" style="display:none;" alt="" src="https://px.ads.linkedin.com/collect/?pid=2826169&amp;fmt=gif">
Start  trial

    Start trial

      Earlier this year, I had the incredible opportunity to present at PGConf India, where I delved into the intricacies of Write-Ahead Logging (WAL) in PostgreSQL. My presentation aimed to demystify this crucial database feature that ensures data integrity and enhances performance. 

      I shared insights on WAL's components, its role in crash recovery, and its significance in both streaming and logical replication. The session was a deep dive into the mechanics of WAL, offering attendees a comprehensive understanding of how it underpins the reliability and efficiency of PostgreSQL databases.

      I discussed how PostgreSQL WAL ensures your database survives crashes, supports point-in-time recovery, and forms the bedrock of replication.

      Unlocking the secrets of WAL: My PGConf India 2025 talk

      Below, you'll find the slides from my presentation at PGConf India 2025.

      Understanding Write-Ahead Logging (WAL)

      Vigneshwaran C

      Software Lead Developer

      Fujitsu

      Agenda

      • What is WAL
      • DML prior to WAL
      • WAL was introduced
      • WAL components
      • WAL segment file
      • WAL record types
      • Checkpoint processing
      • Recovery after crash
      • WAL segment management
      • WAL configurations
      • WAL in streaming replication
      • WAL in logical replication
      • WAL Summarizer process
      • Archiving
      • WAL modes
      • WAL level
      • Diagnosing using pg_waldump
      • Diagnosing using pg_walinspect

      What is WAL

      • The WAL (Write-ahead logs) is a sequential record of all changes made to a database
      • WAL files are stored in the directory pg_wal under the data directory
      • Whenever a transaction modifies data in PostgreSQL, the changes are first written to the WAL file before they are applied to the actual on-disk table data files
      • This process is known as write-ahead logging
      • The WAL in PostgreSQL is stored in a set of log files called WAL segments (16MB by default)
      • Once a segment is filled, it is archived, depending on the database configuration
      • WAL segments preceding the one containing the redo record are no longer needed and can be recycled or removed
      • The WAL segments can be replayed during crash recovery or used for replication purposes
      A redo record is a specific type of WAL record that must be replayed during crash recovery or replication

      DML prior to WAL (<Postgres 7.0)

      1. User performs some Insert into table t1 and commits it.
        postgres=# begin;
        BEGIN
        postgres=*# INSERT INTO t1 VALUES(1);
        INSERT 0 1
        postgres=*# COMMIT;
        COMMIT
      2. Server loads the table t1 to shared buffer pool and modified the page creating dirty pages.
      3. Write to the disk files. If there are changes in 100 tables, 100 files had to be written.
      4. The backend responds to the user, confirming that the transaction has been committed.
      Since synchronous write had to be done in random location for various tables, the DML operations were extremely slow in versions 7.0 or earlier

      WAL was introduced

      1. User performs some Insert into table t1 and commits it.
        postgres=# begin;
        BEGIN
        postgres=*# INSERT INTO t1 VALUES(1);
        INSERT 0 1
        postgres=*# COMMIT;
        COMMIT
      2. Server loads the table t1 to shared buffer pool and modified the page creating dirty pages.

        Note: This modified page is not written to the data files on disk immediately.

      3. WAL records for insert operation are written to WAL buffer.
      4. The WAL buffers for the transaction are written to the WAL files during commit operation by the backend in case of sync commit or by walwriter in case of async commit.
      5. The backend responds to the user, confirming that the transaction has been committed.
      6. Checkpointer or background writer will write the dirty buffer to the physical file of the table.

      WAL Writer process

      • It periodically checks the WAL buffer and writes all unwritten XLOG records to the WAL segments.
      • The WAL writer is enabled by default and cannot be disabled.
      • The check interval is set to the configuration parameter wal_writer_delay, which defaults to 200 milliseconds.
      • The backends first block the space in WAL buffer and will then write later.
      • In the case the backends have not finished writing to WAL buffer, WAL writer process will wait till the WAL buffer is written before flushing to disk.

      WAL components

      • WAL Logs

        WAL logs are the binary files in which various transactions are stored. These logs are written to disk before the changes are made to the database. PostgreSQL maintains multiple WAL (Write-Ahead Log) files. Instead of a single WAL file, it uses a sequence of WAL segments to continuously log database changes. When one WAL segment fills up, PostgreSQL creates a new WAL file. This allows for efficient storage and recycling of WAL files.

      • Checkpoints

        Checkpoints are points in the sequence of transactions within a log file where all the data is guaranteed to have been updated with the information before the checkpoints. These are particularly important in crashes since the WAL file recovers from the most recent checkpoint. This is then where the system performs the REDO operation.

      • WAL buffers

        In PostgreSQL the working units that read and write data to your tables are called "buffers”. The buffers of write-ahead logs are typically called transaction log buffers. These determine the amount of memory allocated to storing WAL data (See 'wal_buffers' GUC later).

      • Log Sequence Numbers

        Log sequence numbers are 64-bit integers assigned to individual WAL records within a WAL, indicating the INSERT position. They are byte offsets unique within a single log stream that increase with each new transaction and are of data type pg_lsn. LSNs can help determine the amount of data in bytes between different transactions.

      WAL segment file

      • A WAL segment is 16 MB by default.
      • A WAL file name consists of three parts:
        • Timeline ID: Tracks different versions of the database’s history. Think of it like saving multiple versions of a database.
        • Logical file ID: Represents the WAL file’s sequence number within the timeline.
        • Segment ID: Identifies a specific segment within the log file. Each segment is typically 16 MB in size.

      WAL segment file

      • Each WAL segment is divided into 8 KB pages
      • The first page contains a special header - XLogLongPageHeaderData
      • All other pages have standard headers - XLogPageHeaderData
      • WAL records are written to WAL page

      WAL records

      Each WAL record consists of multiple components:

      • WAL record header
        • LSN (Log Sequence Number): Unique identifier for the record
        • Transaction ID (XID): Transaction that generated the record
        • Record type: Type of operation (e.g., INSERT, UPDATE, DELETE, COMMIT)
        • CRC checksum: Ensures WAL integrity
      • WAL record data
        • Block references: Identifies which pages (relations) are affected
        • Tuple data (optional): Stores new or old row values (for UPDATE/DELETE)
        • Redo information: Required details to redo the change
      • WAL record tail*
        • Full-page Image (FPI): Snapshot of the entire page (used after checkpoints).
        • Commit status: Indicates if the transaction is fully committed.

      *: Optional

      WAL record types

      Type Details
      CREATE Represents the creation of a database object such as a table, index, or other schema-related entity.
      INSERT Logs the insertion of a new row into a table.
      INSERT+INIT Similar to an INSERT record but occurs when inserting into a newly initialized (empty) page.
      UPDATE Logs the modification of an existing row in a table.
      DELETE Logs the deletion of a row from a table.
      TRUNCATE Logs the truncation of a table, which removes all rows efficiently without logging individual row deletions.
      COMMIT Logs a transaction commit, ensuring that all its changes are permanently recorded.
      LOCK Represents a lock operation, typically used for locking database objects such as tables.
      INVALIDATION Logs cache invalidation events to ensure changes to shared data (e.g., catalogs) are recognized by all backends.
      RUNNING_XACTS Logs running transactions at a particular moment, used in replication and recovery to determine transaction visibility.
      HOT_UPDATE Special form of UPDATE that allows heap-only tuple updates (HOT), optimizing performance by avoiding unnecessary index updates.
      HOT_UPDATE+INIT Similar to HOT_UPDATE, but involves an update into an initialized (empty) page.
      NEW_CID Logs the assignment of a new command ID, used to track command execution order within a transaction.

      Checkpoint processing

      1. A WAL record of this checkpoint is recorded in the log
      2. Shared memory contents like clog are written to disk
      3. All dirty pages of data are written to disk
      4. The checkpoint information like checkpoint LSN, timeline, oldest transaction ID are written to the pg_control file
      5. A WAL record saying checkpoint done is recorded in the log. WAL files that are not required (prior to last checkpoint’s location) are removed
      A checkpoint is a process that ensures data consistency by flushing all modified (dirty) buffers from memory to disk and updating metadata. It marks a point in the WAL (Write-Ahead Log) from which crash recovery can start, minimizing WAL replay time.

      Recovery after crash

      ...
      pg_control version number:            1700
      ...
      Latest checkpoint location:           0/449B5E0
      Latest checkpoint's REDO location:    0/449B588
      Latest checkpoint's REDO WAL file:    000000010000000000000004
      ...
      • PostgreSQL read pg_control to identify the redo location from which the restart must be started.
      • It then reads the WAL from the redo point and replays the WAL record. It will skip applying the WAL if it is already applied.
      • By this process all committed transactions are restored.
      • Any partially executed transactions that were not committed will be rolled back.
      • Now recovery is complete and database is ready to use.

      WAL segment management

      A WAL segment switch occurs when:

      • The segment is filled up
      • The function pg_switch_wal() is called
      • archive_mode is enabled and archive_timeout expires

      WAL configurations

      Config Details
      wal_level replica (default).
      wal_buffers Sets the number of disk-page buffers in shared memory for WAL.
      wal_keep_segments Minimum number of file segments to keep in the pg_wal directory.
      If you do not want to archive, set a slightly increased value.
      checkpoint_flush_after Whenever more than this amount of data has been written while performing a checkpoint, attempt to force the OS to issue these writes to the underlying storage.
      archive_mode When archive_mode is enabled, completed WAL segments are sent to archive storage by setting archive_command or archive_library.
      archive_command Command for archiving the WAL.
      Example: cp %p /mnt/serv/arch_dir %f
      synchronous_commit Synchronization level of the standby servers.
      Set remote_apply to optimize data freshness and on to maintain performance and reliability.
      max_slot_wal_keep_size Specify the maximum size of WAL files that replication slots are allowed to retain in the pg_wal directory at checkpoint time.
      max_wal_senders Number of standby servers +1.
      This value cannot exceed the value set in max_connections.
      wal_sender_timeout Time to wait before determining that the WAL receiver process is in an abnormal state.
      max_connections Maximum number of simultaneous connections to the database server.
      Set a value greater than max_wal_senders.
      synchronous_standby_names Standby servers for synchronous replication. No setting is required for asynchronous replication.
      Example: s1

      WAL in streaming replication

      • Streaming replication, a standard feature of PostgreSQL, allows the updated information on the primary server to be transferred to the standby server in real time, so that the databases of the primary server and standby server can be kept in sync.
      •  
      • Streaming replication works by transferring, or shipping, the WAL to the standby server in real time, and applying it on the standby server.
      •  

      WAL in streaming replication

      1. User performs transaction commit on primary
      2. The backend process handles the transaction and writes changes to the primary data pages.
      3. Changes made by the transaction are logged in the Write-Ahead Log (WAL).
      4. The WAL sender process reads the WAL records from the primary node.
      5. WAL records are transmitted over a network connection from the primary to the standby.
      6. The WAL receiver process on the standby node writes incoming WAL records to its local WAL storage.
      7. The standby server reads the WAL records that were received.
      8. The startup process applies WAL changes to the standby’s data files.
      9. The recovery process applies WAL changes to the standby’s data files.

      WAL in streaming replication

      Logical replication is a method of replicating selective data changes based on the definition of the publication from the publisher to subscriber.

      1. User initiates a transaction commit on the publisher node
      2. The backend process applies changes to the publisher's data files
      3. Changes are logged in WAL (Write-Ahead Log) on the publisher
      4. The WAL sender process checks if the change should be published based on the publication rules
      5. The WAL sender transmits only relevant committed transactions to the subscriber node
      6. The apply worker receives changes and applies transactions to the subscriber's data
      7. The subscriber node can accept read and write queries independently

      WAL Summarizer process

      • The WAL Summarizer process was introduced in PostgreSQL 17. It supports incremental backups, tracks changes to all database blocks (including relations and visibility maps) and writes these modifications to WAL summary files located in the pg_wal/summaries directory
      • The WAL Summarizer process operates as follows:
        • During each checkpoint, it reads WAL segment files from the previous REDO point to the current REDO point
        • It tracks changes to all blocks of all relations (inc. visibility maps) using the WAL segment files
        • It writes the summary to WAL summary files located in the pg_wal/summaries/ directory
      • In the context of the WAL Summarizer process, previous REDO point and current REDO point are referred to as start_lsn and end_lsn, respectively
      • Wal summarizer process will be started if summarize_wal guc is set.
      The Redo Point in PostgreSQL refers to the Log Sequence Number (LSN) from which WAL (Write-Ahead Log) replay must start during crash recovery. It is the position where the last checkpoint started, and serves as the earliest point required for consistent recovery.

      Archiving

      • Continuous archiving is a feature that copies WAL segment files to an archival area at the time when a WAL segment switch occurs

        It is performed by the archiver (background) process. The copied file is called an archive log. This feature is typically used for hot physical backup and PITR (Point-in-Time Recovery)

      • The path to the archival area is set by the archive_command configuration parameter

        For example, the following parameter would copy WAL segment files to the directory /home/db/archives/ every time a segment switch occurs:

        archive_command = cp %p /home/db/archives/%f
      • The %p placeholder represents the copied WAL segment, and the %f placeholder represents the archive log
      • The archive_command parameter can be set to any Unix command or tool.

        This enables the use of scp or other file backup tools to transfer archive logs to remote hosts, instead of relying on simple copy commands.

      WAL modes

      synchronous_commit

      Specifies how much WAL processing must complete before the database server returns a success indication to the client.
      Valid values are remote_apply, on (default), remote_write, local, and off

      synchronous_commit
      setting
      Local
      durable
      commot
      Standby
      durable commit
      after Postgres crash
      Standby
      durable commit
      after OS crash
      Standby
      query
      consistency
      remote_apply
      on
      remote_write
      local
      off

      WAL level

      • minimal
        • This level generates the least WAL volume. It logs no row information for permanent relations in transactions that create or rewrite them. This can make operations much faster. But minimal WAL does not contain sufficient information for point-in-time recovery, so replica or higher must be used to enable continuous archiving (archive_mode) and streaming binary replication.
      • replica
        • It means that the WAL contains enough information for streaming replication to read-only standby servers.
      • logical
        • In logical level, the same information is logged as with replica, plus information needed to extract logical change sets from the WAL. Using a level of logical will increase the WAL volume, particularly if many tables are configured for REPLICA IDENTITY FULL and many UPDATE and DELETE statements are executed.

      Diagnosing using pg_waldump

      pg_waldump is a PostgreSQL server application to display a human-readable rendering of the WAL of a PostgreSQL database cluster.

      pg_waldump [option...] [startseg [endseg]]

      Let’s say you want to see the WAL’s generated for table test_wal whose tablespace ID is 1663, database ID is 5, and relation ID is 16384

      ./pg_waldump node1/pg_wal/000000010000000000000001 -R 1663/5/16384
      rmgr: Heap  len (rec/tot): 59/59, tx: 744, lsn: 0/014AE238, prev 0/014AE200, 
      desc: INSERT+INIT off: 1, flags: 0x08, blkref #0: rel 1663/5/16384 blk 0
      rmgr: Heap  len (rec/tot): 70/70, tx: 755, lsn: 0/014F0C28, prev 0/014F0BF0, 
      desc: HOT_UPDATE old_xmax: 755, old_off: 1, old_infobits: [], flags: 0x10, new_xmax: 0, new_off: 2,
      blkref #0: rel 1663/5/16384 blk 0
      rmgr: Heap  len (rec/tot): 59/59, tx: 756, lsn: 0/014F0CD8, prev 0/014F0CA0, 
      desc: INSERT off: 3, flags: 0x08, blkref #0: rel 1663/5/16384 blk 0
      rmgr: Heap  len (rec/tot): 54/54, tx: 757, lsn: 0/014F0D48, prev 0/014F0D18, 
      desc: DELETE xmax: 757, off: 3, infobits: [KEYS_UPDATED], flags: 0x00, 
      blkref #0: rel 1663/5/16384 blk 0

      Diagnosing using pg_walinspect

      • pg_walinspect is an extension which provides SQL functions that allow you to inspect WAL content
      • This is very useful in cloud environment for people having only database access
      • Similar to pg_waldump, but accessible through SQL rather than separate utility
      • Gets WAL record info about a record located at or after the in_lsn argument
      postgres=# SELECT * FROM pg_get_wal_records_info('0/014AEC80', '0/014FE6B0') LIMIT 1;
      -[ RECORD 1 ]----+---------------------------------------------------------
      start_lsn        | 0/14AEC80
      end_lsn          | 0/14AECB8
      prev_lsn         | 0/14AEC50
      xid              | 752
      resource_manager | Heap
      record_type      | DELETE
      record_length    | 54
      main_data_length | 8
      fpi_length       | 0
      description      | xmax: 752, off: 2, infobits: [KEYS_UPDATED], flags: 0x00
      block_ref        | blkref #0: rel 1663/5/16384 fork main blk 0

      Diagnosing using pg_walinspect

      Gets information about each block reference from all the valid WAL records between start_lsn and end_lsn with one or more block references.

      postgres=# SELECT * FROM pg_get_wal_block_info('0/014AEC80', '0/014FE6B0') limit 1;
      -[ RECORD 1 ]-----+---------------------------------------------------------
      start_lsn         | 0/14AEC80
      end_lsn           | 0/14AECB8
      prev_lsn          | 0/14AEC50
      block_id          | 0
      reltablespace     | 1663
      reldatabase       | 5
      relfilenode       | 16384
      relforknumber     | 0
      relblocknumber    | 0
      xid               | 752
      resource_manager  | Heap
      record_type       | DELETE
      record_length     | 54
      main_data_length  | 8
      block_data_length | 0
      block_fpi_length  | 0
      block_fpi_info    | 
      description       | xmax: 752, off: 2, infobits: [KEYS_UPDATED], flags: 0x00
      block_data        | 
      block_fpi_data    | 

      Summary

      • What is WAL
      • Why WAL is required
      • How WAL helps in recovery
      • How WAL helps in logical & streaming replication
      • Various WAL configurations
      • Diagnosis of WAL

      References

      Thank you

      Vigneshwaran C

      Software Lead Developer

      Fujitsu

      linkedin.com/in/vigneshwaran-c-80b36029

      Conclusion

      ill-people-using-laptop-around-shield-and-gear-and-magnifying-glass-01-variation-04Presenting at PGConf India 2025 was an enriching experience, allowing me to share my insights on WAL with a vibrant community of PostgreSQL enthusiasts. I tried to encapsulate the essence of WAL, from its fundamental concepts to its critical role in ensuring data integrity and performance. By exploring the various components, configurations, and diagnostic tools associated with WAL, I aimed to provide a comprehensive understanding that attendees could apply in their own database environments.

      I hope that sharing the slides here serve as a valuable resource for anyone looking to deepen their knowledge of PostgreSQL's WAL mechanism. Thank you for joining me on this journey through the intricacies of WAL!

      Topics: PostgreSQL

      Receive our blog

      Search by topic

      see all >

      Read our latest blogs

      Read our most recent articles regarding all aspects of PostgreSQL and Fujitsu Enterprise Postgres.

      Receive our blog

      Fill the form to receive notifications of future posts

      Search by topic

      see all >