<img height="1" width="1" style="display:none;" alt="" src="https://px.ads.linkedin.com/collect/?pid=2826169&amp;fmt=gif">
Start  trial

    Start trial

In this article I will discuss the intricacies of PostgreSQL row storage, and how understanding them can improve performance and storage efficiency, with detailed insights into page layout and data management.

      PostgreSQL is a robust database system known for its performance, extensibility, and standards compliance. A key aspect of its architecture that contributes to these strengths is how it handles row storage.

      Understanding PostgreSQL’s row storage mechanism is crucial for database designers, administrators, and developers aiming to optimize their database performance and storage efficiency. A solid understanding of this area allows for better schema design, efficient query performance, and effective management of large datasets.

      Why row storage matters

      Row storage directly impacts how data is read, written, and updated in the database. Mismanagement of row storage can lead to inefficient use of disk space, slower queries, and higher maintenance overhead. By grasping the basics of PostgreSQL’s row storage, users can:

      • Optimize disk usage by designing tables that fit efficiently into PostgreSQL’s storage model
      • Avoid performance bottlenecks caused by poorly structured data or excessive use of large columns
      • Take full advantage of PostgreSQL’s advanced features like Multiversion Concurrency Control (MVCC) and TOAST

      Basic structure of a database row

Each row in a PostgreSQL table is stored as a tuple in a data file. Data files are organized such that each table is stored in an extensible series of files, each up to 1 GB in size. These are stored within a folder associated with the database in which the table exists (named after the database's OID, or object ID). The data files are organized into pages of a fixed size (normally 8 KB).

PostgreSQL stores all columns of a row contiguously, i.e., in the same heap file page.

PostgreSQL utilizes block storage I/O, so when it reads data, it reads whole pages (all columns of all rows contained in the page), regardless of whether the query uses SELECT * or SELECT col1, col2 to access a row from that block.

[Diagram: structure of a PostgreSQL database page]

Each fixed page consists of:

• A page header that is 24 bytes long and contains metadata about the page, including three pointers used to manage free space within the page: lower, upper, and special (shown in dark blue in the diagram above).
• An array of item pointers, each made up of an offset and length, referencing the items (actual row data) contained within that page.
• Free space available to store additional rows (generally found on the last page of a table).
• Items, being the actual data itself (rows in this case).
• Special space, used differently in different types of pages, such as those used for indexes, which I will discuss in a separate article.

As a new row is stored, the row data is added backwards within the page, growing down from the Upper pointer, and a pointer to it is created in the array of item pointers, growing up from the Lower pointer. The actual row data may move around, but the item pointer stays in the same location and points to the new location, making updates and other activities more efficient.
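This pointer movement can be sketched with a small Python model. It is a simplification that ignores tuple alignment and line-pointer flag bits, but the 24-byte page header, 4-byte item pointers, and 72-byte rows match the example table examined later in this article:

```python
PAGE_SIZE = 8192
HEADER_SIZE = 24      # fixed page header
POINTER_SIZE = 4      # one item pointer (ItemIdData) entry

def add_row(lower, upper, row_len):
    """Model inserting a tuple: the item grows down from `upper`,
    its item pointer grows up from `lower`. Returns the new
    (lower, upper, item_offset), or None if the page is full."""
    if upper - lower < row_len + POINTER_SIZE:
        return None                 # not enough free space left
    upper -= row_len                # row data stored at the top end
    lower += POINTER_SIZE           # pointer appended at the bottom end
    return lower, upper, upper

lower, upper = HEADER_SIZE, PAGE_SIZE
count = 0
while (state := add_row(lower, upper, 72)):   # 72-byte rows, as in the example table
    lower, upper, _ = state
    count += 1

print(count, lower, upper)   # 107 452 488
```

After 107 rows the model reports lower = 452 and upper = 488, which matches the page_header() output we will see later for a full page of the example table.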

More details about the makeup of a page can be found in the PostgreSQL documentation under Database Physical Storage > Database Page Layout.

      A note on Multiversion Concurrency Control (MVCC)

MVCC reduces row locking by creating new versions of items/rows for specific transactions. Each tuple version carries visibility information made up of a minimum transaction identifier (xmin) and a maximum transaction identifier (xmax); transactions within this range see that specific version of the tuple. Old versions of a row that are no longer needed still take up space in the page until a vacuum process removes them, which is why a vacuuming strategy in PostgreSQL is so important.
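To make the min/max rule concrete, here is a deliberately simplified visibility check in Python. It assumes both transactions committed and ignores hint bits, aborted transactions, and in-progress snapshots; the function name and all transaction IDs are illustrative only:

```python
def tuple_visible(xmin, xmax, snapshot_xid):
    """A tuple version is visible to a transaction if it was created
    at or before the transaction's snapshot (xmin <= snapshot) and was
    not yet deleted/superseded at that point (xmax unset, i.e. 0, or later)."""
    created = xmin <= snapshot_xid
    not_yet_deleted = xmax == 0 or xmax > snapshot_xid
    return created and not_yet_deleted

# An UPDATE by transaction 800 ends the old version and starts a new one:
old_version = (500, 800)   # (xmin, xmax)
new_version = (800, 0)

print(tuple_visible(*old_version, snapshot_xid=700))  # True: sees the old version
print(tuple_visible(*new_version, snapshot_xid=700))  # False: update not visible yet
print(tuple_visible(*new_version, snapshot_xid=900))  # True: sees the new version
```

Note how both versions coexist on the page: transaction 700 still sees the old tuple, while transaction 900 sees the new one. The old version only becomes removable once no snapshot can still see it.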

      We will explore the impact of MVCC on page management in a later article.

      PostgreSQL page inspector extension

      PostgreSQL provides an extension called pageinspect that allows us to look into the internal storage structure of the pages we have been discussing. The extension is useful because it provides insights into how the data is stored, allowing us to analyze performance issues and troubleshoot storage inefficiencies, as well as help us develop a better understanding of the database and how it works.

      Typical uses for this extension include:

• Debugging corruption
• Investigating bloat
• Investigating index efficiency

      The rest of this article will show how we can use this extension to examine a table and see how it is stored in the way described above.

      Putting pageinspect up to good use

      Below we have a simple table called boat containing 500 rows.

bt=# \d boat
               Table "public.boat"
 Column |       Type       | Collation | Nullable | Default
--------+------------------+-----------+----------+---------
 id     | integer          |           |          |
 name   | text             |           |          |
 reg    | double precision |           |          |

      bt=# SELECT COUNT(*) FROM boat;
      count
      -------
      500
      (1 row) 

      bt=# 

      We can create this table and populate it with the following psql commands:

      1. Create the table boat.
        CREATE TABLE boat (id int, name text, reg float);
      2. Insert 500 rows generated randomly.
INSERT INTO boat (id, name, reg) SELECT i, md5(i::text), log(i) FROM generate_series(1,500) AS i;
      3. Install the pageinspect extension, so we can check in more detail how the data is stored.
        CREATE EXTENSION pageinspect;
4. Select the ctid system column for each row, to obtain its physical location in the format (page, item pointer).

        A row’s ctid can change if the row is updated, and therefore should not be used as a key for retrieving data.

        bt=# SELECT b.ctid, b.* FROM boat AS b;
  ctid   | id  |               name               |        reg
---------+-----+----------------------------------+--------------------
…
 (4,57)  | 485 | 218a0aefd1d1a4be65601cc6ddc1520e | 2.6857417386022635
 (4,58)  | 486 | 7d04bbbe5494ae9d2f5a76aa1c00fa2f | 2.6866362692622934
 (4,59)  | 487 | a516a87cfcaef229b342c437fe2b95f7 | 2.6875289612146345
 (4,60)  | 488 | c3c59e5f8b3e9753913f4d435b53c308 | 2.6884198220027105
 (4,61)  | 489 | 854d9fca60b4bd07f9bb215d59ef5561 | 2.6893088591236203
 (4,62)  | 490 | c410003ef13d451727aeff9082c29a5c | 2.6901960800285142
 (4,63)  | 491 | 559cb990c9dffd8675f6bc2186971dc2 | 2.6910814921229687
 (4,64)  | 492 | 55a7cf9c71f1c9c495413f934dd1a158 | 2.6919651027673606
 (4,65)  | 493 | 2f55707d4193dc27118a0f19a1985716 | 2.6928469192772313
 (4,66)  | 494 | 1be3bc32e6564055d5ca3e5a354acbef | 2.6937269489236471
 (4,67)  | 495 | 35051070e572e47d2c26c241ab88307f | 2.6946051989335694
 (4,68)  | 496 | b534ba68236ba543ae44b22bd110a1d6 | 2.6954816764901977
 (4,69)  | 497 | 7380ad8a673226ae47fce7bff88e9c33 | 2.6963563887333324
 (4,70)  | 498 | 05f971b5ec196b8c65b75d2ef8267331 | 2.6972293427597176
 (4,71)  | 499 | 3cf166c6b73f030b4f67eeaeba301103 | 2.6981005456233897
 (4,72)  | 500 | cee631121c2ec9232f3a2f028ad5c89b | 2.6989700043360187
(500 rows)

We can see from the results that the 500 rows are stored across five pages (numbered 0 to 4), with the last page containing only 72 rows. So each of the four full pages contains 107 rows: (500 - 72) / 4 = 107.
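The arithmetic can be double-checked in a couple of lines. This trivial sketch assumes all full pages hold the same number of rows, which holds here because every row is the same length:

```python
total_rows = 500
last_page, rows_on_last_page = 4, 72    # from the highest ctid seen: (4,72)

full_pages = last_page                  # pages 0 to 3 are full
rows_per_full_page = (total_rows - rows_on_last_page) // full_pages

print(rows_per_full_page)               # 107
```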

5. Show the page header for one of the full pages (page 1).
        bt=# SELECT * FROM page_header(get_raw_page('boat',1));
             lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
        ------------+----------+-------+-------+-------+---------+----------+---------+-----------
         0/58FAFA98 |        0 |     0 |   452 |   488 |    8192 |     8192 |       4 |         0
        (1 row)

We can see that there isn’t enough spare space to store any more rows.
Each row’s column data requires 45 bytes to store (4 bytes for the integer, 33 bytes for the text value including its 1-byte header, and 8 bytes for the float), and the item pointer and tuple header require additional space on top of that. If we subtract the lower pointer position (452) from the upper pointer position (488), we obtain the amount of free space: in this case only 36 bytes, less than even the 45 bytes of column data for another row.
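A quick sketch of this check in Python; the 72-byte total tuple length (column data plus tuple header and padding) and the 4-byte item pointer size come from the heap_page_items output shown in step 7:

```python
def free_space(lower, upper):
    """Free space on a heap page is the gap between the line-pointer
    array (growing up to `lower`) and the tuples (growing down to `upper`)."""
    return upper - lower

def fits(lower, upper, tuple_len, pointer_len=4):
    """A new row needs room for its tuple plus one item pointer."""
    return free_space(lower, upper) >= tuple_len + pointer_len

# Page 1 of the boat table, from the page_header() output above:
print(free_space(452, 488))    # 36 bytes left
print(fits(452, 488, 72))      # False: a 72-byte tuple no longer fits
```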

6. Obtain the same information from the header of page 4, the last page.
        bt=# SELECT * FROM page_header(get_raw_page('boat',4));
             lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
        ------------+----------+-------+-------+-------+---------+----------+---------+-----------
         0/58FB6F28 |        0 |     0 |   312 |  3008 |    8192 |     8192 |       4 |         0
        (1 row)

The output shows that we have plenty of space left on page 4 to store additional rows: 3008 - 312 = 2,696 remaining bytes of space.
        Also, note that the special pointer points to the end of the page, indicating that nothing is stored there. This is consistent with our understanding that this area is not used in most cases for general row data storage.

      7. Check item pointer information and its associated bit data.
bt=# SELECT lp, lp_off, lp_len, t_hoff, t_ctid, t_infomask::bit(16) FROM heap_page_items(get_raw_page('boat', 0));
         lp | lp_off | lp_len | t_hoff | t_ctid |    t_infomask
        ----+--------+--------+--------+--------+------------------
          1 |   8120 |     72 |     24 | (0,1)  | 0000100100000010
          2 |   8048 |     72 |     24 | (0,2)  | 0000100100000010
          3 |   7976 |     72 |     24 | (0,3)  | 0000100100000010
          4 |   7904 |     72 |     24 | (0,4)  | 0000100100000010
          5 |   7832 |     72 |     24 | (0,5)  | 0000100100000010
          6 |   7760 |     72 |     24 | (0,6)  | 0000100100000010
          7 |   7688 |     72 |     24 | (0,7)  | 0000100100000010
          8 |   7616 |     72 |     24 | (0,8)  | 0000100100000010
          9 |   7544 |     72 |     24 | (0,9)  | 0000100100000010
         10 |   7472 |     72 |     24 | (0,10) | 0000100100000010
         11 |   7400 |     72 |     24 | (0,11) | 0000100100000010
         12 |   7328 |     72 |     24 | (0,12) | 0000100100000010
         13 |   7256 |     72 |     24 | (0,13) | 0000100100000010
         14 |   7184 |     72 |     24 | (0,14) | 0000100100000010
         15 |   7112 |     72 |     24 | (0,15) | 0000100100000010
         16 |   7040 |     72 |     24 | (0,16) | 0000100100000010
         17 |   6968 |     72 |     24 | (0,17) | 0000100100000010
         18 |   6896 |     72 |     24 | (0,18) | 0000100100000010
         19 |   6824 |     72 |     24 | (0,19) | 0000100100000010
         20 |   6752 |     72 |     24 | (0,20) | 0000100100000010
         21 |   6680 |     72 |     24 | (0,21) | 0000100100000010
         22 |   6608 |     72 |     24 | (0,22) | 0000100100000010
         23 |   6536 |     72 |     24 | (0,23) | 0000100100000010
         24 |   6464 |     72 |     24 | (0,24) | 0000100100000010
         25 |   6392 |     72 |     24 | (0,25) | 0000100100000010
         26 |   6320 |     72 |     24 | (0,26) | 0000100100000010
        
• lp_off is the offset within the page where the actual row is stored.
• lp_len is the length of the item (the row data) that the pointer references.
• t_hoff is the offset within the row data where the payload starts; the first 24 bytes contain metadata about the row, so as mentioned above, this is additional storage on top of the space needed for the column data itself.
• t_infomask holds bit flags describing the row, such as whether it contains any nulls, whether any columns are variable-width, and whether all of the column data is found on this page or we need to look elsewhere; these bits allow the database code to be a lot more efficient.
For more details on t_infomask bits, refer to src/include/access/htup_details.h in the open source PostgreSQL project.
        /*
         * information stored in t_infomask:
         */
        #define HEAP_HASNULL		0x0001	/* has null attribute(s) */
        #define HEAP_HASVARWIDTH	0x0002	/* has variable-width attribute(s) */
        #define HEAP_HASEXTERNAL	0x0004	/* has external stored attribute(s) */
        #define HEAP_HASOID_OLD		0x0008	/* has an object-id field */
        #define HEAP_XMAX_KEYSHR_LOCK	0x0010	/* xmax is a key-shared locker */
        #define HEAP_COMBOCID		0x0020	/* t_cid is a combo CID */
        #define HEAP_XMAX_EXCL_LOCK	0x0040	/* xmax is exclusive locker */
        #define HEAP_XMAX_LOCK_ONLY	0x0080	/* xmax, if valid, is only a locker */
        
         /* xmax is a shared locker */
        #define HEAP_XMAX_SHR_LOCK	(HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK)
        
        #define HEAP_LOCK_MASK	(HEAP_XMAX_SHR_LOCK | HEAP_XMAX_EXCL_LOCK | \
        			 HEAP_XMAX_KEYSHR_LOCK)
        #define HEAP_XMIN_COMMITTED	0x0100	/* t_xmin committed */
        #define HEAP_XMIN_INVALID	0x0200	/* t_xmin invalid/aborted */
        #define HEAP_XMIN_FROZEN	(HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
        #define HEAP_XMAX_COMMITTED	0x0400	/* t_xmax committed */
        #define HEAP_XMAX_INVALID	0x0800	/* t_xmax invalid/aborted */
        #define HEAP_XMAX_IS_MULTI	0x1000	/* t_xmax is a MultiXactId */
        #define HEAP_UPDATED		0x2000	/* this is UPDATEd version of row */
        #define HEAP_MOVED_OFF		0x4000	/* moved to another place by pre-9.0
        					 * VACUUM FULL; kept for binary
        					 * upgrade support */
        #define HEAP_MOVED_IN		0x8000	/* moved from another place by pre-9.0
        					 * VACUUM FULL; kept for binary
        					 * upgrade support */
        #define HEAP_MOVED (HEAP_MOVED_OFF | HEAP_MOVED_IN)
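Using these definitions, the t_infomask value 0000100100000010 we saw for every row can be decoded with a small lookup. The flag values below are copied from htup_details.h above; the lock-related flags and composite masks are omitted for brevity:

```python
# Single-bit t_infomask flags, from src/include/access/htup_details.h:
FLAGS = {
    0x0001: "HEAP_HASNULL",
    0x0002: "HEAP_HASVARWIDTH",
    0x0004: "HEAP_HASEXTERNAL",
    0x0008: "HEAP_HASOID_OLD",
    0x0100: "HEAP_XMIN_COMMITTED",
    0x0200: "HEAP_XMIN_INVALID",
    0x0400: "HEAP_XMAX_COMMITTED",
    0x0800: "HEAP_XMAX_INVALID",
    0x1000: "HEAP_XMAX_IS_MULTI",
    0x2000: "HEAP_UPDATED",
}

def decode_infomask(bits: str):
    """Turn the bit(16) string printed by heap_page_items into flag names."""
    value = int(bits, 2)
    return [name for flag, name in FLAGS.items() if value & flag]

print(decode_infomask("0000100100000010"))
# ['HEAP_HASVARWIDTH', 'HEAP_XMIN_COMMITTED', 'HEAP_XMAX_INVALID']
```

HEAP_HASVARWIDTH reflects the variable-width text column, while HEAP_XMIN_COMMITTED and HEAP_XMAX_INVALID tell us the inserting transaction committed and the row has not been deleted or updated since.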
      8. To find the file on disk that stores the page data, we can use the following query:
bt=# SELECT relfilenode FROM pg_class WHERE relname = 'boat';
 relfilenode
-------------
       16593
(1 row)

Time: 2.922 ms

        So file 16593 is the file that contains the data for the boat table.

9. We can use the hexdump utility to look at the file. Listing the database's directory shows a file named after each relfilenode, including our 16593:
        1255       1417       2603_vm  2615_tsm  2661 2691     2838     3381     3574     3764
        1255_fsm   1418       2604     2615_vm   2662 2692     2838_fsm 3394     3575     3764_fsm
        1255_vm    16593      2605     2616      2663 2693     2838_vm  3394_fsm 3576     3764_vm
1259       16593_fsm  2605_fsm 2616_fsm  2664 2696     2839     3395     3596     3766
        1259_fsm   16596      2605_vm  2616_vm   2665 2699     2840     3395     3597     3767
        1259_vm    16597      2606     2617      2666 2701     2840_fsm 3429     3598     3997 
        13375      174        2606_fsm 2617_fsm  2667 2702     2840_vm  3430     3599     4143 
        13375_fsm  175        2606_vm  2617_vm   2668 2703     2841     3431     3600     4144 
        13375_vm   2187       2607     2618      2669 2704     2995     3433     3600_fsm 4145 
        13378      2224       2607_fsm 2618_fsm  2670 2753     2996     3439     3600_vm  4146 
        13379      2228       2607_vm  2618_vm   2673 2753_fsm 3079     3440     3601     4147 
        13380      2328       2608     2619      2674 2753_vm  3079_fsm 3455     3601_fsm 4148 
        13380_fsm  2336       2608_fsm 2619_fsm  2675 2754     3079_vm  3456     3601_vm  4149 
        13380_vm   2337       2608_vm  2619_vm   2678 2755     3080     3456_fsm 3602     4150 
        13383      2579       2609     2620      2679 2756     3081     3456_vm  3602_fsm 4151 
        13384      2600       2609_fsm 2650      2680 2757     3085     3466     3602_vm  4152 
        13385      2600_fsm   2609_vm  2651      2681 2830     3118     3467     3603     4153 
13385_fsm  2600_vm    2610     2652      2682 2831     3119     3468     3603_fsm 4154
postgres@ubuntu-linux-22-04-02-desktop:/var/lib/postgresql/16/main/base/16592$ hexdump -C 16593
00000000  00 00 00 00 08 cf fa 5b  00 60 00 00 c4 01 e8 01  |.......[.`......|
00000010  00 20 04 20 00 00 00 00  b8 9f 90 00 70 9f 90 00  |. . ........p...|
00000020  28 9f 90 00 e0 9e 90 00  98 9e 90 00 50 9e 90 00  |(...........P...|
00000030  08 9e 90 00 c0 9d 90 00  78 9d 90 00 30 9d 90 00  |........x...0...|
00000040  e8 9c 90 00 a0 9c 90 00  58 9c 90 00 10 9c 90 00  |........X.......|
00000050  c8 9b 90 00 80 9b 90 00  38 9b 90 00 f0 9a 90 00  |........8.......|
00000060  a8 9a 90 00 60 9a 90 00  18 9a 90 00 d0 99 90 00  |....`...........|
00000070  88 99 90 00 40 99 90 00  f8 98 90 00 b0 98 90 00  |....@...........|
00000080  68 98 90 00 20 98 90 00  d8 97 90 00 90 97 90 00  |h... ...........|
00000090  48 97 90 00 00 97 90 00  b8 96 90 00 70 96 90 00  |H...........p...|
000000a0  28 96 90 00 e0 95 90 00  98 95 90 00 50 95 90 00  |(...........P...|
000000b0  08 95 90 00 c0 94 90 00  78 94 90 00 30 94 90 00  |........x...0...|
000000c0  e8 93 90 00 a0 93 90 00  58 93 90 00 10 93 90 00  |........X.......|
10. If we want to look at the actual part of the file that is holding the data for the row with an id of 20, we first need its location: an offset of 6752 on page 0, according to our earlier query.
         15 |   7112 |     72 |     24 | (0,15) | 0000100100000010
         16 |   7040 |     72 |     24 | (0,16) | 0000100100000010
         17 |   6968 |     72 |     24 | (0,17) | 0000100100000010
         18 |   6896 |     72 |     24 | (0,18) | 0000100100000010
         19 |   6824 |     72 |     24 | (0,19) | 0000100100000010
         20 |   6752 |     72 |     24 | (0,20) | 0000100100000010
         21 |   6680 |     72 |     24 | (0,21) | 0000100100000010
         22 |   6608 |     72 |     24 | (0,22) | 0000100100000010
         23 |   6536 |     72 |     24 | (0,23) | 0000100100000010
         24 |   6464 |     72 |     24 | (0,24) | 0000100100000010
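Because each page is a fixed 8 KB, a tuple's absolute position in the data file is simply its page number times the block size plus its lp_off. A tiny helper, assuming the default 8192-byte block size:

```python
BLOCK_SIZE = 8192   # default PostgreSQL page size

def file_offset(page: int, lp_off: int) -> int:
    """Absolute byte position of a tuple within the table's data file."""
    return page * BLOCK_SIZE + lp_off

print(file_offset(0, 6752))        # 6752: row id 20 sits on page 0
print(hex(file_offset(0, 6752)))   # 0x1a60
print(file_offset(4, 312))         # 33080: an offset on page 4 lands 4 blocks in
```

For row id 20 the page number is 0, so the file offset equals lp_off itself: 6752, or 0x1a60.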
      11. We can again use hexdump with some options to output the part of the file we want.

        -n specifies the length we want to output, and -s specifies the amount of data from the beginning that we want to skip (offset).

postgres@ubuntu-linux-22-04-02-desktop:/var/lib/postgresql/16/main/base/16592$ hexdump -C -n 72 -s 6752 16593
00001a60  a2 04 08 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001a70  14 00 03 00 02 09 18 00  14 00 00 00 43 39 38 66  |............C98f|
00001a80  31 33 37 30 38 32 31 30  31 39 34 63 34 37 35 36  |13708210194c4756|
00001a90  38 37 62 65 36 31 30 36  61 33 62 38 34 00 00 00  |87be6106a3b84...|
00001aa0  80 de 27 d4 04 d1 f4 3f                           |..'....?|
        00001aa8  

We can see the 24 bytes of header information, then the value 14 in hexadecimal, which is 20 in decimal (the id), followed by the name beginning with 98f13708. This matches the row details shown below, so we know we have correctly found the row data we were looking for in the physical file:

bt=# SELECT * FROM boat WHERE id = 20;
 id |               name               |        reg
----+----------------------------------+--------------------
 20 | 98f13708210194c475687be6106a3b84 | 1.3010299956639813
(1 row)
        bt=# 
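We can replay this by-hand decoding with Python's struct module. This is a sketch under the layout assumptions described above (a little-endian machine, a 24-byte tuple header, a 1-byte varlena header for the short text value, and the float8 padded to the next 8-byte boundary); the input is the 72 bytes hexdump printed for the tuple:

```python
import struct

# The 72 bytes hexdump printed for the tuple at file offset 6752:
raw = bytes.fromhex(
    "a2 04 08 00 00 00 00 00 00 00 00 00 00 00 00 00"
    "14 00 03 00 02 09 18 00 14 00 00 00 43 39 38 66"
    "31 33 37 30 38 32 31 30 31 39 34 63 34 37 35 36"
    "38 37 62 65 36 31 30 36 61 33 62 38 34 00 00 00"
    "80 de 27 d4 04 d1 f4 3f"
)

t_hoff = raw[22]                                   # header length byte: 0x18 = 24
row_id = struct.unpack_from("<i", raw, t_hoff)[0]  # int column: 0x14 = 20
varlena_header = raw[t_hoff + 4]                   # 0x43: short varlena, (len << 1) | 1
name_len = (varlena_header >> 1) - 1               # 33 total bytes, minus the header byte
name = raw[t_hoff + 5 : t_hoff + 5 + name_len].decode()
reg = struct.unpack_from("<d", raw, 64)[0]         # float8: text ends at byte 61, padded to 64

print(row_id, name, round(reg, 6))                 # 20 98f13708210194c475687be6106a3b84 1.30103
```

The decoded values match the SELECT output above, confirming the column layout: a 4-byte integer, a text value with a 1-byte length header, padding, then the 8-byte double.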

In a separate article, I will discuss how indexes are stored using the same database page layout as row data.

      Topics: PostgreSQL, Database performance, PostgreSQL extension, Database optimization, Multiversion Concurrency Control (MVCC), Row storage, pageinspect extension

      Gary Evans
      Senior Offerings and Center of Excellence Manager
      Gary Evans heads the Center of Excellence team at Fujitsu Software, providing expert services for customers in relation to PostgreSQL and Fujitsu Enterprise Postgres.
      He previously worked in IBM, Cable and Wireless based in London and the Inland Revenue Department of New Zealand, before joining Fujitsu. With over 15 years’ experience in database technology, Gary appreciates the value of data and how to make it accessible across your organization.
      Gary loves working with organizations to create great outcomes through tailored data services and software.
