
PostgreSQL is a robust database system known for its performance, extensibility, and standards compliance. A key aspect of its architecture that contributes to these strengths is how it handles row storage.
Understanding PostgreSQL’s row storage mechanism is crucial for database designers, administrators, and developers aiming to optimize their database performance and storage efficiency. A solid understanding of this area allows for better schema design, efficient query performance, and effective management of large datasets.
Why row storage matters
Row storage directly impacts how data is read, written, and updated in the database. Mismanagement of row storage can lead to inefficient use of disk space, slower queries, and higher maintenance overhead. By grasping the basics of PostgreSQL’s row storage, users can:
Optimize disk usage by designing tables that fit efficiently into PostgreSQL’s storage model
Avoid performance bottlenecks caused by poorly structured data or excessive use of large columns
Take full advantage of PostgreSQL’s advanced features like Multiversion Concurrency Control (MVCC) and TOAST
Basic structure of a database row
Each row in a PostgreSQL table is stored as a tuple in a datafile. Data files are organized such that each table is stored in an extensible series of files each up to 1 GB in size. These are stored within a folder associated with the database in which the table exists (via the OID - object IOD). The data files are organized into pages of a fixed size (normally 8 KB).
PostgreSQL stores all columns of a row in continuous memory, i.e., the same heap file page.
PostgreSQL utilizes block storage I/O, so when it reads data, it will read all columns of all contained rows, regardless of whether the query uses SELECT * or SELECT col1, col2 to access the row from that block.
Each fixed page consists of:
- Page header that is 24 bytes long and contains metadata about the page and three pointers related to managing free space within the page: lower, upper and special, in dark blue in the above diagram.
- An array of item pointers made up of an offset and length to items (actual row data) contained within that page.
- Free space available to store additional rows (generally going to be available on the last page for a table).
- Items being the actual data itself (rows in this case).
- Special space used differently in different types of pages like those used for indexes, which I will discuss in a separate article.
As a new row is stored, the row is added backwards within the page, from the Upper pointer, and a pointer to it is created in the array of item pointers, from the Lower pointer. The actual row data may move around, but the item pointer will stay in the same location and point to the new location, making updates and other activities more efficient.
A note on Multiversion Concurrency Control (MVCC)
MVCC reduces locking of rows by creating new versions of items/rows applicable to specific transactions via visibility information made up of a min transaction identifier and max transaction identifier - transactions within this range will view that specific version of a tuple. Old versions of a row no longer needed still take up space in the page until a vacuum process removes it. That's why a vacuuming strategy in PostgreSQL is very important.
We will explore the impact of MVCC on page management in a later article.
PostgreSQL page inspector extension
PostgreSQL provides an extension called pageinspect that allows us to look into the internal storage structure of the pages we have been discussing. The extension is useful because it provides insights into how the data is stored, allowing us to analyze performance issues and troubleshoot storage inefficiencies, as well as help us develop a better understanding of the database and how it works.
Typical uses for this extension include:
Debugging corruption
Investigating bloat
Investigating index efficiency
The rest of this article will show how we can use this extension to examine a table and see how it is stored in the way described above.
Putting pageinspect up to good use
Below we have a simple table called boat containing 500 rows.
bt=# \d boat
Table "public.boat"
Column | Type | Collation | Nullable | Default
--------+------------------+-----------+----------+---------
id | integer | | |
name | text | | |
reg | double precision | | |
bt=# SELECT COUNT(*) FROM boat;
count
-------
500
(1 row)
bt=#
We can create this table and populate it with the following psql commands:
- Create the table boat.
CREATE TABLE boat (id int, name text, reg float);
- Insert 500 rows generated randomly.
INSERT INTO boat (id, name, reg) SELECT I, md5(i::text), log(i) FROM generate_series(1,500) AS i;
- Install the pageinspect extension, so we can check in more detail how the data is stored.
CREATE EXTENSION pageinspect;
- Select the ctid system column for each row, to obtain the physical location in the format (page, pointer item).
A row’s ctid can change if the row is updated, and therefore should not be used as a key for retrieving data.
bt=# SELECT b.ctid, b.* FROM boat AS b; ctid | id | name | reg --------+-----+----------------------------------+--------------------
… (4,57) | 485 | 218a0aefdldla4be65601cc6ddc1520e | 2.6857417386022635 (4,58) | 486 | 7d04bbbe5494ae9d2f5a76aa1c00fa2f | 2.6866362692622934 (4,59) | 487 | a516a87cfcaef229b342c437fe2b95f7 | 2.6875289612146345 (4,60) | 488 | c3c59e5f8b3e9753913f4d435b53c308 | 2.6884198220027105 (4,61) | 489 | 854d9fca60b4bdO7f9bb215d59ef5561 | 2.6893088591236203 (4,62) | 490 | c410003ef13d451727aeff9082c29a5c | 2.6901960800285142 (4,63) | 491 | 559cb990c9dffd8675f6bc2186971dc2 | 2.6910814921229687 (4,64) | 492 | 55a7cf9c71f1c9c495413f934dd1a158 | 2.6919651027673606 (4,65) | 493 | 2f55707d4193dc27118a0f19a1985716 | 2.6928469192772313 (4,66) | 494 | lbe3bc32e6564055d5ca3e5a354acbef | 2.6937269489236471 (4,67) | 495 | 35051070e572e47d2c26c241ab88307f | 2.6946051989335694 (4,68) | 496 | b534ba68236ba543ae44b22bd110a1d6 | 2.6954816764901977 (4,69) | 497 | 7380ad8a673226ae47fce7bff88e9c33 | 2.6963563887333324 (4,70) | 498 | 05f971b5ec196b8c65b75d2ef8267331 | 2.6972293427597176 (4,71) | 499 | 3cf166c6b73f030b4f67eeaeba301103 | 2.6981005456233897 (4,72) | 500 | cee631121c2ec9232f3a2f028ad5c89b | 2.6989700043360187 (500 rows)We can see from the results that the 500 rows are stored across 4 pages with the last page only containing 72 of the 500 rows. So each of the other pages should contain 107 rows each: (500 – 72) / 4 = 107.
- Show the page header for the first page (full page).
bt=# SELECT * FROM page_header(get_raw_page('boat',1)); lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid ------------+----------+-------+-------+-------+---------+----------+---------+----------- 0/58FAFA98 | 0 | 0 | 452 | 488 | 8192 | 8192 | 4 | 0 (1 row)
We can see that there isn’t enough spare space to store any more rows.
Each row data requires 45 bytes to store (excluding the item pointer and other metadata which requires another). But if we subtract the lower pointer position (452) from the upper pointer position (488) we obtain the amount of free space - in this case, only 36 bytes, while we would require 45 bytes to store another row. - Obtain the same information about the header for page.
bt=# SELECT * FROM page_header(get_raw_page('boat',4)); lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid ------------+----------+-------+-------+-------+---------+----------+---------+----------- 0/58FB6F28 | 0 | 0 | 312 | 3008 | 8192 | 8192 | 4 | 0 (1 row)
The output shows that we have plenty of space left on page 4 to store additional rows: 3008–312=2,696 remaining bytes of space.
Also, note that the special pointer points to the end of the page, indicating that nothing is stored there. This is consistent with our understanding that this area is not used in most cases for general row data storage. - Check item pointer information and its associated bit data.
bt=# SELECT lp, lp_off, lp_len,t_hoff,t_ctid,t_infomask::bit(16) FROM heap_page_items(get_raw_page(‘boat’,0); lp | lp_off | lp_len | t_hoff | t_ctid | t_infomask ----+--------+--------+--------+--------+------------------ 1 | 8120 | 72 | 24 | (0,1) | 0000100100000010 2 | 8048 | 72 | 24 | (0,2) | 0000100100000010 3 | 7976 | 72 | 24 | (0,3) | 0000100100000010 4 | 7904 | 72 | 24 | (0,4) | 0000100100000010 5 | 7832 | 72 | 24 | (0,5) | 0000100100000010 6 | 7760 | 72 | 24 | (0,6) | 0000100100000010 7 | 7688 | 72 | 24 | (0,7) | 0000100100000010 8 | 7616 | 72 | 24 | (0,8) | 0000100100000010 9 | 7544 | 72 | 24 | (0,9) | 0000100100000010 10 | 7472 | 72 | 24 | (0,10) | 0000100100000010 11 | 7400 | 72 | 24 | (0,11) | 0000100100000010 12 | 7328 | 72 | 24 | (0,12) | 0000100100000010 13 | 7256 | 72 | 24 | (0,13) | 0000100100000010 14 | 7184 | 72 | 24 | (0,14) | 0000100100000010 15 | 7112 | 72 | 24 | (0,15) | 0000100100000010 16 | 7040 | 72 | 24 | (0,16) | 0000100100000010 17 | 6968 | 72 | 24 | (0,17) | 0000100100000010 18 | 6896 | 72 | 24 | (0,18) | 0000100100000010 19 | 6824 | 72 | 24 | (0,19) | 0000100100000010 20 | 6752 | 72 | 24 | (0,20) | 0000100100000010 21 | 6680 | 72 | 24 | (0,21) | 0000100100000010 22 | 6608 | 72 | 24 | (0,22) | 0000100100000010 23 | 6536 | 72 | 24 | (0,23) | 0000100100000010 24 | 6464 | 72 | 24 | (0,24) | 0000100100000010 25 | 6392 | 72 | 24 | (0,25) | 0000100100000010 26 | 6320 | 72 | 24 | (0,26) | 0000100100000010
- lp_off is the offset within the page which shows where the actual row is stored.
- lp_len is the length of the pointer.
- t_hoff is the offset within the row data where the payload starts - the first 24 bytes contain metadata about the row data, so as mentioned above, it is additional storage on top of the actual space needed for the row data itself.
- t_infomask shows the bit settings for various things of a row such as whether it contains any nulls, whether there any columns with variable width, whether all of the column data is to be found on this page or we need to go elsewhere to find what we need - these bits allow the database code to be a lot more efficient.
For more details on t_infomask bits, refer to the /src/include/access/htup_details.h file of the open source PostgreSQL project.
/* * information stored in t_infomask: */ #define HEAP_HASNULL 0x0001 /* has null attribute(s) */ #define HEAP_HASVARWIDTH 0x0002 /* has variable-width attribute(s) */ #define HEAP_HASEXTERNAL 0x0004 /* has external stored attribute(s) */ #define HEAP_HASOID_OLD 0x0008 /* has an object-id field */ #define HEAP_XMAX_KEYSHR_LOCK 0x0010 /* xmax is a key-shared locker */ #define HEAP_COMBOCID 0x0020 /* t_cid is a combo CID */ #define HEAP_XMAX_EXCL_LOCK 0x0040 /* xmax is exclusive locker */ #define HEAP_XMAX_LOCK_ONLY 0x0080 /* xmax, if valid, is only a locker */ /* xmax is a shared locker */ #define HEAP_XMAX_SHR_LOCK (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK) #define HEAP_LOCK_MASK (HEAP_XMAX_SHR_LOCK | HEAP_XMAX_EXCL_LOCK | \ HEAP_XMAX_KEYSHR_LOCK) #define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */ #define HEAP_XMIN_INVALID 0x0200 /* t_xmin invalid/aborted */ #define HEAP_XMIN_FROZEN (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID) #define HEAP_XMAX_COMMITTED 0x0400 /* t_xmax committed */ #define HEAP_XMAX_INVALID 0x0800 /* t_xmax invalid/aborted */ #define HEAP_XMAX_IS_MULTI 0x1000 /* t_xmax is a MultiXactId */ #define HEAP_UPDATED 0x2000 /* this is UPDATEd version of row */ #define HEAP_MOVED_OFF 0x4000 /* moved to another place by pre-9.0 * VACUUM FULL; kept for binary * upgrade support */ #define HEAP_MOVED_IN 0x8000 /* moved from another place by pre-9.0 * VACUUM FULL; kept for binary * upgrade support */ #define HEAP_MOVED (HEAP_MOVED_OFF | HEAP_MOVED_IN)
- To find the file on disk that stores the page data, we can use the following query:
bt=# SELECT relfilenode FROM pg_class where relname = 'boat'; relfilenode
16593 (1 row)
Time: 2.922 ms bt=# 1So file 16593 is the file that contains the data for the boat table.
- We can use the hexdump utility to look at the file.
1255 1417 2603_vm 2615_tsm 2661 2691 2838 3381 3574 3764 1255_fsm 1418 2604 2615_vm 2662 2692 2838_fsm 3394 3575 3764_fsm 1255_vm 16593 2605 2616 2663 2693 2838_vm 3394_fsm 3576 3764_vm 1259 16593_fsm 2605 fsm 2616 fsm 2664 2696 2839 3394_vm 3596 3766 1259_fsm 16596 2605_vm 2616_vm 2665 2699 2840 3395 3597 3767 1259_vm 16597 2606 2617 2666 2701 2840_fsm 3429 3598 3997 13375 174 2606_fsm 2617_fsm 2667 2702 2840_vm 3430 3599 4143 13375_fsm 175 2606_vm 2617_vm 2668 2703 2841 3431 3600 4144 13375_vm 2187 2607 2618 2669 2704 2995 3433 3600_fsm 4145 13378 2224 2607_fsm 2618_fsm 2670 2753 2996 3439 3600_vm 4146 13379 2228 2607_vm 2618_vm 2673 2753_fsm 3079 3440 3601 4147 13380 2328 2608 2619 2674 2753_vm 3079_fsm 3455 3601_fsm 4148 13380_fsm 2336 2608_fsm 2619_fsm 2675 2754 3079_vm 3456 3601_vm 4149 13380_vm 2337 2608_vm 2619_vm 2678 2755 3080 3456_fsm 3602 4150 13383 2579 2609 2620 2679 2756 3081 3456_vm 3602_fsm 4151 13384 2600 2609_fsm 2650 2680 2757 3085 3466 3602_vm 4152 13385 2600_fsm 2609_vm 2651 2681 2830 3118 3467 3603 4153 133885_fsm 2600_vm 2610 2652 2682 2831 3119 3468 3603_fsm 4154 postgres@ubuntu-linux-22-04-02-desktop:/var//lib/postgresql/16/main/base/16592$ hexdump -C 16593 00000000 00 00 00 00 08 cf fa 5b 00 60 00 00 c4 01 e8 01 |.......[........| 00000010 00 20 04 20 00 06 00 00 b8 9f 90 00 70 9f 90 00 |. . ........p...| 00000020 28 9f 90 00 e0 9e 90 00 98 9e 90 00 5O 9e 90 00 |( ..........P...| 00000030 08 9e 90 00 co 9d 90 00 78 9d 90 00 30 9d 90 00 |........x...0...| 00000040 e8 9c 90 00 a0 9c 90 00 58 9c 90 00 16 9c 90 00 |........X.......| 00000050 c8 9b 90 00 80 9b 90 00 38 9b 90 00 fO 9d 90 00 |........8.......| 00600060 a8 9a 90 00 60 9a 90 80 18 9d 90 00 do 99 90 00 |.... ...........| 00600070 88 99 90 80 40 99 90 00 f8 98 90 00 bO 98 90 00 |....@ ..........| 00600080 68 98 90 60 20 98 90 00 dB 97 90 00 90 97 90 00 |h... ...........| 00000090 48 97 90 80 00 97 90 00 b8 96 90 00 76 96 90 00 |H...........p...| 00e00oa0 28 96 90 00 e0 95 90 80 98 95 90 00 56 95 90 00 |(...........P...| 006000b0 08 95 90 00 co 94 90 00 78 94 90 00 30 94 90 00 |.... ...x...0...| 000000c8 e8 93 90 00 a0 93 90 60 58 93 90 00 JO 93 90 00 |........X ......|
- If we want to look at the actual part of the file that is holding the data for the row with an id of 20 (which has an offset of 6752 on page 0 according to our earlier query).
15 | 7112 | 72 | 24 | (0,15) | 0000100100000010 16 | 7040 | 72 | 24 | (0,16) | 0000100100000010 17 | 6968 | 72 | 24 | (0,17) | 0000100100000010 18 | 6896 | 72 | 24 | (0,18) | 0000100100000010 19 | 6824 | 72 | 24 | (0,19) | 0000100100000010 20 | 6752 | 72 | 24 | (0,20) | 0000100100000010 21 | 6680 | 72 | 24 | (0,21) | 0000100100000010 22 | 6608 | 72 | 24 | (0,22) | 0000100100000010 23 | 6536 | 72 | 24 | (0,23) | 0000100100000010 24 | 6464 | 72 | 24 | (0,24) | 0000100100000010
- We can again use hexdump with some options to output the part of the file we want.
-n specifies the length we want to output, and -s specifies the amount of data from the beginning that we want to skip (offset).
postgres@ubuntu-linux-22-04-02-desktop:/var//lib/postgresql/16/main/base/16592$ hexdunp -C -n 72 -s 6752 16593 00001a60 a2 04 08 80 00 00 00 00 00 00 00 00 00 00 00 00 |................| 00001a70 14 00 03 00 02 09 18 00 14 00 00 00 43 39 38 66 |............C98f| 00001a80 31 33 37 30 38 32 31 30 31 39 34 63 34 37 35 36 |13708210194c4756| 00001a90 38 37 62 65 36 31 30 36 61 33 62 38 34 00 00 00 |87be6106a3b84...| 00001aa0 80 de 27 d4 04 dl f4 3f |..'....?| 00001aa8
We can see that there is some header information of 24 bytes, and then we can see the value 14 in hexadecimal which is 20 in decimal (being the id), followed by the name beginning with 98f13708, which matches the row details as can be seen below (so we know we have correctly found the row data we were looking for in the physical file):
bt=# SELECT * FROM boat WHERE id = 20;
id | name | reg
----+----------------------------------+--------------------
20 | 98f13708210194c475687be6106a3b84 | 1.3010299956639813
(1 row)
bt=#
In a separate article I will discuss how indexes are stored in the same database page layouts as row data.