
The basics of B-tree index
PostgreSQL’s B-tree index is a cornerstone of its indexing mechanism, providing highly efficient query processing. There are many optimizations in the B-tree index - this article will cover the basics of how they work and how we can utilize the pageinspect extension to examine them.
B-tree indexes are the default index type in PostgreSQL. They are created automatically for primary key and unique constraints. All Indexes in PostgreSQL are secondary, as data pages utilize a heap structure in which no index controls the order of records. Indexes are also stored physically separate from the data structures, with leaf node pages typically storing the key and a tuple identifier (pointer) to the data row in the heap. Access to indexes is entirely under the control of the index type’s index access methods (equal, greater, lesser, less or equal, etc. operators need to be defined for each data type that can be indexed), with the core system’s only knowledge of indexes being via these access methods.
Indexes use the same overall page layout as data pages (see my article PostgreSQL row storage), with a page header of 24 bytes, followed by an array of items pointing to actual items within the page, then free space, followed by the actual items, which are appended backwards from the last part identified as special space.
This use of standard pages allows the regular storage manager and buffer manager to access the index contents.
The data in each item contains the data key indexed, and a pointer to either a child page in the index (leaf or internal node page) or a reference to the table row id (tid).
Balanced structure for consistent performance
The B-tree index maintains a balanced structure, ensuring that all leaf nodes are at the same depth. This property guarantees consistent search, insert, and delete operation performance regardless of the dataset size.
The B-tree index is also multi branched, where each page of the index contains hundreds of pointers to either another index node or to row data, keeping the number of levels very small. On average, a PostgreSQL B-tree index can index up to 108 million rows of a table in only 3 levels.
Data is sorted in ascending order within each index page and in the order of pages also. Pointers between nodes (at both the internal and leaf levels) form a linked list allow us to ‘walk’ the index in either ascending or descending order.
The below diagram illustrates a simple B-tree index of only two levels (typically found for a table of less than 180,000 rows.
Figure 1: Basic 2-level index structure
The very top node page is the meta page - it contains a reference to the block which contains the root node. When we want to utilize an index, we start at the root node, and search through the keys to find the one we need to descend to, repeating this process until we arrive at the leaf node holding the reference to the row data we are after.
For example, from the diagram above, if we were searching for data with a key of 12 using the index, we would descend the second item in the root nodes list of items, because 12 is less than 17 and greater than 9, then we would read through the items on that page until we found 12 (or couldn’t find the value on the page).
Let's use the pageinspect extension to look at an index we created on our boat table.
Examining a simple index on the boat table
I created the B-tree index with the following command: CREATE INDEX boat_idx1 ON boat(id);
Let's start at the top and look at the meta page. The only thing we need for this is the name of the index.
bt=# SELECT * FROM bt_metap('boat_idx1');
magic | version | root | level | fastroot | fastlevel | last_cleanup_num_delpages | last_cleanup_num_tuples | allequalimage
--------+---------+------+-------+----------+-----------+---------------------------+-------------------------+---------------
340322 | 4 | 3 | 1 | 3 | 1 | 0 | -1 | t
(1 row)
Figure 2: Meta page
The root column of the meta page tells us which block of this index file the root page is on - in this case we see that the root node page is at block 3.
We can then use bt_page_items to look at the root page of the index:
bt=# SELECT * FROM bt_page_items(get_raw_page('boat_idx1' ,3)); itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | tids ------------+--------+---------+-------+------+-------------------------+------+------+------- 1 | (1,0) | 8 | f | f | | | | 2 | (2,1) | 16 | f | f | 6f 01 00 00 00 00 00 00 | | | 3 | (4,1) | 16 | f | f | dd 02 00 00 00 00 00 00 | | | 4 | (5,1) | 16 | f | f | 4b 04 00 00 00 00 00 00 | | | 5 | (6,1) | 16 | f | f | b9 05 06 60 00 00 00 00 | | | 6 | (7,1) | 16 | f | f | 27 07 00 00 00 00 00 00 | | | 7 | (8,1) | 16 | f | f | 95 08 00 00 00 00 00 00 | | | 8 | (9,1) | 16 | f | f | 03 Oa 00 00 00 00 00 00 | | | 9 | (10,1) | 16 | f | f | 71 Ob 00 00 00 00 00 00 | | | 10 | (11,1) | 16 | f | f | df Oc 00 00 00 00 00 00 | | | 11 | (12,1) | 16 | f | f | 4d Oe 00 00 00 00 00 00 | | | 12 | (13,1) | 16 | f | f | bb Of 00 00 00 00 00 00 | | | 13 | (14,1) | 16 | f | f | 29 11 00 00 00 00 00 00 | | | 14 | (15,1) | 16 | f | f | 97 12 06 00 00 00 00 00 | | | (14 rows)
Figure 3: Root page details
We can see from the above that the root page points to leaf nodes (we don’t have internal nodes since this is a relatively small table) on blocks 1 and 2, and then 4 to 14 because our block node is on block 3.
This is the case because I created the index when we only had 500 rows in the table, requiring only two pages for the index leaf nodes (each leaf page holds around 300 entries). After creating the index, I added a further 4,500 rows, accounting for the additional 12 pages after the root node page. Therefore, we can deduce that when PostgreSQL created the index, it wrote the leaf nodes first (level 0) and then wrote the root node page.
The meta page was created at block 0, which we can test by trying to look at block 0 with bt_page_items:
bt=# SELECT * FROM bt_page_items(get_raw_page('boat_idx1' ,0)); ERROR: block is a meta page bt=#
Figure 4: Validating meta page location
Next, let's look at a leaf page to see how many references to row data each can hold (we won’t look at the last one because this may not be full yet):
351 | (16,102) | 16 | f | f | 16 07 00 00 00 00 00 00 | f | (16,102) | 352 | (16,103) | 16 | f | f | 17 07 00 00 00 00 00 00 | f | (16,103) | 353 | (16,104) | 16 | f | f | 18 07 00 00 00 00 00 00 | f | (16,104) | 354 | (16,105) | 16 | f | f | 19 07 00 00 00 00 00 00 | f | (16,105) | 355 | (16,106) | 16 | f | f | 1a 07 00 00 00 00 00 00 | f | (16,106) | 356 | (16,107) | 16 | f | f | 1b 07 00 00 00 00 00 00 | f | (16,107) | 357 | (17,1) | 16 | f | f | 1c 07 00 00 00 00 00 00 | f | (17,1) | 358 | (17,2) | 16 | f | f | 1d 07 00 00 00 00 00 00 | f | (17,2) | 359 | (17,3) | 16 | f | f | 1e 07 00 00 00 00 00 00 | f | (17,3) | 360 | (17,4) | 16 | f | f | 1f 07 00 00 00 00 00 00 | f | (17,4) | 361 | (17,5) | 16 | f | f | 20 07 00 00 00 00 00 00 | f | (17,5) | 362 | (17,6) | 16 | f | f | 21 07 00 00 00 00 00 00 | f | (17,6) | 363 | (17,7) | 16 | f | f | 22 07 00 00 00 00 00 00 | f | (17,7) | 364 | (17,8) | 16 | f | f | 23 07 00 00 00 00 00 00 | f | (17,8) | 365 | (17,9) | 16 | f | f | 24 07 00 00 00 00 00 00 | f | (17,9) | 366 | (17,10) | 16 | f | f | 25 07 00 00 00 00 00 00 | f | (17,10) | 367 | (17,11) | 16 | f | f | 26 07 00 00 00 00 00 00 | f | (17,11) | (367 rows)
Figure 5: references per leaf node
We can see each page in our case holds 367 references. 5000 table rows divided by 367 entries per page equals 13.6 pages, and we have 14 leaf pages (see 'Figure 3: Root page details' above), the 14th of which is being filled to just of 60% of its capacity.
238 | (46,74) | 16 | f | f | 84 13 00 00 00 00 00 00 | f | (46,74) | 239 | (46,75) | 16 | f | f | 85 13 00 00 00 00 00 00 | f | (46,75) | 240 | (46,76) | 16 | f | f | 86 13 00 00 00 00 00 00 | f | (46,76) | 241 | (46,77) | 16 | f | f | 87 13 00 00 00 00 00 00 | f | (46,77) | 242 | (46,78) | 16 | f | f | 88 13 00 00 00 00 00 00 | f | (46,78) | (242 rows)
While in our example a leaf page is able to hold 367 entries per page, there are a number of factors that can affect this:
- Fill factor – how much spare space is left in a page for future operations
- Larger indexed column types like text or uuids
- An index that covers multiple columns
We saw that with our example of 5000 rows, we only created a 2-level index with no intermediate nodes. So, how many rows do we need before PostgreSQL adds internal node pages between the root node and the leaf nodes making it a 3-level index?
Threshold for a level increase
Working on the assumption that page size is set to 8KB, in general, a root node can point to around 600 leaf nodes, and each leaf node can store 300 entries (pointers to data rows). This means that a 2-level B-tree index can index a table of up to about 180,000 rows (600 × 300) , before it needs to add a level.
When we add a 3rd level with internal nodes (each able to hold references to around 600 leaf nodes), it allows indexing of a table containing around 108 million rows.
So let’s bump up the number of rows in our table to half a million and see what happens to our index.
bt=#INSERT INTO boat(id,name,reg) SELECT i, md5(i::text), log(i) FROM generate_series(501,500000) as i; INSERT 0 499500 bt=#
We now have 500,000 rows in our table. Let’s have a look at the meta page and see what has changed.
bt=# SELECT * FROM bt_metap('boat_idx1');
magic | version | root | level | fastroot | fastlevel | last_cleanup_num_delpages | last_cleanup_num_tuples | allequalimage
--------+---------+------+-------+----------+-----------+---------------------------+-------------------------+----------------
340322 | 4 | 412 | 2 | 412 | 2 | 0 | -1 | t
(1 row)
We can see that the root page has changed from block 3 to block 412. Let's have a look at the root table, which used to point to 14 pages of the type leaf node.
bt=# SELECT * FROM bt_page_items(get_raw_page('boat_idx1',412));
itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | tids
------------+----------+---------+-------+------+-------------------------+------+------+-------
1 | (3,0) | 8 | f | f | | | |
2 | (411,1) | 16 | f | f | 8c 85 01 00 00 00 00 00 | | |
3 | (698,1) | 16 | f | f | 02 1d 03 00 00 00 00 00 | | |
4 | (984,1) | 16 | f | f | 78 b4 04 00 00 00 00 00 | | |
5 | (1270,1) | 16 | f | f | ee 4b 06 00 00 00 00 00 | | |
(5 rows)
Now the root page has 5 pointers instead of 1. You may notice that the first is only 8 bytes long and has no data. This is a high key tuple that indicates the boundary for the range of keys in this page (root page) and is similar in concept to the high key tuple pointer in internal pages.
If we look at the page stats for one of the four other entries, we can see the type column returns i, indicating that these are internal node pages (i:internal, l: leaf, r: root, e: ignored, d: deleted leaf, D: deleted internal). So we can see that this index is now a 3-level index with internal pages.
bt=# SELECT * FROM bt_page_stats('boat_idx1',411);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo_level | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------------+------------
411 | i | 286 | 0 | 15 | 8192 | 2436 | 3 | 698 | 1 | 0
(1 row)
If we have a look at some of the entries in the internal node, we can see the high key pointer at offset 2.
itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | tids
------------+----------+---------+-------+------+-------------------------+------+------+-------
1 | (574,1) | 16 | f | f | 02 1d 03 00 00 00 00 00 | | |
2 | (287,0) | 8 | f | f | | | |
3 | (288,1) | 16 | f | f | fa 86 01 00 00 00 00 00 | | |
4 | (289,1) | 16 | f | f | 68 88 01 00 00 00 00 00 | | |
5 | (290,1) | 16 | f | f | d6 89 01 00 00 00 00 00 | | |
6 | (291,1) | 16 | f | f | 44 8b 01 00 00 00 00 00 | | |
7 | (292,1) | 16 | f | f | b2 8c 01 00 00 00 00 00 | | |
8 | (293,1) | 16 | f | f | 20 8e 01 00 00 00 00 00 | | |
9 | (294,1) | 16 | f | f | 8e 8f 01 00 00 00 00 00 | | |
10 | (295,1) | 16 | f | f | fc 90 01 00 00 00 00 00 | | |
We will use the block number for the entry at offset 5 - in this case, 290 - and check if these are leaf node pages as we expect.
bt=# SELECT * FROM bt_page_stats('boat_idxi' ,290); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo_level | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------------+------------ 290 | l | 367 | 0 | 16 | 8192 | 808 | 289 | 291 | 0 | 1 (1 row)
The type is leaf node, so things seem to work as we would expect.
How an index works for different operators
Below is an updated illustration to show a 3-level B-tree index aligning it with our boat index, though a lot fewer rows depicted. This version shows a lot more entries in leaf nodes than the previous one.
Searching by equality operator
If we are searching for a row with an id of 689, PostgreSQL will descend from the root node via the pointer which key is less than 689 and the next entry is more than 689 - so, it will follow the 332 pointer to the internal page and repeat this comparison and descend down the entry with a key of 617 to the leaf page it points to. Here it reads through the key entries sequentially until it finds the key 689 and then returns the row data (whole page/block will be read unless the selected data is stored in the index – we will discuss covering indexes in a later article).
Searching by range
Searching by range is very similar to searching by equality. The root page and internal page only have to be navigated once, and then the linked list of leaf node pages can be navigated until the top end of the requested range has been found. This is because the key in index leaf nodes and the leaf pages themselves are stored in key order.
The pointers between nodes (separator pointers) are an added efficiency of the Lehman and Yao algorithm that PostgreSQL uses (also referred to as Btree+).