
Extending full-text search
The pg_bigm module is an extension we use with Fujitsu Enterprise Postgres to improve full-text search performance by using 2-gram (bigram) indexes. A bigram is a pair of adjacent characters in a string. The extension helps find text patterns more quickly, especially for languages or datasets with lots of similar words or where the traditional full-text search is slow.
Use cases
- Fast text search
When you need to search through large amounts of text data quickly, such as in chat logs, emails, or articles, pg_bigm can speed up the process by indexing bigrams.
- Handling typo acceptability
It's useful for scenarios where exact matches aren't always required, like searching for user-generated content where spelling mistakes are common.
- Multilingual search
pg_bigm can be particularly effective in languages where words can be split into meaningful parts or where inflection is common, improving search relevance.
Examples
- Searching in a product catalog
Imagine you have an e-commerce site where users frequently search for products. If someone searches for smartfon instead of smartphone, pg_bigm can help return relevant results quickly by matching the bigrams.
- Searching through large logs
If you're managing logs that contain error messages, you might want to find specific errors quickly. With pg_bigm, you can search for the text 404 Not Found and find matches even if the logs contain slightly varied messages like 404 not found or 404 error not found.
How it works
- Bigram index creation
The extension creates indexes on bigrams, making it quicker to locate text patterns. - Search query execution
When a search query is executed, the extension uses the bigram index to quickly find matching records.
Executing a text search without pg_bigm
- Add content to be searched
CREATE TABLE documents (id SERIAL PRIMARY KEY, content TEXT );
INSERT INTO documents (content) VALUES
('is All gold glitters that not'),
('that not All is glitters gold'),
('A database cluster is a collection of databases managed by a Fujitsu Enterprise Postgres server'),
('If you are hearing this definition for the first time, you might be wondering what it means'),
('The term database cluster in Fujitsu Enterprise Postgres does not mean ‘a group of database servers’'),
('A Fujitsu Enterprise Postgres server runs on a single host and manages a single database cluster'),
('Figure 1.1 shows the logical structure of a database cluster'), ('A database is a collection of database objects'),
('In the relational database theory, a database object is a data structure used to store or reference data'),
('A (heap) table is a typical example, and there are many others, such as indexes, sequences, views, functions'),
('In Fujitsu Enterprise Postgres, databases themselves are also database objects and are logically separated from each other'),
('All other database objects (e.g., tables, indexes, etc) belong to their respective databases'),
('All the database objects in Fujitsu Enterprise Postgres are internally managed by respective object identifiers (OIDs), which are unsigned 4-byte integers'),
('The relations between database objects and their respective OIDs are stored in appropriate system catalogs, depending on the type of objects'),
('For example, OIDs of databases and heap tables are stored in pg_database and pg_class respectively'),
('You can find out the OIDs you want to know by issuing the queries such as the following'),
('A database cluster is basically a single directory, referred to as base directory'),
('It contains some subdirectories and many files'),
('When you execute the initdb utility to initialize a new database cluster, a base directory will be created under the specified directory'),
('The path of the base directory is usually set to the environment variable PGDATA'),
('Figure 1.2 shows an example of database cluster in Fujitsu Enterprise Postgres'),
('A database is a subdirectory under the base subdirectory, and each of the tables and indexes is (at least) one file stored under the subdirectory of the database to which it belongs'),
('There are several subdirectories containing particular data and configuration files'),
('While Fujitsu Enterprise Postgres supports tablespaces, the meaning of the term is different from other RDBMSs'),
('A tablespace in Fujitsu Enterprise Postgres is a single directory that contains some data outside of the base directory'),
('In the following subsections, the layout of a database cluster, databases, files associated with tables and indexes, and tablespaces in Fujitsu Enterprise Postgres are described'),
('The layout of database cluster has been described in the official document'),
('Main files and subdirectories in a part of the document have been listed in Table 1.1'),
('jumps fox The dog brown the quick lazy over'),
('A database is a subdirectory under the base subdirectory'),
('The database directory names are identical to the respective OIDs'),
('For example, when the OID of the database ‘sampledb’ is 16384, its subdirectory name is 16384'),
('Each table or index whose size is less than 1GB is stored in a single file under the database directory to which it belongs'),
('Tables and indexes are internally managed by individual OIDs, while their data files are managed by the variable, relfilenode'),
('The relfilenode values of tables and indexes basically but not always match the respective OIDs, the details are described below'); - Create a GIN index
CREATE INDEX gin_content_idx ON documents USING GIN (to_tsvector('english', content));
- Query with GIN index
postgres=# EXPLAIN ANALYZE
postgres-# SELECT id, content FROM documents
postgres-# WHERE to_tsvector('english', content) @@ to_tsquery('fox & quick');
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on documents (cost=12.25..16.76 rows=1 width=36) (actual time=0.029..0.032 rows=1
loops=1)
Recheck Cond: (to_tsvector('english'::regconfig, content) @@ to_tsquery('fox & quick'::text))
Heap Blocks: exact=1
-> Bitmap Index Scan on gin_content_idx (cost=0.00..12.25 rows=1 width=0) (actual time=0.021..0.022
rows=1 loops=1)
Index Cond: (to_tsvector('english'::regconfig, content) @@ to_tsquery('fox & quick'::text))
Planning Time: 0.140 ms
Execution Time: 0.071 ms
(7 rows)
Searching with pg_bigm
By using pg_bigm, you can significantly reduce the time taken for full-text searches, especially in large datasets or in applications where speed and accuracy are crucial.
To implement pg_bigm in a Fujitsu Enterprise Postgres database, you need to install the extension and to follow these steps (the process may vary depending on your operating system and Fujitsu Enterprise Postgres version).
- Create the extension.
CREATE EXTENSION pg_bigm;
The pg_bigm module can be obtained here. Installation varies depending on the operating system and Fujitsu Enterprise Postgres version.
- Add content to be searched
CREATE TABLE bigm_documents (id SERIAL PRIMARY KEY, content TEXT );
INSERT INTO bigm_documents (content) VALUES… - Create a pg_bigm index
CREATE INDEX bigm_content_idx ON bigm_documents USING gin (content gin_bigm_ops);
- Query with pg_bigm index
postgres=# EXPLAIN ANALYZE
postgres-# SELECT id, content FROM bigm_documents WHERE content LIKE '%fox%quick%';
QUERY PLAN
Seq Scan on bigm_documents (cost=0.00..1.44 rows=1 width=36) (actual time=0.027..0.031 rows=1 loops=1)
Filter: (content ~~ '%fox%quick%'::text) Rows Removed by Filter: 34
Planning Time: 0.101 ms
Execution Time: 0.057 ms
(5 rows)