In this article, we share empirical measurements from our experiments using Postgres with pg_trgm to index and search 10,000 top GitHub repositories on only a MacBook.

This is a follow-up to “Postgres Trigram search learnings”, in which we shared several learnings and beliefs about trying to use Postgres trigram indexes as an alternative to Google’s Zoekt (“Fast trigram based code search”).

We share our results, as well as the exact steps we performed, scripts, and lists of the top 20,000 repositories by stars/language on GitHub so you can reproduce the results yourself should you desire.

TL;DR

This article is extensive and more akin to a research paper than a blog post. If you’re only interested in our conclusions, skip to the conclusions section.

Goals

We wanted to get empirical measurements of how suitable Postgres is for providing regexp search over documents, e.g. as an alternative to Google’s Zoekt (“Fast trigram based code search”). Specifically:

  • How many repositories can we index on just a 2019 MacBook Pro?
  • How fast are different regexp searches over the corpus?
  • What Postgres 13 configuration gives the best results?
  • What other operational effects need consideration if seriously attempting to use Postgres as the backend for a regexp search engine?
  • What is the best database schema to use?

Hardware

We ran all tests on a 2019 MacBook Pro with:

  • 2.3 GHz 8-Core Intel Core i9
  • 16 GB 2667 MHz DDR4

During test execution, few other Mac applications were in use, so effectively all CPU and memory was available to Postgres.

Corpus

We scraped lists of the top 1,000 repositories from the GitHub search API ranked by stars for each of the following languages (~20.5k repositories in total):

  • C++, C#, CSS, Go, HTML, Java, JavaScript, MATLAB, ObjC, Perl, PHP, Python, Ruby, Rust, Shell, Solidity, Swift, TypeScript, VB .NET, and Zig.

Cloning all ~20.5k repositories in parallel took ~14 hours with a fast ~100 Mbps connection to GitHub’s servers.

Dataset reduction

We found the amount of disk space required by git clone --depth 1 on these repositories to be a sizable ~412G for just 12,148 repositories, so we put several processes in place to further reduce the dataset size by about 66%:

  • Removing .git directories resulted in a 30% reduction (412G -> 290G, for 12,148 repositories)
  • Removing files > 1 MiB resulted in another 51% reduction (290G -> 142G, for 12,148 repositories - note GitHub does not index files > 384 KiB in their search engine)
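
We did this pruning on disk before inserting anything into Postgres. If the corpus were already loaded into the files table described below, a roughly equivalent cutoff could be applied in SQL - a sketch only, not something we ran:

DELETE FROM files WHERE octet_length(contents) > 1048576; -- 1048576 bytes = 1 MiB
VACUUM FULL files; -- optionally rewrite the table to reclaim the disk space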

Database insertion

We concurrently inserted the entire corpus into Postgres, with the following DB schema:

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE TABLE IF NOT EXISTS files (
    id bigserial PRIMARY KEY,
    contents text NOT NULL,
    filepath text NOT NULL
);

In total, this took ~8 hours to complete, and Postgres’s entire on-disk utilization was 101G.
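
Each file ends up as one row in this table. A minimal sketch of the kind of insert involved - our loader inserted concurrently, though not necessarily with this exact statement, and the values below are purely illustrative:

-- filepath is the repository-relative path; contents is the raw file text.
INSERT INTO files (filepath, contents)
VALUES ('golang/go/src/fmt/print.go', '... file contents ...');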

Creating the Trigram index

We tried three separate times to index the dataset using the following GIN Trigram index:

CREATE INDEX IF NOT EXISTS files_contents_trgm_idx ON files USING GIN (contents gin_trgm_ops);
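
These index builds run for many hours; on Postgres 12+ the progress of a running CREATE INDEX can be checked from a separate session:

-- Reports which phase the build is in and how far along it is.
SELECT phase, blocks_done, blocks_total, tuples_done, tuples_total
FROM pg_stat_progress_create_index;
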
  • In the first attempt, we hit an OOM after 11 hours and 34 minutes. This was due to a rapid spike in memory usage at the very end of indexing. We used a fairly aggressive Postgres configuration with a very large max WAL size, so it was not entirely unexpected.
  • In the second attempt, we ran out of SSD disk space after ~27 hours. Notably, disk usage grew mostly towards the end of indexing, similar to the memory spike we hit in the first attempt - it was not a gradual increase over time. For this attempt, we used the excellent pgtune tool to scale back our first Postgres configuration as follows:
shared_buffers = 4GB → 2560MB
effective_cache_size = 12GB → 7680MB
maintenance_work_mem = 16GB → 1280MB
default_statistics_target = 100 → 500
work_mem = 5242kB → 16MB
min_wal_size = 50GB → 4GB
max_wal_size = 4GB → 16GB
max_parallel_workers_per_gather = 8 → 4
max_parallel_maintenance_workers = 8 → 4
  • In our third and final attempt, we cut the dataset in half and indexing succeeded after 22 hours. Specifically, we deleted half of the files in the database (from 19,441,820 files / 178 GiB of data to 9,720,910 files / 82 GiB of data), as sketched below. The Postgres configuration used was the same as in attempt 2.
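
The halving itself amounts to deleting roughly every other row; a minimal sketch of the idea (not necessarily the exact statements we ran):

DELETE FROM files WHERE id % 2 = 0; -- drop roughly half of the files by id
VACUUM FULL files;                  -- rewrite the table to actually free the disk space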

Indexing performance: Memory usage

In our first attempt, we see the memory usage of the container, as reported by docker stats, grow to 12 GiB (chart shows MiB of memory used over time):

image

In our second and third attempts, we see far less memory usage (~1.6 GiB consistently):

image

image

Indexing performance: CPU usage

Postgres’s trigram indexing appears to be mostly single-threaded (at least when indexing a single table; we test multiple tables later).

In our first attempt, CPU usage for the container did not rise above 156% (one and a half virtual CPU cores):

image

Our second attempt was around 150-200% CPU usage on average:

image

Our third attempt similarly saw an average of 150-200%, but with a brief spike towards the end to ~350% CPU:

image

Indexing performance: Disk IO

Disk reads (blue) and writes (red) during indexing averaged about ~250 MB/s each. Native disk benchmarks show the same MacBook is able to achieve read/write speeds of ~860 MB/s with <5% effect on CPU utilization.

Addition made Feb 20, 2021: We ran tests using native Postgres as well (instead of Postgres in Docker with a bind mount) and found better indexing and query performance; more on this below.

image

Indexing performance: Disk space

The database contains 9,720,910 files totalling 82.07 GiB:

postgres=# select count(filepath) from files;
  count  
---------
 9720910
(1 row)

postgres=# select SUM(octet_length(contents)) from files;
     sum     
-------------
 88123563320
(1 row)

Before indexing, we find that all of Postgres is consuming 54G:

$ du -sh .postgres/
 54G	.postgres/

After CREATE INDEX, Postgres uses:

$ du -sh .postgres/
 73G	.postgres/

Thus, the index size for 82 GiB of text is 19 GiB (or 23% of the data size).
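
The same figures can also be read from inside Postgres, assuming the table and index names used above:

-- Table size (heap + TOAST, excluding indexes) and the trigram GIN index size.
SELECT pg_size_pretty(pg_table_size('files')) AS table_size,
       pg_size_pretty(pg_relation_size('files_contents_trgm_idx')) AS trgm_index_size;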

Database startup times

From an operational standpoint, it is worth noting that if Postgres is starting clean (i.e. previous shutdown was graceful) then startup time is almost instantaneous: it begins accepting connections immediately and loads the index as needed.

However, if Postgres previously terminated non-gracefully (e.g. it was killed or crashed), startup can take a hefty ~10 minutes with this dataset as it goes through an automated recovery process.

Queries executed

In total, we executed 19,936 search queries against the index. We chose queries that we expect to give varying amounts of coverage over the trigram index (that is, queries whose trigrams are more or less likely to occur in many files). An example of the SQL shape we use is shown after the tables below.

Regexp query | Matching # files in entire dataset
var | unknown (2m+ suspected)
error | 1,479,452
123456789 | 59,841
fmt\.Error | 127,895
fmt\.Println | 22,876
bytes.Buffer | 34,554
fmt\.Print.* | 37,319
ac8ac5d63b66b83b90ce41a2d4061635 | 0
d97f1d3ff91543[e-f]49.8b07517548877 | 0
Detailed breakdown

Query | Result limit | Times executed
var | 10 | 1000
var | 100 | 1000
var | 1000 | 100
var | unlimited | 4
error | 10 | 2000
error | 100 | 2000
error | 1000 | 200
error | unlimited | 18
123456789 | 10 | 1000
123456789 | 100 | 1000
123456789 | 1000 | 100
123456789 | unlimited | 2
fmt\.Error | 10 | 1000
fmt\.Error | 100 | 1000
fmt\.Error | 1000 | 100
fmt\.Error | unlimited | 2
fmt\.Println | 10 | 1000
fmt\.Println | 100 | 1000
fmt\.Println | 1000 | 100
fmt\.Println | unlimited | 2
bytes.Buffer | 10 | 4
bytes.Buffer | 100 | 4
bytes.Buffer | 1000 | 4
bytes.Buffer | unlimited | 2
fmt\.Print.* | 10 | 1000
fmt\.Print.* | 100 | 1000
fmt\.Print.* | 1000 | 100
fmt\.Print.* | unlimited | 2
ac8ac5d63b66b83b90ce41a2d4061635 | 10 | 1000
ac8ac5d63b66b83b90ce41a2d4061635 | 100 | 1000
ac8ac5d63b66b83b90ce41a2d4061635 | 1000 | 100
ac8ac5d63b66b83b90ce41a2d4061635 | unlimited | 2
d97f1d3ff91543[e-f]49.8b07517548877 | 10 | 1000
d97f1d3ff91543[e-f]49.8b07517548877 | 100 | 1000
d97f1d3ff91543[e-f]49.8b07517548877 | 1000 | 100
d97f1d3ff91543[e-f]49.8b07517548877 | unlimited | 2
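
For reference, each regexp above was issued as a query of roughly this shape, relying on pg_trgm's ability to accelerate the ~ regexp operator with the GIN index (the exact SQL we ran may differ slightly):

-- Regexp search over file contents, served by files_contents_trgm_idx.
-- The LIMIT was varied (10, 100, 1000, or omitted) per the breakdown above.
SELECT filepath FROM files WHERE contents ~ 'fmt\.Println' LIMIT 10;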

Query performance

In total, we executed 19,936 search queries against the database (serially, not in parallel), which completed in the following times:

Time bucket | Percentage of queries | Number of queries
Under 50ms | 30% | 5,933
Under 250ms | 41% | 8,088
Under 500ms | 52% | 10,275
Under 750ms | 63% | 12,473
Under 1s | 68% | 13,481
Under 1.5s | 74% | 14,697
Under 3s | 79% | 15,706
Under 25s | 79% | 15,708
Under 30s | 99% | 19,788

Query performance vs. planning time

The following scatter plot shows how 79% of queries executed in under 3s (Y axis, in ms), while Postgres’s query planning time was generally under 100-250ms (X axis, in ms):

image

If we expand the view to include all queries, we start to get a picture of just how much of an outlier the remaining 21% of queries are (note that the small block of dots in the bottom left represents the same diagram shown above):

image

Query time vs. CPU & Memory usage

The following image shows:

  • (top) Query time in milliseconds
  • (middle) CPU usage percentage (e.g. 801% refers to 8 out of 16 virtual CPU cores being consumed)
  • (bottom) Memory usage in MiB.

image

Notable insights from this are:

  • The large increase in resource usage towards the end is when we began executing queries with no LIMIT.
  • CPU usage does not exceed 138%, until the spike at the end.
  • Memory usage does not exceed 42 MiB, until the spike at the end.

We suspect pg_trgm is single-threaded within the scope of a single table, but with table partitioning (or splitting the data into multiple tables, each holding a subset of the data), we suspect better parallelism could be achieved.
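
For example, the declarative-partitioning variant of this idea would look roughly like the following. This is a sketch only - we did not benchmark this form, and the table-splitting experiment below uses separate plain tables instead:

-- Hash-partition the corpus by id so each partition gets its own, smaller
-- trigram index. Hypothetical schema; adjust the number of partitions to taste.
CREATE TABLE files_partitioned (
    id bigint NOT NULL,
    contents text NOT NULL,
    filepath text NOT NULL
) PARTITION BY HASH (id);

CREATE TABLE files_part_0 PARTITION OF files_partitioned FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE files_part_1 PARTITION OF files_partitioned FOR VALUES WITH (MODULUS 4, REMAINDER 1);
-- ...and so on for the remaining partitions. Indexing the parent creates a
-- GIN trigram index on each existing partition (Postgres 11+):
CREATE INDEX files_partitioned_trgm_idx ON files_partitioned USING GIN (contents gin_trgm_ops);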

Investigating slow queries

If we plot the number of index rechecks (X axis) vs. execution time (Y axis), we can clearly see one of the most significant aspects of slow queries is that they have many more index rechecks:

image

And if we look at the EXPLAIN ANALYZE output for one of these queries, we can also confirm the Parallel Bitmap Heap Scan is slow due to Rows Removed by Index Recheck.
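
The plan for one of these slow queries can be reproduced with something along these lines (the query shown is just an example):

-- The interesting parts of the output are the Parallel Bitmap Heap Scan node
-- and its "Rows Removed by Index Recheck" count.
EXPLAIN (ANALYZE, BUFFERS)
SELECT filepath FROM files WHERE contents ~ 'fmt\.Print.*' LIMIT 1000;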

Table splitting

Splitting the search index into multiple smaller tables seems like an obvious approach to getting pg_trgm to use multiple CPU cores. We tried this by splitting the same exact data set into 200 tables, and found numerous benefits:

Benefit 1: Incremental indexing

If indexing fails after 11-27 hours, as happened to us twice in the non-splitting approach, all progress is not lost.

Benefit 2: Parallel indexing

Unlike the non-splitting approach, where we were only able to utilize 1.5-2 virtual CPU cores, with multiple tables we are able to utilize 8-9 virtual CPU cores:

image

Benefit 3: Indexing is 84% faster

Unlike the single-table approach, which took 22 hours in total, parallel indexing completed in only 3h27m.

Benefit 4: Indexing uses 69% less memory

With non-splitting we saw peak memory usage up to 12 GiB. With the same exact Postgres configuration, we were able to index with only 3.7 GiB peak memory usage:

image

Benefit 5: Parallel querying

Previously, we saw CPU utilization of only 138% (1.3 virtual CPU cores); with table splitting, we see CPU utilization during queries of 1600% (16 virtual CPU cores), showing we are doing work fully in parallel:

image

Similarly, we saw memory usage average around ~380 MiB, compared to only ~42 MiB before:

image
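
How the split tables are queried matters here: one option is a single UNION ALL across the per-table queries (which Postgres can execute with a parallel append), another is issuing the per-table queries concurrently from the application. A sketch of the former, assuming the files_000 ... files_199 naming used later - we do not claim this is the exact form we used:

-- Fan one regexp search out over the split tables; extend the UNION ALL to
-- cover all 200 tables. The LIMIT applies to the combined result.
SELECT filepath FROM files_000 WHERE contents ~ 'fmt\.Println'
UNION ALL
SELECT filepath FROM files_001 WHERE contents ~ 'fmt\.Println'
-- ...
LIMIT 10;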

Benefit 6: Query performance

We reran the same exact set of search queries, but a smaller number of times overall (350 queries instead of 19.9k, which we found to still be a representative enough sample).

As we can see below, table splitting in general led to a 200-300% improvement in query time for heavier queries: queries that previously took 20-30s now take only 7-15s thanks to parallel querying (top chart is before, bottom is after, both in milliseconds):

image

We also grouped queries based on the LIMIT specified in the query and placed them into time buckets (“how many queries completed in under 50ms?”). Comparing the two shows that less complex queries and/or queries for fewer results were slightly negatively affected, while larger queries were helped substantially:

Change (positive is good) | Results limit | Bucket | Queries in bucket (before) | Queries in bucket (after)
-33% | 10 | <50ms | 33% | 0%
+13% | 10 | <250ms | 44% | 57%
+33% | 10 | <1s | 77% | 100%
-29% | 100 | <100ms | 29% | 0%
+20% | 100 | <500ms | 50% | 70%
+19% | 100 | <10s | 80% | 99%
-12% | 1000 | <250ms | 12% | 0%
-13% | 1000 | <2.5s | 77% | 64%
+23% | 1000 | <20s | 77% | 100%
+4% | none | <20s | 0% | 4%
+18% | none | <60s | 0% | 18%

Detailed comparisons are available below for those interested:

Queries with `LIMIT 10`
Time bucket | Percentage of queries (before) | Percentage of queries (after splitting)
50ms | 33.00% (2999 of 9004) | 0% (0 of 100)
100ms | 33.00% (2999 of 9004) | 1.00% (1 of 100)
250ms | 44.00% (3999 of 9004) | 57.00% (57 of 100)
500ms | 55.00% (4999 of 9004) | 79.00% (79 of 100)
1000ms | 77.00% (6998 of 9004) | 80.00% (80 of 100)
2500ms | 77.00% (7003 of 9004) | 80.00% (80 of 100)
5000ms | 77.00% (7004 of 9004) | 80.00% (80 of 100)
10000ms | 77.00% (7004 of 9004) | 100.00% (100 of 100)
20000ms | 77.00% (7004 of 9004) | 100.00% (100 of 100)
30000ms | 99.00% (8985 of 9004) | 100.00% (100 of 100)
40000ms | 99.00% (9003 of 9004) | 100.00% (100 of 100)
50000ms | 100.00% (9004 of 9004) | 100.00% (100 of 100)
60000ms | 100.00% (9004 of 9004) | 100.00% (100 of 100)
Queries with `LIMIT 100`
Time bucket | Percentage of queries (before) | Percentage of queries (after splitting)
50ms | 29.00% (2934 of 10000) | 0% (0 of 100)
100ms | 29.00% (2978 of 10000) | 0% (0 of 100)
250ms | 39.00% (3975 of 10000) | 31.00% (31 of 100)
500ms | 50.00% (5000 of 10000) | 70.00% (70 of 100)
1000ms | 59.00% (5984 of 10000) | 79.00% (79 of 100)
2500ms | 79.00% (7996 of 10000) | 80.00% (80 of 100)
5000ms | 80.00% (8000 of 10000) | 80.00% (80 of 100)
10000ms | 80.00% (8000 of 10000) | 99.00% (99 of 100)
20000ms | 80.00% (8000 of 10000) | 100.00% (100 of 100)
30000ms | 99.00% (9999 of 10000) | 100.00% (100 of 100)
40000ms | 100.00% (10000 of 10000) | 100.00% (100 of 100)
50000ms | 100.00% (10000 of 10000) | 100.00% (100 of 100)
60000ms | 100.00% (10000 of 10000) | 100.00% (100 of 100)
Queries with `LIMIT 1000`
Time bucket | Percentage of queries (before) | Percentage of queries (after splitting)
50ms | 0% (0 of 904) | 0% (0 of 100)
100ms | 0% (1 of 904) | 0% (0 of 100)
250ms | 12.00% (114 of 904) | 0% (0 of 100)
500ms | 30.00% (276 of 904) | 21.00% (21 of 100)
1000ms | 55.00% (499 of 904) | 41.00% (41 of 100)
2500ms | 77.00% (700 of 904) | 64.00% (64 of 100)
5000ms | 77.00% (704 of 904) | 77.00% (77 of 100)
10000ms | 77.00% (704 of 904) | 98.00% (98 of 100)
20000ms | 77.00% (704 of 904) | 100.00% (100 of 100)
30000ms | 88.00% (804 of 904) | 100.00% (100 of 100)
40000ms | 99.00% (901 of 904) | 100.00% (100 of 100)
50000ms | 99.00% (903 of 904) | 100.00% (100 of 100)
60000ms | 100.00% (904 of 904) | 100.00% (100 of 100)
Queries with no limit
Time bucket | Percentage of queries (before) | Percentage of queries (after splitting)
50ms | 0% (0 of 28) | 0% (0 of 50)
100ms | 0% (0 of 28) | 0% (0 of 50)
250ms | 0% (0 of 28) | 0% (0 of 50)
500ms | 0% (0 of 28) | 0% (0 of 50)
1000ms | 0% (0 of 28) | 0% (0 of 50)
2500ms | 0% (0 of 28) | 0% (0 of 50)
5000ms | 0% (0 of 28) | 0% (0 of 50)
10000ms | 0% (0 of 28) | 0% (0 of 50)
20000ms | 0% (0 of 28) | 4.00% (2 of 50)
30000ms | 0% (0 of 28) | 16.00% (8 of 50)
40000ms | 0% (0 of 28) | 16.00% (8 of 50)
50000ms | 0% (0 of 28) | 18.00% (9 of 50)
60000ms | 0% (0 of 28) | 18.00% (9 of 50)

Postgres-in-Docker vs. native Postgres

Addition made Feb 20, 2021

In our original article, we did not clarify the performance impact of running Postgres inside of Docker with a volume bind mount. Thorsten Ball raised this with us as a potential source of IO performance differences.

We ran all tests above with Postgres in Docker, using a volume bind mount (the osxfs driver, not the experimental gRPC FUSE driver).

We additionally ran the same table-splitting benchmarks on a native Postgres server (reproduction steps here) and found the following key changes:

CPU usage & memory usage: approximately the same

CPU and memory usage was approximately the same as in our Docker Postgres tests.

We anticipated this would be the case, as the MacBook does have VT-x virtualization enabled (the default on all i7/i9 MacBooks, confirmed through sysctl kern.hv_support).

Indexing speed was ~88% faster

Running the statements to split up the large table into multiple smaller ones, i.e.:

CREATE TABLE files_000 AS SELECT * FROM files WHERE id > 0 AND id < 50000;
CREATE TABLE files_001 AS SELECT * FROM files WHERE id > 50000 AND id < 100000;
...

was much faster in native Postgres, taking about 2-8s for each table instead of 20-40s previously, and only 15m in total instead of 2h before.

Parallel creation of the Trigram indexes using e.g.:

CREATE INDEX IF NOT EXISTS files_000_contents_trgm_idx ON files_000 USING GIN (contents gin_trgm_ops);

was also much faster, taking only 23m compared to ~3h with Docker.

Query performance is 12-99% faster, depending on query

We re-ran the same 350 queries as in our earlier table-splitting benchmark, and found the following substantial improvements:

  1. Queries that were previously very slow saw a ~12% improvement. This is likely due to faster IO for the operations needed when interfacing with the 200 separate tables.
  2. Queries that were previously in the middle ground saw meager ~5% improvements.
  3. Queries that were previously fairly fast (likely searching only one or two tables before returning) saw substantial 16-99% improvements.
Exhaustive comparison details (negative change is good)
Change | Time bucket | Queries under bucket (before) | Queries under bucket (after)
0% | 500s | 350 of 350 | 350 of 350
-12% | 100s | 309 of 350 | 350 of 350
-12% | 50s | 309 of 350 | 350 of 350
-12% | 40s | 308 of 350 | 350 of 350
-12% | 30s | 308 of 350 | 349 of 350
-7% | 25s | 307 of 350 | 330 of 350
-8% | 20s | 302 of 350 | 330 of 350
-5% | 10s | 297 of 350 | 311 of 350
-26% | 5s | 237 of 350 | 319 of 350
-7% | 2500ms | 224 of 350 | 240 of 350
-9% | 2000ms | 219 of 350 | 240 of 350
-9% | 1500ms | 219 of 350 | 240 of 350
-16% | 1000ms | 200 of 350 | 237 of 350
-14% | 750ms | 190 of 350 | 221 of 350
-23% | 500ms | 170 of 350 | 220 of 350
-59% | 250ms | 88 of 350 | 217 of 350
-99% | 100ms | 1 of 350 | 168 of 350
-99% | 50ms | 1 of 350 | 168 of 350

Conclusions

We think the following learnings are most important:

  • .git directories, even with --depth=1 clones, account for 30% of a repository’s size on disk (at least in the top 10,000 GitHub repositories).
  • Files > 1 MiB (often binaries) account for another 51% of the data size on disk of repositories.
  • On only a MacBook Pro, it is possible to get Postgres trigram regexp search over 10,000 repositories to run most reasonable queries in under 5s - and certainly much faster with more hardware.
  • pg_trgm performs single-threaded indexing and querying, unless you split your data up into multiple tables.
  • A Postgres text column is compressed on disk out of the box - resulting in a 23% reduction in size (with the files we inserted).
  • pg_trgm GIN indexes take around 26% of the size of your data on disk. So if indexing 1 GiB of raw text, expect Postgres to store that text in around ~827 MiB, plus 279 MiB for the GIN trigram index.
  • Splitting your data into multiple tables when using pg_trgm is an obvious win, as it allows for parallel indexing, which can be the difference between 4h and 22h. It also reduces the risk of an indexing failure after 22h due to e.g. lack of memory, and uses much less peak memory overall.
  • Docker bind mounts (not volumes) are quite slow outside of Linux host environments (there are many other articles on this subject.)

If you are looking for fast regexp or code search today, consider:

  • Sourcegraph (disclaimer: the author works here, but this article is not endorsed or affiliated in any way)
  • Zoekt
  • Ripgrep

Follow this devlog for updates as we continue investigating faster ways to do regexp & ngram search at large scales.