Postgres regex search over 10,000 GitHub repositories (using only a Macbook)

In this article, we share empirical measurements from our experiments in using Postgres to index and search over 10,000 top GitHub repositories using pg_trgm on only a Macbook.

This is a follow-up to “Postgres Trigram search learnings”, in which we shared several learnings and beliefs about trying to use Postgres Trigram indexes as an alternative to Google’s Zoekt (“Fast trigram based code search”).

We share our results, as well as the exact steps we performed, scripts, and lists of the top 20,000 repositories by stars/language on GitHub so you can reproduce the results yourself should you desire.

TL;DR

This article is extensive and more akin to a research paper than a blog post. If you’re only interested in our conclusions, skip ahead to the conclusions section.

Goals

We wanted to get empirical measurements for how suitable Postgres is at providing regexp search over documents, e.g. as an alternative to Google’s Zoekt (“Fast trigram based code search”). Specifically:

  • How many repositories can we index on just a 2019 Macbook Pro?
  • How fast are different regexp searches over the corpus?
  • What Postgres 13 configuration gives the best results?
  • What other operational effects need consideration if seriously attempting to use Postgres as the backend for a regexp search engine?
  • What is the best database schema to use?

Hardware

We ran all tests on a 2019 Macbook Pro with:

  • 2.3 GHz 8-Core Intel Core i9
  • 16 GB 2667 MHz DDR4

During test execution, few other applications were running, so effectively all CPU and memory was available to Postgres.

Corpus

We scraped lists of the top 1,000 repositories from the GitHub search API ranked by stars for each of the following languages (~20.5k repositories in total):

  • C++, C#, CSS, Go, HTML, Java, JavaScript, MatLab, ObjC, Perl, PHP, Python, Ruby, Rust, Shell, Solidity, Swift, TypeScript, VB .NET, and Zig.

Cloning all ~20.5k repositories in parallel took ~14 hours with a fast ~100 Mbps connection to GitHub’s servers.

Dataset reduction

We found the amount of disk space required by git clone --depth 1 on these repositories to be a sizable ~412G for just 12,148 repositories, so we put several processes in place to further reduce the dataset size by about 66%:

  • Removing .git directories resulted in a 30% reduction (412G -> 290G, for 12,148 repositories)
  • Removing files > 1 MiB resulted in another 51% reduction (290G -> 142G, for 12,148 repositories - note GitHub does not index files > 384 KiB in their search engine)

Database insertion

We concurrently inserted the entire corpus into Postgres, with the following DB schema:

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE TABLE IF NOT EXISTS files (
    id bigserial PRIMARY KEY,
    contents text NOT NULL,
    filepath text NOT NULL
);
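
Each file in the corpus becomes one row in this table. Our exact loader isn’t reproduced here, but as an illustrative sketch (the path and contents below are placeholders), the insertion boils down to statements like:

-- Illustrative only: one row per file, issued concurrently from many client connections.
-- A bulk-loading mechanism such as COPY would work just as well.
INSERT INTO files (filepath, contents)
VALUES ('example-org/example-repo/main.go', '<file contents>');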

In total, this took ~8 hours to complete, and Postgres’s entire on-disk utilization was 101G.

Creating the Trigram index

We tried three separate times to index the dataset using the following GIN Trigram index:

CREATE INDEX IF NOT EXISTS files_contents_trgm_idx ON files USING GIN (contents gin_trgm_ops);
  • In the first attempt, we hit an OOM after 11 hours and 34 minutes. This was due to a rapid spike in memory usage at the very end of indexing. We used a fairly aggressive Postgres configuration with a very large max WAL size, so it was not entirely unexpected.
  • In the second attempt, we ran out of SSD disk space after ~27 hours. Notably, the disk space largely grew towards the end of indexing, similar to when we faced the OOM - it was not a gradual increase over time. For this attempt, we used the excellent pgtune tool to scale back our first Postgres configuration as follows:
shared_buffers = 4GB → 2560MB
effective_cache_size = 12GB → 7680MB
maintenance_work_mem = 16GB → 1280MB
default_statistics_target = 100 → 500
work_mem = 5242kB → 16MB
min_wal_size = 50GB → 4GB
max_wal_size = 4GB → 16GB
max_parallel_workers_per_gather = 8 → 4
max_parallel_maintenance_workers = 8 → 4
  • In our third and final attempt, we cut the dataset in half and indexing succeeded after 22 hours. Specifically, we deleted half of the files in the database (from 19,441,820 files / 178 GiB of data to 9,720,910 files / 82 GiB of data); a sketch of such a deletion is shown after this list. The Postgres configuration used was the same as in attempt 2.
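
The exact criterion we used to drop half of the rows is not important to the results; as a hypothetical sketch, something like the following halves the table and reclaims the disk space before re-running CREATE INDEX:

-- Hypothetical: drop roughly half the rows (any selection criterion would do).
DELETE FROM files WHERE id % 2 = 0;
-- DELETE alone leaves dead tuples behind; rewrite the table to return space to the OS.
VACUUM FULL files;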

Indexing performance: Memory usage

In our first attempt, we see the reported docker stats memory usage of the container grow up to 12 GiB (chart shows MiB of memory used over time):

image

In our second and third attempts, we see far less memory usage (~1.6 GiB consistently):

image image

Indexing performance: CPU usage

Postgres’s Trigram indexing appears to be mostly single-threaded (at least when indexing a single table; we test multiple tables later).

In our first attempt, CPU usage for the container did not rise above 156% (one and a half virtual CPU cores):

image

Our second attempt was around 150-200% CPU usage on average:

image

Our third attempt similarly saw an average of 150-200%, but with a brief spike towards the end to ~350% CPU:

image

Indexing performance: Disk IO

Disk reads/writes during indexing averaged about ~250 MB/s for reads (blue) and writes (red). Native benchmarks on the same Macbook show it can achieve read/write speeds of ~860 MB/s with <5% effect on CPU utilization.

Addition made Feb 20, 2021: We ran tests using native Postgres as well (instead of in Docker with a bind mount) and found better indexing and query performance; more on this below.

image

Indexing performance: Disk space

The database contains 9,720,910 files totalling 82.07 GiB:

postgres=# select count(filepath) from files;
  count  
---------
 9720910
(1 row)

postgres=# select SUM(octet_length(contents)) from files;
     sum     
-------------
 88123563320
(1 row)

Before indexing, we find that all of Postgres is consuming 54G:

$ du -sh .postgres/
 54G	.postgres/

After CREATE INDEX, Postgres uses:

$ du -sh .postgres/
 73G	.postgres/

Thus, the index size for 82 GiB of text is 19 GiB (73G - 54G), or 23% of the data size.
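
These sizes can also be read from within Postgres rather than via du, using its built-in size functions:

-- Size of the GIN trigram index alone:
SELECT pg_size_pretty(pg_relation_size('files_contents_trgm_idx'));
-- Size of the files table including all of its indexes:
SELECT pg_size_pretty(pg_total_relation_size('files'));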

Database startup times

From an operational standpoint, it is worth noting that if Postgres is starting clean (i.e. previous shutdown was graceful) then startup time is almost instantaneous: it begins accepting connections immediately and loads the index as needed.

However, if Postgres was terminated non-gracefully, startup can take a hefty ~10 minutes with this dataset as it goes through an automated recovery process.

Queries executed

In total, we executed 19,936 search queries against the index. We chose queries which we expected to give varying amounts of coverage over the trigram index (that is, queries whose trigrams are more or less likely to occur in many files); a sketch of the query shape follows the tables below:

Regexp query                         | Matching # files in entire dataset
var                                  | unknown (2m+ suspected)
error                                | 1,479,452
123456789                            | 59,841
fmt\.Error                           | 127,895
fmt\.Println                         | 22,876
bytes.Buffer                         | 34,554
fmt\.Print.*                         | 37,319
ac8ac5d63b66b83b90ce41a2d4061635     | 0
d97f1d3ff91543[e-f]49.8b07517548877  | 0
Detailed breakdown

Query                                | Result limit | Times executed
var                                  | 10           | 1000
var                                  | 100          | 1000
var                                  | 1000         | 100
var                                  | unlimited    | 4
error                                | 10           | 2000
error                                | 100          | 2000
error                                | 1000         | 200
error                                | unlimited    | 18
123456789                            | 10           | 1000
123456789                            | 100          | 1000
123456789                            | 1000         | 100
123456789                            | unlimited    | 2
fmt\.Error                           | 10           | 1000
fmt\.Error                           | 100          | 1000
fmt\.Error                           | 1000         | 100
fmt\.Error                           | unlimited    | 2
fmt\.Println                         | 10           | 1000
fmt\.Println                         | 100          | 1000
fmt\.Println                         | 1000         | 100
fmt\.Println                         | unlimited    | 2
bytes.Buffer                         | 10           | 4
bytes.Buffer                         | 100          | 4
bytes.Buffer                         | 1000         | 4
bytes.Buffer                         | unlimited    | 2
fmt\.Print.*                         | 10           | 1000
fmt\.Print.*                         | 100          | 1000
fmt\.Print.*                         | 1000         | 100
fmt\.Print.*                         | unlimited    | 2
ac8ac5d63b66b83b90ce41a2d4061635     | 10           | 1000
ac8ac5d63b66b83b90ce41a2d4061635     | 100          | 1000
ac8ac5d63b66b83b90ce41a2d4061635     | 1000         | 100
ac8ac5d63b66b83b90ce41a2d4061635     | unlimited    | 2
d97f1d3ff91543[e-f]49.8b07517548877  | 10           | 1000
d97f1d3ff91543[e-f]49.8b07517548877  | 100          | 1000
d97f1d3ff91543[e-f]49.8b07517548877  | 1000         | 100
d97f1d3ff91543[e-f]49.8b07517548877  | unlimited    | 2
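
The exact SQL issued by our benchmark harness isn’t reproduced above, but each search is a single pg_trgm-accelerated regexp match; a minimal sketch (the pattern and LIMIT vary per the tables above):

-- pg_trgm's GIN index accelerates the POSIX regexp operator (~): the index filters
-- candidate rows by trigram, and Postgres rechecks the full regexp on each candidate.
SELECT filepath
FROM files
WHERE contents ~ 'fmt\.Error'
LIMIT 10;

-- The "matching # files in entire dataset" column corresponds to:
SELECT count(*) FROM files WHERE contents ~ 'fmt\.Error';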

Query performance

In total, we executed 19,936 search queries against the database (serially, not in parallel), which completed in the following times:

Time bucket  | Percentage of queries | Number of queries
Under 50ms   | 30%                   | 5,933
Under 250ms  | 41%                   | 8,088
Under 500ms  | 52%                   | 10,275
Under 750ms  | 63%                   | 12,473
Under 1s     | 68%                   | 13,481
Under 1.5s   | 74%                   | 14,697
Under 3s     | 79%                   | 15,706
Under 25s    | 79%                   | 15,708
Under 30s    | 99%                   | 19,788

Query performance vs. planning time

The following scatter plot shows how 79% of queries executed in under 3s (Y axis, in ms), while Postgres’s query planning time was generally under 100-250ms (X axis, in ms):

image

If we expand the view to include all queries, we start to get a picture of just how much of an outlier the remaining 21% of queries are (note that the small block of dots in the bottom left represents the same diagram shown above):

image

Query time vs. CPU & Memory usage

The following image shows:

  • (top) Query time in milliseconds
  • (middle) CPU usage percentage (e.g. 801% refers to 8 out of 16 virtual CPU cores being consumed)
  • (bottom) Memory usage in MiB.
image

Notable insights from this are:

  • The large increase in resource usage towards the end is when we began executing queries with no LIMIT.
  • CPU usage does not exceed 138%, until the spike at the end.
  • Memory usage does not exceed 42 MiB, until the spike at the end.

We suspect pg_trgm is single-threaded within the scope of a single table, but with table data partitioning (or splitting data into multiple tables with subsets of the data), we suspect better parallelism could be achieved.
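
We did not benchmark declarative partitioning (the experiment below uses plain separate tables), but as a sketch of what the partitioned variant could look like, with illustrative bounds and names:

-- Sketch only: range-partition the same schema by id so each partition gets its own
-- GIN trigram index. Bounds and names here are illustrative.
CREATE TABLE files_partitioned (
    id bigint NOT NULL,
    contents text NOT NULL,
    filepath text NOT NULL
) PARTITION BY RANGE (id);

CREATE TABLE files_part_000 PARTITION OF files_partitioned FOR VALUES FROM (0) TO (50000);
CREATE TABLE files_part_001 PARTITION OF files_partitioned FOR VALUES FROM (50000) TO (100000);
-- ...and so on for the remaining partitions.

-- On Postgres 13, creating the index on the parent creates one index per partition:
CREATE INDEX files_partitioned_contents_trgm_idx
    ON files_partitioned USING GIN (contents gin_trgm_ops);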

Investigating slow queries

If we plot the number of index rechecks (X axis) vs. execution time (Y axis), we can clearly see one of the most significant aspects of slow queries is that they have many more index rechecks:

image

And if we look at the EXPLAIN ANALYZE output for one of these queries, we can confirm that the Parallel Bitmap Heap Scan is slow due to the large number of Rows Removed by Index Recheck.
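
The full plans are too long to reproduce here, but any individual slow query can be inspected the same way (the query shown is one of the slower patterns from our set):

-- EXPLAIN ANALYZE executes the query and reports actual node timings. On the slow
-- queries, the Bitmap Heap Scan reports a large "Rows Removed by Index Recheck":
-- the trigram index returned many candidates that the full regexp then rejected.
EXPLAIN ANALYZE
SELECT filepath
FROM files
WHERE contents ~ 'd97f1d3ff91543[e-f]49.8b07517548877'
LIMIT 1000;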

Table splitting

Splitting the search index into multiple smaller tables seems like an obvious approach to getting pg_trgm to use multiple CPU cores. We tried this by splitting the exact same dataset into 200 tables, and found numerous benefits:

Benefit 1: Incremental indexing

If indexing fails after 11-27 hours, as happened to us twice with the non-splitting approach, not all progress is lost.

Benefit 2: Parallel indexing

Unlike the non-splitting approach, where we were only able to utilize 1.5-2 virtual CPU cores, with multiple tables we are able to utilize 8-9 virtual CPU cores:

image

Benefit 3: Indexing is 84% faster

Unlike the single-table approach, where indexing took 22 hours in total, parallel indexing completed in only 3h27m.

Benefit 4: Indexing uses 69% less memory

With non-splitting we saw peak memory usage up to 12 GiB. With the same exact Postgres configuration, we were able to index with only 3.7 GiB peak memory usage:

image

Benefit 5: Parallel querying

Previously we saw CPU utilization of only 138% (1.3 virtual CPU cores); with table splitting we see CPU utilization during queries of 1600% (16 virtual CPU cores), showing we are doing work fully in parallel:

image

Similarly, we saw memory usage average around ~380 MiB, compared to only ~42 MiB before:

image
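
How the 200 tables are queried is an implementation detail not shown above. Since a single statement’s parallelism is capped by max_parallel_workers_per_gather, reaching ~1600% CPU suggests the per-table queries are issued concurrently (e.g. one connection per table); purely as a sketch, a single-statement version over the split tables (created later in this article) would look like:

-- Sketch only: searching the split tables in one statement. The LIMIT applies to the
-- combined result, so Postgres can stop early once enough rows are found.
SELECT filepath FROM files_000 WHERE contents ~ 'fmt\.Error'
UNION ALL
SELECT filepath FROM files_001 WHERE contents ~ 'fmt\.Error'
-- ...repeated for files_002 through files_199...
LIMIT 100;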

Benefit 6: Query performance

We reran the exact same set of search queries, but a smaller number of times overall (350 queries instead of 19.9k, which we found to still be a representative enough sample).

As we can see below, table splitting in general led to roughly a 2-3x improvement in query time for heavier queries: those that previously took 20-30s now take only 7-15s thanks to parallel querying (top chart is before, bottom is after, both in milliseconds):

image

We also grouped queries based on the LIMIT specified in the query and placed them into time buckets (“how many queries completed in under 50ms?”) - comparing the two shows that less complex queries and/or queries for fewer results were slightly negatively affected, while larger queries were helped substantially:

Change (positive is good) | Results limit | Bucket  | Queries in bucket before | Queries in bucket after
-33%                      | 10            | <50ms   | 33%                      | 0%
+13%                      | 10            | <250ms  | 44%                      | 57%
+33%                      | 10            | <1s     | 77%                      | 100%
-29%                      | 100           | <100ms  | 29%                      | 0%
+20%                      | 100           | <500ms  | 50%                      | 70%
+19%                      | 100           | <10s    | 80%                      | 99%
-12%                      | 1000          | <250ms  | 12%                      | 0%
-13%                      | 1000          | <2.5s   | 77%                      | 64%
+23%                      | 1000          | <20s    | 77%                      | 100%
+4%                       | none          | <20s    | 0%                       | 4%
+18%                      | none          | <60s    | 0%                       | 18%

Detailed comparisons are available below for those interested:

Queries with `LIMIT 10`

Time bucket | Percentage of queries (before) | Percentage of queries (after splitting)
50ms        | 33.00% (2999 of 9004)          | 0% (0 of 100)
100ms       | 33.00% (2999 of 9004)          | 1.00% (1 of 100)
250ms       | 44.00% (3999 of 9004)          | 57.00% (57 of 100)
500ms       | 55.00% (4999 of 9004)          | 79.00% (79 of 100)
1000ms      | 77.00% (6998 of 9004)          | 80.00% (80 of 100)
2500ms      | 77.00% (7003 of 9004)          | 80.00% (80 of 100)
5000ms      | 77.00% (7004 of 9004)          | 80.00% (80 of 100)
10000ms     | 77.00% (7004 of 9004)          | 100.00% (100 of 100)
20000ms     | 77.00% (7004 of 9004)          | 100.00% (100 of 100)
30000ms     | 99.00% (8985 of 9004)          | 100.00% (100 of 100)
40000ms     | 99.00% (9003 of 9004)          | 100.00% (100 of 100)
50000ms     | 100.00% (9004 of 9004)         | 100.00% (100 of 100)
60000ms     | 100.00% (9004 of 9004)         | 100.00% (100 of 100)
Queries with `LIMIT 100`

Time bucket | Percentage of queries (before) | Percentage of queries (after splitting)
50ms        | 29.00% (2934 of 10000)         | 0% (0 of 100)
100ms       | 29.00% (2978 of 10000)         | 0% (0 of 100)
250ms       | 39.00% (3975 of 10000)         | 31.00% (31 of 100)
500ms       | 50.00% (5000 of 10000)         | 70.00% (70 of 100)
1000ms      | 59.00% (5984 of 10000)         | 79.00% (79 of 100)
2500ms      | 79.00% (7996 of 10000)         | 80.00% (80 of 100)
5000ms      | 80.00% (8000 of 10000)         | 80.00% (80 of 100)
10000ms     | 80.00% (8000 of 10000)         | 99.00% (99 of 100)
20000ms     | 80.00% (8000 of 10000)         | 100.00% (100 of 100)
30000ms     | 99.00% (9999 of 10000)         | 100.00% (100 of 100)
40000ms     | 100.00% (10000 of 10000)       | 100.00% (100 of 100)
50000ms     | 100.00% (10000 of 10000)       | 100.00% (100 of 100)
60000ms     | 100.00% (10000 of 10000)       | 100.00% (100 of 100)
Queries with `LIMIT 1000`

Time bucket | Percentage of queries (before) | Percentage of queries (after splitting)
50ms        | 0% (0 of 904)                  | 0% (0 of 100)
100ms       | 0% (1 of 904)                  | 0% (0 of 100)
250ms       | 12.00% (114 of 904)            | 0% (0 of 100)
500ms       | 30.00% (276 of 904)            | 21.00% (21 of 100)
1000ms      | 55.00% (499 of 904)            | 41.00% (41 of 100)
2500ms      | 77.00% (700 of 904)            | 64.00% (64 of 100)
5000ms      | 77.00% (704 of 904)            | 77.00% (77 of 100)
10000ms     | 77.00% (704 of 904)            | 98.00% (98 of 100)
20000ms     | 77.00% (704 of 904)            | 100.00% (100 of 100)
30000ms     | 88.00% (804 of 904)            | 100.00% (100 of 100)
40000ms     | 99.00% (901 of 904)            | 100.00% (100 of 100)
50000ms     | 99.00% (903 of 904)            | 100.00% (100 of 100)
60000ms     | 100.00% (904 of 904)           | 100.00% (100 of 100)
Queries with no LIMIT

Time bucket | Percentage of queries (before) | Percentage of queries (after splitting)
50ms        | 0% (0 of 28)                   | 0% (0 of 50)
100ms       | 0% (0 of 28)                   | 0% (0 of 50)
250ms       | 0% (0 of 28)                   | 0% (0 of 50)
500ms       | 0% (0 of 28)                   | 0% (0 of 50)
1000ms      | 0% (0 of 28)                   | 0% (0 of 50)
2500ms      | 0% (0 of 28)                   | 0% (0 of 50)
5000ms      | 0% (0 of 28)                   | 0% (0 of 50)
10000ms     | 0% (0 of 28)                   | 0% (0 of 50)
20000ms     | 0% (0 of 28)                   | 4.00% (2 of 50)
30000ms     | 0% (0 of 28)                   | 16.00% (8 of 50)
40000ms     | 0% (0 of 28)                   | 16.00% (8 of 50)
50000ms     | 0% (0 of 28)                   | 18.00% (9 of 50)
60000ms     | 0% (0 of 28)                   | 18.00% (9 of 50)

Postgres-in-Docker vs. native Postgres

Addition made Feb 20, 2021

In our original article, we did not clarify the performance impact of running Postgres inside of Docker with a volume bind mount. Thorsten Ball raised this with us as a potential source of IO performance differences.

We ran all tests above with Postgres in Docker, using a volume bind mount (the osxfs driver, not the experimental FUSE gRPC driver.)

We additionally ran the same table-splitting benchmarks on a native Postgres server (reproduction steps here) and found the following key changes:

CPU usage & memory usage: approximately the same

CPU and memory usage was approximately the same as in our Docker Postgres tests.

We anticipated this would be the case, as the Macbook does have VT-x virtualization enabled (the default on all i7/i9 Macbooks, and confirmed through sysctl kern.hv_support).

Indexing speed was ~88% faster

Running the statements to split up the large table into multiple smaller ones, i.e.:

CREATE TABLE files_000 AS SELECT * FROM files WHERE id >= 0 AND id < 50000;
CREATE TABLE files_001 AS SELECT * FROM files WHERE id >= 50000 AND id < 100000;
...

was much faster in native Postgres, taking about 2-8s for each table instead of 20-40s previously, and only 15m in total instead of 2h before.

Parallel creation of the Trigram indexes using e.g.:

CREATE INDEX IF NOT EXISTS files_000_contents_trgm_idx ON files_000 USING GIN (contents gin_trgm_ops);

was also much faster, taking only 23m compared to ~3h with Docker.

Query performance is 12-99% faster, depending on query

We re-ran the same 350 queries as in our earlier table-splitting benchmark, and found the following substantial improvements:

  1. Queries that were previously very slow saw a ~12% improvement. This is likely due to the IO needed when interfacing with the 200 separate tables being faster natively.
  2. Queries that were previously in the middle ground saw meager ~5% improvements.
  3. Queries that were previously fairly fast (likely searching over only one or two tables before returning) saw substantial 16-99% improvements.
Exhaustive comparison details (negative change is good)
Change | Time bucket | Queries under bucket before | Queries under bucket after
0%     | 500s        | 350 of 350                  | 350 of 350
-12%   | 100s        | 309 of 350                  | 350 of 350
-12%   | 50s         | 309 of 350                  | 350 of 350
-12%   | 40s         | 308 of 350                  | 350 of 350
-12%   | 30s         | 308 of 350                  | 349 of 350
-7%    | 25s         | 307 of 350                  | 330 of 350
-8%    | 20s         | 302 of 350                  | 330 of 350
-5%    | 10s         | 297 of 350                  | 311 of 350
-26%   | 5s          | 237 of 350                  | 319 of 350
-7%    | 2500ms      | 224 of 350                  | 240 of 350
-9%    | 2000ms      | 219 of 350                  | 240 of 350
-9%    | 1500ms      | 219 of 350                  | 240 of 350
-16%   | 1000ms      | 200 of 350                  | 237 of 350
-14%   | 750ms       | 190 of 350                  | 221 of 350
-23%   | 500ms       | 170 of 350                  | 220 of 350
-59%   | 250ms       | 88 of 350                   | 217 of 350
-99%   | 100ms       | 1 of 350                    | 168 of 350
-99%   | 50ms        | 1 of 350                    | 168 of 350

Conclusions

We think the following learnings are most important:

  • .git directories, even with --depth=1 clones, account for 30% of a repository’s size on disk (at least among the top 10,000 GitHub repositories.)
  • Files > 1 MiB (often binaries) account for another 51% of the data size on disk of repositories.
  • On only a Macbook Pro, it is possible to get Postgres Trigram regex search over 10,000 repositories to run most reasonable queries in under 5s - and certainly much faster with more hardware.
  • pg_trgm performs single-threaded indexing and querying, unless you split your data up into multiple tables.
  • By default, a Postgres text column is compressed on disk out of the box, resulting in a 23% reduction in size (with the files we inserted.)
  • pg_trgm GIN indexes take around 26% of the size of your data on disk. So if indexing 1 GiB of raw text, expect Postgres to store that text in around ~827 MiB, plus 279 MiB for the GIN trigram index.
  • Splitting your data into multiple tables when using pg_trgm is an obvious win, as it allows for parallel indexing, which can be the difference between 4h and 22h. It also reduces the risk of an indexing failure after 22h due to e.g. lack of memory, and uses much less peak memory overall.
  • Docker bind mounts (not volumes) are quite slow outside of Linux host environments (there are many other articles on this subject.)

If you are looking for fast regexp or code search today, consider:

  • Sourcegraph (disclaimer: the author works here, but this article is not endorsed or affiliated in any way)
  • Zoekt
  • Ripgrep

Follow this devlog for updates as we continue investigating faster ways to do regexp & ngram search at large scales.