Postgres regex search over 10,000 GitHub repositories (using only a Macbook)

In this article, we share empirical measurements from our experiments in using Postgres to index and search over 10,000 top GitHub repositories using pg_trgm on only a Macbook.

This is a follow-up to “Postgres Trigram search learnings”, in which we shared several learnings and beliefs about trying to use Postgres Trigram indexes as an alternative to Google’s Zoekt (“Fast trigram based code search”).

We share our results, as well as the exact steps we performed, scripts, and lists of the top 20,000 repositories by stars/language on GitHub so you can reproduce the results yourself should you desire.

TL;DR

This article is extensive and more akin to a research paper than a blog post. If you’re only interested in our conclusions, skip ahead to the conclusions section.

Goals

We wanted to get empirical measurements for how suitable Postgres is at providing regexp search over documents, e.g. as an alternative to Google’s Zoekt (“Fast trigram based code search”). Specifically:

  • How many repositories can we index on just a 2019 Macbook Pro?
  • How fast are different regexp searches over the corpus?
  • What Postgres 13 configuration gives the best results?
  • What other operational effects need consideration if seriously attempting to use Postgres as the backend for a regexp search engine?
  • What is the best database schema to use?

Hardware

We ran all tests on a 2019 Macbook Pro with:

  • 2.3 GHz 8-Core Intel Core i9
  • 16 GB 2667 MHz DDR4

During test execution, few other applications were running, so effectively all CPU and memory was available to Postgres.

Corpus

We scraped lists of the top 1,000 repositories from the GitHub search API ranked by stars for each of the following languages (~20.5k repositories in total):

  • C++, C#, CSS, Go, HTML, Java, JavaScript, MatLab, ObjC, Perl, PHP, Python, Ruby, Rust, Shell, Solidity, Swift, TypeScript, VB .NET, and Zig.

Cloning all ~20.5k repositories in parallel took ~14 hours with a fast ~100 Mbps connection to GitHub’s servers.

Dataset reduction

We found the amount of disk space required by git clone --depth 1 on these repositories to be a sizable ~412G for just 12,148 repositories, so we put several processes in place to further reduce the dataset size by about 66%:

  • Removing .git directories resulted in a 30% reduction (412G -> 290G, for 12,148 repositories)
  • Removing files > 1 MiB resulted in another 51% reduction (290G -> 142G, for 12,148 repositories - note GitHub does not index files > 384 KiB in their search engine)

Database insertion

We concurrently inserted the entire corpus into Postgres, with the following DB schema:

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE TABLE IF NOT EXISTS files (
    id bigserial PRIMARY KEY,
    contents text NOT NULL,
    filepath text NOT NULL
);
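
Each file in the corpus becomes one row in this table. Our exact loader isn’t reproduced here, but as an illustrative sketch (the path and contents below are placeholders), the insertion boils down to statements like:

-- Illustrative only: one row per file, issued concurrently from many client connections.
-- A bulk-loading mechanism such as COPY would work just as well.
INSERT INTO files (filepath, contents)
VALUES ('example-org/example-repo/main.go', '<file contents>');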

In total, this took ~8 hours to complete, and Postgres’s entire on-disk utilization was 101G.

Creating the Trigram index

We tried three separate times to index the dataset using the following GIN Trigram index:

CREATE INDEX IF NOT EXISTS files_contents_trgm_idx ON files USING GIN (contents gin_trgm_ops);
  • In the first attempt, we hit an OOM after 11 hours and 34 minutes. This was due to a rapid spike in memory usage at the very end of indexing. We used a fairly aggressive Postgres configuration with a very large max WAL size, so it was not entirely unexpected.
  • In the second attempt, we ran out of SSD disk space after ~27 hours. Notably, the disk space largely grew towards the end of indexing, similar to when we faced the OOM - it was not a gradual increase over time. For this attempt, we used the excellent pgtune tool to scale back our first Postgres configuration as follows:
shared_buffers = 4GB → 2560MB
effective_cache_size = 12GB → 7680MB
maintenance_work_mem = 16GB → 1280MB
default_statistics_target = 100 → 500
work_mem = 5242kB → 16MB
min_wal_size = 50GB → 4GB
max_wal_size = 4GB → 16GB
max_parallel_workers_per_gather = 8 → 4
max_parallel_maintenance_workers = 8 → 4
  • In our third and final attempt, we cut the dataset in half and indexing succeeded after 22 hours. Specifically, we deleted half of the files in the database (from 19,441,820 files / 178 GiB of data to 9,720,910 files / 82 GiB of data); a sketch of such a deletion is shown after this list. The Postgres configuration used was the same as in attempt 2.
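
The exact criterion we used to drop half of the rows is not important to the results; as a hypothetical sketch, something like the following halves the table and reclaims the disk space before re-running CREATE INDEX:

-- Hypothetical: drop roughly half the rows (any selection criterion would do).
DELETE FROM files WHERE id % 2 = 0;
-- DELETE alone leaves dead tuples behind; rewrite the table to return space to the OS.
VACUUM FULL files;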

Indexing performance: Memory usage

In our first attempt, we see the reported docker stats memory usage of the container grow up to 12 GiB (chart shows MiB of memory used over time):

image

In our second and third attempts, we see far less memory usage (~1.6 GiB consistently):

image image

Indexing performance: CPU usage

Postgres’s Trigram indexing appears to be mostly single-threaded (at least when indexing a single table; we test multiple tables later).

In our first attempt, CPU usage for the container did not rise above 156% (one and a half virtual CPU cores):

image

Our second attempt was around 150-200% CPU usage on average:

image

Our third attempt similarly saw an average of 150-200%, but with a brief spike towards the end to ~350% CPU:

image

Indexing performance: Disk IO

Disk reads/writes during indexing averaged about ~250 MB/s for reads (blue) and writes (red). Native benchmarks on the same Macbook show it can achieve read/write speeds of ~860 MB/s with <5% effect on CPU utilization.

Addition made Feb 20, 2021: We ran tests using native Postgres as well (instead of in Docker with a bind mount) and found better indexing and query performance; more on this below.

image

Indexing performance: Disk space

The database contains 9,720,910 files totalling 82.07 GiB:

postgres=# select count(filepath) from files;
  count  
---------
 9720910
(1 row)

postgres=# select SUM(octet_length(contents)) from files;
     sum     
-------------
 88123563320
(1 row)

Before indexing, we find that all of Postgres is consuming 54G:

$ du -sh .postgres/
 54G	.postgres/

After CREATE INDEX, Postgres uses:

$ du -sh .postgres/
 73G	.postgres/

Thus, the index size for 82 GiB of text is 19 GiB (73G - 54G), or 23% of the data size.
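
These sizes can also be read from within Postgres rather than via du, using its built-in size functions:

-- Size of the GIN trigram index alone:
SELECT pg_size_pretty(pg_relation_size('files_contents_trgm_idx'));
-- Size of the files table including all of its indexes:
SELECT pg_size_pretty(pg_total_relation_size('files'));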

Database startup times

From an operational standpoint, it is worth noting that if Postgres is starting clean (i.e. previous shutdown was graceful) then startup time is almost instantaneous: it begins accepting connections immediately and loads the index as needed.

However, if Postgres was terminated non-gracefully, startup can take a hefty ~10 minutes with this dataset as it goes through an automated recovery process.

Queries executed

In total, we executed 19,936 search queries against the index. We chose queries which we expected to give varying amounts of coverage over the trigram index (that is, queries whose trigrams are more or less likely to occur in many files); a sketch of the query shape follows the tables below:

Regexp query                         | Matching # files in entire dataset
var                                  | unknown (2m+ suspected)
error                                | 1,479,452
123456789                            | 59,841
fmt\.Error                           | 127,895
fmt\.Println                         | 22,876
bytes.Buffer                         | 34,554
fmt\.Print.*                         | 37,319
ac8ac5d63b66b83b90ce41a2d4061635     | 0
d97f1d3ff91543[e-f]49.8b07517548877  | 0
Detailed breakdown

Query                                | Result limit | Times executed
var                                  | 10           | 1000
var                                  | 100          | 1000
var                                  | 1000         | 100
var                                  | unlimited    | 4
error                                | 10           | 2000
error                                | 100          | 2000
error                                | 1000         | 200
error                                | unlimited    | 18
123456789                            | 10           | 1000
123456789                            | 100          | 1000
123456789                            | 1000         | 100
123456789                            | unlimited    | 2
fmt\.Error                           | 10           | 1000
fmt\.Error                           | 100          | 1000
fmt\.Error                           | 1000         | 100
fmt\.Error                           | unlimited    | 2
fmt\.Println                         | 10           | 1000
fmt\.Println                         | 100          | 1000
fmt\.Println                         | 1000         | 100
fmt\.Println                         | unlimited    | 2
bytes.Buffer                         | 10           | 4
bytes.Buffer                         | 100          | 4
bytes.Buffer                         | 1000         | 4
bytes.Buffer                         | unlimited    | 2
fmt\.Print.*                         | 10           | 1000
fmt\.Print.*                         | 100          | 1000
fmt\.Print.*                         | 1000         | 100
fmt\.Print.*                         | unlimited    | 2
ac8ac5d63b66b83b90ce41a2d4061635     | 10           | 1000
ac8ac5d63b66b83b90ce41a2d4061635     | 100          | 1000
ac8ac5d63b66b83b90ce41a2d4061635     | 1000         | 100
ac8ac5d63b66b83b90ce41a2d4061635     | unlimited    | 2
d97f1d3ff91543[e-f]49.8b07517548877  | 10           | 1000
d97f1d3ff91543[e-f]49.8b07517548877  | 100          | 1000
d97f1d3ff91543[e-f]49.8b07517548877  | 1000         | 100
d97f1d3ff91543[e-f]49.8b07517548877  | unlimited    | 2
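
The exact SQL issued by our benchmark harness isn’t reproduced above, but each search is a single pg_trgm-accelerated regexp match; a minimal sketch (the pattern and LIMIT vary per the tables above):

-- pg_trgm's GIN index accelerates the POSIX regexp operator (~): the index filters
-- candidate rows by trigram, and Postgres rechecks the full regexp on each candidate.
SELECT filepath
FROM files
WHERE contents ~ 'fmt\.Error'
LIMIT 10;

-- The "matching # files in entire dataset" column corresponds to:
SELECT count(*) FROM files WHERE contents ~ 'fmt\.Error';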

Query performance

In total, we executed 19,936 search queries against the database (serially, not in parallel), which completed in the following times:

Time bucket  | Percentage of queries | Number of queries
Under 50ms   | 30%                   | 5,933
Under 250ms  | 41%                   | 8,088
Under 500ms  | 52%                   | 10,275
Under 750ms  | 63%                   | 12,473
Under 1s     | 68%                   | 13,481
Under 1.5s   | 74%                   | 14,697
Under 3s     | 79%                   | 15,706
Under 25s    | 79%                   | 15,708
Under 30s    | 99%                   | 19,788

Query performance vs. planning time

The following scatter plot shows how 79% of queries executed in under 3s (Y axis, in ms), while Postgres’s query planning time was generally under 100-250ms (X axis, in ms):

image

If we expand the view to include all queries, we start to get a picture of just how much of an outlier the remaining 21% of queries are (note that the small block of dots in the bottom left represents the same diagram shown above):

image

Query time vs. CPU & Memory usage

The following image shows:

  • (top) Query time in milliseconds
  • (middle) CPU usage percentage (e.g. 801% refers to 8 out of 16 virtual CPU cores being consumed)
  • (bottom) Memory usage in MiB.
image

Notable insights from this are:

  • The large increase in resource usage towards the end is when we began executing queries with no LIMIT.
  • CPU usage does not exceed 138%, until the spike at the end.
  • Memory usage does not exceed 42 MiB, until the spike at the end.

We suspect pg_trgm is single-threaded within the scope of a single table, but with table data partitioning (or splitting data into multiple tables with subsets of the data), we suspect better parallelism could be achieved.
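
We did not benchmark declarative partitioning (the experiment below uses plain separate tables), but as a sketch of what the partitioned variant could look like, with illustrative bounds and names:

-- Sketch only: range-partition the same schema by id so each partition gets its own
-- GIN trigram index. Bounds and names here are illustrative.
CREATE TABLE files_partitioned (
    id bigint NOT NULL,
    contents text NOT NULL,
    filepath text NOT NULL
) PARTITION BY RANGE (id);

CREATE TABLE files_part_000 PARTITION OF files_partitioned FOR VALUES FROM (0) TO (50000);
CREATE TABLE files_part_001 PARTITION OF files_partitioned FOR VALUES FROM (50000) TO (100000);
-- ...and so on for the remaining partitions.

-- On Postgres 13, creating the index on the parent creates one index per partition:
CREATE INDEX files_partitioned_contents_trgm_idx
    ON files_partitioned USING GIN (contents gin_trgm_ops);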

Investigating slow queries

If we plot the number of index rechecks (X axis) vs. execution time (Y axis), we can clearly see one of the most significant aspects of slow queries is that they have many more index rechecks:

image

And if we look at the EXPLAIN ANALYZE output for one of these queries, we can confirm that the Parallel Bitmap Heap Scan is slow due to the large number of Rows Removed by Index Recheck.
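
The full plans are too long to reproduce here, but any individual slow query can be inspected the same way (the query shown is one of the slower patterns from our set):

-- EXPLAIN ANALYZE executes the query and reports actual node timings. On the slow
-- queries, the Bitmap Heap Scan reports a large "Rows Removed by Index Recheck":
-- the trigram index returned many candidates that the full regexp then rejected.
EXPLAIN ANALYZE
SELECT filepath
FROM files
WHERE contents ~ 'd97f1d3ff91543[e-f]49.8b07517548877'
LIMIT 1000;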

Table splitting

Splitting the search index into multiple smaller tables seems like an obvious approach to getting pg_trgm to use multiple CPU cores. We tried this by splitting the exact same dataset into 200 tables, and found numerous benefits:

Benefit 1: Incremental indexing

If indexing fails after 11-27 hours, as happened to us twice with the non-splitting approach, not all progress is lost.

Benefit 2: Parallel indexing

Unlike the non-splitting approach, where we were only able to utilize 1.5-2 virtual CPU cores, with multiple tables we are able to utilize 8-9 virtual CPU cores:

image

Benefit 3: Indexing is 84% faster

Unlike the single-table approach, where indexing took 22 hours in total, parallel indexing completed in only 3h27m.

Benefit 4: Indexing uses 69% less memory

With non-splitting we saw peak memory usage up to 12 GiB. With the same exact Postgres configuration, we were able to index with only 3.7 GiB peak memory usage:

image

Benefit 5: Parallel querying

Previously we saw CPU utilization of only 138% (1.3 virtual CPU cores); with table splitting we see CPU utilization during queries of 1600% (16 virtual CPU cores), showing we are doing work fully in parallel:

image

Similarly, we saw memory usage average around ~380 MiB, compared to only ~42 MiB before:

image
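
How the 200 tables are queried is an implementation detail not shown above. Since a single statement’s parallelism is capped by max_parallel_workers_per_gather, reaching ~1600% CPU suggests the per-table queries are issued concurrently (e.g. one connection per table); purely as a sketch, a single-statement version over the split tables (created later in this article) would look like:

-- Sketch only: searching the split tables in one statement. The LIMIT applies to the
-- combined result, so Postgres can stop early once enough rows are found.
SELECT filepath FROM files_000 WHERE contents ~ 'fmt\.Error'
UNION ALL
SELECT filepath FROM files_001 WHERE contents ~ 'fmt\.Error'
-- ...repeated for files_002 through files_199...
LIMIT 100;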

Benefit 6: Query performance

We reran the exact same set of search queries, but a smaller number of times overall (350 queries instead of 19.9k, which we found to still be a representative enough sample).

As we can see below, table splitting in general led to roughly a 2-3x improvement in query time for heavier queries: those that previously took 20-30s now take only 7-15s thanks to parallel querying (top chart is before, bottom is after, both in milliseconds):

image

We also grouped queries based on the LIMIT specified in the query and placed them into time buckets (“how many queries completed in under 50ms?”) - comparing the two shows that less complex queries and/or queries for fewer results were slightly negatively affected, while larger queries were helped substantially:

Change (positive is good) | Results limit | Bucket  | Queries in bucket before | Queries in bucket after
-33%                      | 10            | <50ms   | 33%                      | 0%
+13%                      | 10            | <250ms  | 44%                      | 57%
+33%                      | 10            | <1s     | 77%                      | 100%
-29%                      | 100           | <100ms  | 29%                      | 0%
+20%                      | 100           | <500ms  | 50%                      | 70%
+19%                      | 100           | <10s    | 80%                      | 99%
-12%                      | 1000          | <250ms  | 12%                      | 0%
-13%                      | 1000          | <2.5s   | 77%                      | 64%
+23%                      | 1000          | <20s    | 77%                      | 100%
+4%                       | none          | <20s    | 0%                       | 4%
+18%                      | none          | <60s    | 0%                       | 18%

Detailed comparisons are available below for those interested:

Queries with `LIMIT 10`

Time bucket | Percentage of queries (before) | Percentage of queries (after splitting)
50ms        | 33.00% (2999 of 9004)          | 0% (0 of 100)
100ms       | 33.00% (2999 of 9004)          | 1.00% (1 of 100)
250ms       | 44.00% (3999 of 9004)          | 57.00% (57 of 100)
500ms       | 55.00% (4999 of 9004)          | 79.00% (79 of 100)
1000ms      | 77.00% (6998 of 9004)          | 80.00% (80 of 100)
2500ms      | 77.00% (7003 of 9004)          | 80.00% (80 of 100)
5000ms      | 77.00% (7004 of 9004)          | 80.00% (80 of 100)
10000ms     | 77.00% (7004 of 9004)          | 100.00% (100 of 100)
20000ms     | 77.00% (7004 of 9004)          | 100.00% (100 of 100)
30000ms     | 99.00% (8985 of 9004)          | 100.00% (100 of 100)
40000ms     | 99.00% (9003 of 9004)          | 100.00% (100 of 100)
50000ms     | 100.00% (9004 of 9004)         | 100.00% (100 of 100)
60000ms     | 100.00% (9004 of 9004)         | 100.00% (100 of 100)
Queries with `LIMIT 100`

Time bucket | Percentage of queries (before) | Percentage of queries (after splitting)
50ms        | 29.00% (2934 of 10000)         | 0% (0 of 100)
100ms       | 29.00% (2978 of 10000)         | 0% (0 of 100)
250ms       | 39.00% (3975 of 10000)         | 31.00% (31 of 100)
500ms       | 50.00% (5000 of 10000)         | 70.00% (70 of 100)
1000ms      | 59.00% (5984 of 10000)         | 79.00% (79 of 100)
2500ms      | 79.00% (7996 of 10000)         | 80.00% (80 of 100)
5000ms      | 80.00% (8000 of 10000)         | 80.00% (80 of 100)
10000ms     | 80.00% (8000 of 10000)         | 99.00% (99 of 100)
20000ms     | 80.00% (8000 of 10000)         | 100.00% (100 of 100)
30000ms     | 99.00% (9999 of 10000)         | 100.00% (100 of 100)
40000ms     | 100.00% (10000 of 10000)       | 100.00% (100 of 100)
50000ms     | 100.00% (10000 of 10000)       | 100.00% (100 of 100)
60000ms     | 100.00% (10000 of 10000)       | 100.00% (100 of 100)
Queries with `LIMIT 1000`

Time bucket | Percentage of queries (before) | Percentage of queries (after splitting)
50ms        | 0% (0 of 904)                  | 0% (0 of 100)
100ms       | 0% (1 of 904)                  | 0% (0 of 100)
250ms       | 12.00% (114 of 904)            | 0% (0 of 100)
500ms       | 30.00% (276 of 904)            | 21.00% (21 of 100)
1000ms      | 55.00% (499 of 904)            | 41.00% (41 of 100)
2500ms      | 77.00% (700 of 904)            | 64.00% (64 of 100)
5000ms      | 77.00% (704 of 904)            | 77.00% (77 of 100)
10000ms     | 77.00% (704 of 904)            | 98.00% (98 of 100)
20000ms     | 77.00% (704 of 904)            | 100.00% (100 of 100)
30000ms     | 88.00% (804 of 904)            | 100.00% (100 of 100)
40000ms     | 99.00% (901 of 904)            | 100.00% (100 of 100)
50000ms     | 99.00% (903 of 904)            | 100.00% (100 of 100)
60000ms     | 100.00% (904 of 904)           | 100.00% (100 of 100)
Queries with no LIMIT

Time bucket | Percentage of queries (before) | Percentage of queries (after splitting)
50ms        | 0% (0 of 28)                   | 0% (0 of 50)
100ms       | 0% (0 of 28)                   | 0% (0 of 50)
250ms       | 0% (0 of 28)                   | 0% (0 of 50)
500ms       | 0% (0 of 28)                   | 0% (0 of 50)
1000ms      | 0% (0 of 28)                   | 0% (0 of 50)
2500ms      | 0% (0 of 28)                   | 0% (0 of 50)
5000ms      | 0% (0 of 28)                   | 0% (0 of 50)
10000ms     | 0% (0 of 28)                   | 0% (0 of 50)
20000ms     | 0% (0 of 28)                   | 4.00% (2 of 50)
30000ms     | 0% (0 of 28)                   | 16.00% (8 of 50)
40000ms     | 0% (0 of 28)                   | 16.00% (8 of 50)
50000ms     | 0% (0 of 28)                   | 18.00% (9 of 50)
60000ms     | 0% (0 of 28)                   | 18.00% (9 of 50)

Postgres-in-Docker vs. native Postgres

Addition made Feb 20, 2021

In our original article, we did not clarify the performance impact of running Postgres inside of Docker with a volume bind mount. Thorsten Ball raised this with us as a potential source of IO performance differences.

We ran all tests above with Postgres in Docker, using a volume bind mount (the osxfs driver, not the experimental FUSE gRPC driver.)

We additionally ran the same table-splitting benchmarks on a native Postgres server (reproduction steps here) and found the following key changes:

CPU usage & memory usage: approximately the same

CPU and memory usage was approximately the same as in our Docker Postgres tests.

We anticipated this would be the case, as the Macbook does have VT-x virtualization enabled (the default on all i7/i9 Macbooks, and confirmed through sysctl kern.hv_support).

Indexing speed was ~88% faster

Running the statements to split up the large table into multiple smaller ones, i.e.:

CREATE TABLE files_000 AS SELECT * FROM files WHERE id >= 0 AND id < 50000;
CREATE TABLE files_001 AS SELECT * FROM files WHERE id >= 50000 AND id < 100000;
...

was much faster in native Postgres, taking about 2-8s for each table instead of 20-40s previously, and only 15m in total instead of 2h before.

Parallel creation of the Trigram indexes using e.g.:

CREATE INDEX IF NOT EXISTS files_000_contents_trgm_idx ON files_000 USING GIN (contents gin_trgm_ops);

was also much faster, taking only 23m compared to ~3h with Docker.

Query performance is 12-99% faster, depending on query

We re-ran the same 350 queries as in our earlier table-splitting benchmark, and found the following substantial improvements:

  1. Queries that were previously very slow saw a ~12% improvement. This is likely due to the IO needed when interfacing with the 200 separate tables being faster natively.
  2. Queries that were previously in the middle ground saw meager ~5% improvements.
  3. Queries that were previously fairly fast (likely searching over only one or two tables before returning) saw substantial 16-99% improvements.
Exhaustive comparison details (negative change is good)
Change | Time bucket | Queries under bucket before | Queries under bucket after
0%     | 500s        | 350 of 350                  | 350 of 350
-12%   | 100s        | 309 of 350                  | 350 of 350
-12%   | 50s         | 309 of 350                  | 350 of 350
-12%   | 40s         | 308 of 350                  | 350 of 350
-12%   | 30s         | 308 of 350                  | 349 of 350
-7%    | 25s         | 307 of 350                  | 330 of 350
-8%    | 20s         | 302 of 350                  | 330 of 350
-5%    | 10s         | 297 of 350                  | 311 of 350
-26%   | 5s          | 237 of 350                  | 319 of 350
-7%    | 2500ms      | 224 of 350                  | 240 of 350
-9%    | 2000ms      | 219 of 350                  | 240 of 350
-9%    | 1500ms      | 219 of 350                  | 240 of 350
-16%   | 1000ms      | 200 of 350                  | 237 of 350
-14%   | 750ms       | 190 of 350                  | 221 of 350
-23%   | 500ms       | 170 of 350                  | 220 of 350
-59%   | 250ms       | 88 of 350                   | 217 of 350
-99%   | 100ms       | 1 of 350                    | 168 of 350
-99%   | 50ms        | 1 of 350                    | 168 of 350

Conclusions

We think the following learnings are most important:

  • .git directories, even with --depth=1 clones, account for 30% of a repository’s size on disk (at least among the top 10,000 GitHub repositories.)
  • Files > 1 MiB (often binaries) account for another 51% of the data size on disk of repositories.
  • On only a Macbook Pro, it is possible to get Postgres Trigram regex search over 10,000 repositories to run most reasonable queries in under 5s - and certainly much faster with more hardware.
  • pg_trgm performs single-threaded indexing and querying, unless you split your data up into multiple tables.
  • By default, a Postgres text column is compressed on disk out of the box, resulting in a 23% reduction in size (with the files we inserted.)
  • pg_trgm GIN indexes take around 26% of the size of your data on disk. So if indexing 1 GiB of raw text, expect Postgres to store that text in around ~827 MiB, plus 279 MiB for the GIN trigram index.
  • Splitting your data into multiple tables when using pg_trgm is an obvious win, as it allows for parallel indexing, which can be the difference between 4h and 22h. It also reduces the risk of an indexing failure after 22h due to e.g. lack of memory, and uses much less peak memory overall.
  • Docker bind mounts (not volumes) are quite slow outside of Linux host environments (there are many other articles on this subject.)

If you are looking for fast regexp or code search today, consider:

  • Sourcegraph (disclaimer: the author works here, but this article is not endorsed or affiliated in any way)
  • Zoekt
  • Ripgrep

Follow this devlog for updates as we continue investigating faster ways to do regexp & ngram search at large scales.