Skip to content

Refactor Postgres migrator#6573

Open
nadav-govari wants to merge 4 commits into
mainfrom
nadav/migrator
Open

Refactor Postgres migrator#6573
nadav-govari wants to merge 4 commits into
mainfrom
nadav/migrator

Conversation

@nadav-govari

@nadav-govari nadav-govari commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

Description

Advisory locks reading: https://flaviodelgrosso.com/blog/postgresql-advisory-locks

Issue 1

Today, the Postgres migrator begins a transaction, and runs all migrations inside their own transaction block, such that if any one fails, the whole thing gets unwound.

This is undesirable for two reasons:

  1. The migrations build on each other; migration 3 failing does not mean theres a problem with migration 2
  2. Concurrent DDL cannot be run inside of a transaction block

This refactors the migrator to run each migration directly, without a wrapping block. sqlx already applies migrations transactionally for us.

Issue 2

Migrations must be successfully applied on startup for a metapod to mark itself ready on a deployment. Sqlx takes out a Postgres advisory lock, which is a Postgres feature that allows for utilizing Postgres' atomicity to lock on custom resources, which you define with a key of (int, int). It then runs the migrations until completion.

This is a problem for any migration that takes more than the readiness timeout, which is currently 5 minutes. The readiness probe will kill the metastore pod mid applying migrations, and will start crashlooping. Additionally, other metastore pods will queue up to acquire this same lock so that they can too apply migrations.

The fix for this is as follows:

  1. Migrations that are absolutely required for functionality stay where they are, with diesel guard inspecting all migrations in CI and preventing migrations that acquire exclusive locks on a table from being merged.
  2. Migrations that can be applied concurrently, such as index creation, are moved to a second migrations directory introduced in this PR, postgres_deferred, and applied async post-deployment. To prevent the same racing behavior described above, Postgres advisory locks are again reused. A pod will attempt to acquire the advisory lock; if it succeeds, it is the migrations leader and will attempt to apply the migrations. If it doesn't, then it will assume another pod has the lock, and continue on with its day. At all points, live traffic is still served.

Finally, ignore_missing on migrations is marked true. This will allow us forwards and backwards compatibility with migrations and binaries, as of this change. It will be up to us to ensure that the code also respects this.

How was this PR tested?

Postgres integ tests, local cluster tests incoming.

@nadav-govari nadav-govari requested a review from a team as a code owner July 2, 2026 12:15

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a1b638bbbd

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread quickwit/quickwit-metastore/src/metastore/postgres/migrator.rs
Comment thread quickwit/quickwit-metastore/migrations/postgresql_deferred/README.md Outdated
Comment thread quickwit/quickwit-metastore/src/metastore/postgres/migrator.rs
Comment thread quickwit/quickwit-metastore/migrations/postgresql_deferred/README.md Outdated
Comment thread quickwit/quickwit-metastore/src/metastore/postgres/metrics.rs Outdated
Comment thread quickwit/quickwit-metastore/src/metastore/postgres/metastore.rs Outdated
Comment thread quickwit/Makefile
Comment thread quickwit/quickwit-metastore/src/metastore/postgres/migrator.rs Outdated
Comment thread quickwit/quickwit-metastore/src/metastore/postgres/migrator.rs
Comment thread quickwit/quickwit-metastore/src/metastore/postgres/migrator.rs
Comment thread quickwit/quickwit-metastore/src/metastore/postgres/migrator.rs Outdated
Comment thread quickwit/quickwit-metastore/src/metastore/postgres/migrator.rs Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 50775793dd

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

`_sqlx_migrations` table with the required track, so version numbers must be globally unique across both dirs and each
migration must be idempotent. See `migrations/postgresql_deferred/README.md` for authoring rules.
```
sqlx migrate run --database-url postgres://quickwit-dev:quickwit-dev@localhost:5432/quickwit-metastore-dev --source migrations/postgresql_deferred

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Add ignore-missing to the deferred CLI command

Because the deferred directory shares _sqlx_migrations with the required track, this command will normally run after the required migration versions are already recorded. I checked sqlx-cli 0.8.6: migrate run rejects applied versions that are absent from the selected --source unless --ignore-missing is set (see validate_applied_migrations in https://raw.githubusercontent.com/launchbadge/sqlx/v0.8.6/sqlx-cli/src/migrate.rs), so the documented deferred command fails with VersionMissing before applying anything. Please include --ignore-missing in this manual workflow.

Useful? React with 👍 / 👎.

Comment on lines +11 to +12
DROP INDEX CONCURRENTLY IF EXISTS foo_idx;
CREATE INDEX CONCURRENTLY IF NOT EXISTS foo_idx ON foo (bar);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Split concurrent index DDL into separate migrations

This example still puts DROP INDEX CONCURRENTLY and CREATE INDEX CONCURRENTLY in one -- no-transaction sqlx migration. I checked sqlx 0.8.6: no-transaction migrations are sent as one execute(&*migration.sql) call (https://docs.rs/crate/sqlx-postgres/0.8.6/source/src/migrate.rs), and PostgreSQL treats multiple semicolon-separated statements in one Query as a single implicit transaction (https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-FLOW-MULTI-STATEMENT), so following this example fails with cannot run inside a transaction block. Please document one concurrent DDL statement per migration or an explicit split mechanism.

Useful? React with 👍 / 👎.

Comment on lines +417 to +420
migrations(&pool, false)
.run_required(test_migrator().await)
.await
.unwrap_err();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reset test migration before read-only failure check

This test doesn't clear version 90000 before opening the read-only pool. Several earlier serial tests leave that migration row applied, so in a reused QW_TEST_DATABASE_URL this call can be a no-op that succeeds and makes unwrap_err() panic instead of exercising the read-only failure path; reset/delete the test migration with a writable connection before this check.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants