Refactor Postgres migrator#6573
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a1b638bbbd
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 50775793dd
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| `_sqlx_migrations` table with the required track, so version numbers must be globally unique across both dirs and each | ||
| migration must be idempotent. See `migrations/postgresql_deferred/README.md` for authoring rules. | ||
| ``` | ||
| sqlx migrate run --database-url postgres://quickwit-dev:quickwit-dev@localhost:5432/quickwit-metastore-dev --source migrations/postgresql_deferred |
There was a problem hiding this comment.
Add ignore-missing to the deferred CLI command
Because the deferred directory shares _sqlx_migrations with the required track, this command will normally run after the required migration versions are already recorded. I checked sqlx-cli 0.8.6: migrate run rejects applied versions that are absent from the selected --source unless --ignore-missing is set (see validate_applied_migrations in https://raw.githubusercontent.com/launchbadge/sqlx/v0.8.6/sqlx-cli/src/migrate.rs), so the documented deferred command fails with VersionMissing before applying anything. Please include --ignore-missing in this manual workflow.
Useful? React with 👍 / 👎.
| DROP INDEX CONCURRENTLY IF EXISTS foo_idx; | ||
| CREATE INDEX CONCURRENTLY IF NOT EXISTS foo_idx ON foo (bar); |
There was a problem hiding this comment.
Split concurrent index DDL into separate migrations
This example still puts DROP INDEX CONCURRENTLY and CREATE INDEX CONCURRENTLY in one -- no-transaction sqlx migration. I checked sqlx 0.8.6: no-transaction migrations are sent as one execute(&*migration.sql) call (https://docs.rs/crate/sqlx-postgres/0.8.6/source/src/migrate.rs), and PostgreSQL treats multiple semicolon-separated statements in one Query as a single implicit transaction (https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-FLOW-MULTI-STATEMENT), so following this example fails with cannot run inside a transaction block. Please document one concurrent DDL statement per migration or an explicit split mechanism.
Useful? React with 👍 / 👎.
| migrations(&pool, false) | ||
| .run_required(test_migrator().await) | ||
| .await | ||
| .unwrap_err(); |
There was a problem hiding this comment.
Reset test migration before read-only failure check
This test doesn't clear version 90000 before opening the read-only pool. Several earlier serial tests leave that migration row applied, so in a reused QW_TEST_DATABASE_URL this call can be a no-op that succeeds and makes unwrap_err() panic instead of exercising the read-only failure path; reset/delete the test migration with a writable connection before this check.
Useful? React with 👍 / 👎.
Description
Advisory locks reading: https://flaviodelgrosso.com/blog/postgresql-advisory-locks
Issue 1
Today, the Postgres migrator begins a transaction, and runs all migrations inside their own transaction block, such that if any one fails, the whole thing gets unwound.
This is undesirable for two reasons:
This refactors the migrator to run each migration directly, without a wrapping block. sqlx already applies migrations transactionally for us.
Issue 2
Migrations must be successfully applied on startup for a metapod to mark itself ready on a deployment. Sqlx takes out a Postgres advisory lock, which is a Postgres feature that allows for utilizing Postgres' atomicity to lock on custom resources, which you define with a key of
(int, int). It then runs the migrations until completion.This is a problem for any migration that takes more than the readiness timeout, which is currently 5 minutes. The readiness probe will kill the metastore pod mid applying migrations, and will start crashlooping. Additionally, other metastore pods will queue up to acquire this same lock so that they can too apply migrations.
The fix for this is as follows:
postgres_deferred, and applied async post-deployment. To prevent the same racing behavior described above, Postgres advisory locks are again reused. A pod will attempt to acquire the advisory lock; if it succeeds, it is the migrations leader and will attempt to apply the migrations. If it doesn't, then it will assume another pod has the lock, and continue on with its day. At all points, live traffic is still served.Finally, ignore_missing on migrations is marked true. This will allow us forwards and backwards compatibility with migrations and binaries, as of this change. It will be up to us to ensure that the code also respects this.
How was this PR tested?
Postgres integ tests, local cluster tests incoming.