fix: Drop stale routing entries on NoShardsAvailable failures. Closes #6324 by osyniakov · Pull Request #6566 · quickwit-oss/quickwit

osyniakov · 2026-06-30T12:13:42Z

Description

When an index was deleted and recreated, the router's per-ingester routing entry for the old incarnation could stay marked as having open shards because the ingester's piggybacked routing update only covers sources it still holds. Persist retries then kept picking the dead entry and the request surfaced as a 503 until Chitchat eventually caught up.

Treat a NoShardsAvailable failure as a signal that this (leader, index_uid, source_id) has no reachable shard, and zero it out in the routing table. If no nodes remain for that (index_id, source_id), the next attempt re-queries the control plane, which returns the fresh incarnation's shards.
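As a rough illustration of the behavior described above, the following is a minimal sketch and not the actual Quickwit code. The types `NodeEntry`, `RoutingEntry`, and `NextStep` and the function `on_no_shards_available` are hypothetical stand-ins; only `open_shard_count` and `has_any_routing_candidate` are names that appear in this PR's discussion, and their real signatures may differ.

```rust
use std::collections::HashMap;

/// Hypothetical per-node routing state for one source.
#[derive(Debug, Clone, PartialEq)]
pub struct NodeEntry {
    pub open_shard_count: usize,
}

/// Hypothetical routing entry for one (index_id, source_id) pair,
/// keyed by ingester node ID.
#[derive(Debug, Default)]
pub struct RoutingEntry {
    pub nodes: HashMap<String, NodeEntry>,
}

impl RoutingEntry {
    /// Records that `node_id` answered NoShardsAvailable for this source.
    /// Unknown nodes are ignored.
    pub fn zero_out(&mut self, node_id: &str) {
        if let Some(node) = self.nodes.get_mut(node_id) {
            node.open_shard_count = 0;
        }
    }

    /// True while at least one node still advertises an open shard.
    pub fn has_any_routing_candidate(&self) -> bool {
        self.nodes.values().any(|node| node.open_shard_count > 0)
    }
}

/// What the router would do on its next persist attempt (assumed model).
#[derive(Debug, PartialEq)]
pub enum NextStep {
    RetryWithRemainingNodes,
    QueryControlPlane,
}

/// Zeroes the failing node, then decides whether any candidate is left.
pub fn on_no_shards_available(entry: &mut RoutingEntry, node_id: &str) -> NextStep {
    entry.zero_out(node_id);
    if entry.has_any_routing_candidate() {
        NextStep::RetryWithRemainingNodes
    } else {
        NextStep::QueryControlPlane
    }
}
```

In this model, the control plane is consulted only once every node for the source has been zeroed, which is the point at which the fresh incarnation's shards would be picked up.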

How was this PR tested?

Automated tests

Fixes #6324


Clarifies the hidden contract the fix leans on: the zero-out and piggybacked routing update run under the same lock, which is what keeps the rate-limited subcase of NoShardsAvailable correct.

Address PR review: introduce RoutingTable::mark_node_no_shards instead of calling apply_capacity_update(.., 0, 0). The new method only zeros the open_shard_count and leaves the capacity_score untouched (capacity is a node-level WAL signal independent of any specific source). It also no-ops on missing entries/nodes and on incarnation mismatches, so a narrowing signal can never roll back a fresher entry.

…shards Address PR review: replace the != short-circuit with the same Less / Equal / Greater cmp match used by apply_capacity_update and merge_from_shards. A stale signal (entry newer than the failure's index_uid) is still ignored; a signal for a newer incarnation now advances the entry, drops stale nodes, and forces a CP re-seed — consistent with how the rest of the routing table handles monotonic incarnations.
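The incarnation handling described in the two review follow-ups above can be sketched as follows. This is an assumption-laden illustration, not the merged implementation: the incarnation is modeled as a plain `u64` standing in for the incarnation part of an `index_uid`, the struct layout is hypothetical, and the real `RoutingTable::mark_node_no_shards` signature is not shown in this PR page.

```rust
use std::cmp::Ordering;
use std::collections::HashMap;

/// Hypothetical per-node state. `capacity_score` is a node-level signal
/// and is deliberately left untouched by the zero-out.
#[derive(Debug, Clone, PartialEq)]
pub struct NodeEntry {
    pub open_shard_count: usize,
    pub capacity_score: usize,
}

/// Hypothetical routing entry for one (index_id, source_id) pair.
#[derive(Debug)]
pub struct RoutingEntry {
    /// Stand-in for the incarnation of the index_uid (assumed monotonic).
    pub incarnation: u64,
    pub nodes: HashMap<String, NodeEntry>,
}

impl RoutingEntry {
    /// Applies a NoShardsAvailable signal observed for `signal_incarnation`.
    pub fn mark_node_no_shards(&mut self, signal_incarnation: u64, node_id: &str) {
        match self.incarnation.cmp(&signal_incarnation) {
            // The entry is newer than the failure: stale signal, ignore it
            // so it can never roll back a fresher entry.
            Ordering::Greater => {}
            // Same incarnation: zero only the open shard count.
            // Missing nodes are a no-op.
            Ordering::Equal => {
                if let Some(node) = self.nodes.get_mut(node_id) {
                    node.open_shard_count = 0;
                }
            }
            // The signal is for a newer incarnation: advance the entry and
            // drop the stale nodes. With no candidates left, the next
            // attempt has to re-seed from the control plane.
            Ordering::Less => {
                self.incarnation = signal_incarnation;
                self.nodes.clear();
            }
        }
    }
}
```

Matching on all three orderings keeps this path aligned with the way the commit message says `apply_capacity_update` and `merge_from_shards` treat incarnations, rather than special-casing inequality.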

osyniakov · 2026-06-30T12:15:33Z

@nadav-govari @guilload could you please review this one?

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6d5fbd0b8f


chatgpt-codex-connector · 2026-07-01T17:25:23Z

+        if let Some(node) = entry.nodes.get_mut(node_id) {
+            node.open_shard_count = 0;


Invalidate all stale leaders before retrying

When a source has more than four stale leaders (for example an index with ingest_settings.min_shards >= 5 spread across ingesters), zeroing only the node that just returned NoShardsAvailable burns one ingest attempt per stale leader. retry_batch_persist stops after MAX_PERSIST_ATTEMPTS (5), so the router can exhaust all attempts before has_any_routing_candidate becomes false and the control plane is queried; recreated indexes in that configuration can still return NoShardsAvailable until gossip clears the old entries. Consider invalidating all nodes for the failed (index_uid, source_id) when the routing update does not refresh it.
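The attempt-budget arithmetic behind this comment can be checked with a small model. This is a simplified sketch under stated assumptions: every attempt that still has a routing candidate lands on a stale leader and fails with NoShardsAvailable, and the control plane is queried at the start of the first attempt that finds no candidate. `Invalidation` and `attempt_reaching_control_plane` are hypothetical names; only `MAX_PERSIST_ATTEMPTS` and its value of 5 come from the comment.

```rust
/// Attempt budget quoted in the review comment.
const MAX_PERSIST_ATTEMPTS: usize = 5;

/// Hypothetical invalidation strategies being compared.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Invalidation {
    /// Zero only the node that just returned NoShardsAvailable.
    FailedNodeOnly,
    /// Zero every node for the failed (index_uid, source_id).
    AllNodes,
}

/// Returns the 1-based attempt on which the router first has no routing
/// candidate left and would query the control plane, or `None` if the
/// attempt budget runs out first.
pub fn attempt_reaching_control_plane(
    stale_leaders: usize,
    strategy: Invalidation,
) -> Option<usize> {
    let mut remaining = stale_leaders;
    for attempt in 1..=MAX_PERSIST_ATTEMPTS {
        if remaining == 0 {
            return Some(attempt);
        }
        // This attempt is spent on a stale leader and fails.
        remaining = match strategy {
            Invalidation::FailedNodeOnly => remaining - 1,
            Invalidation::AllNodes => 0,
        };
    }
    None
}
```

Under these assumptions, four stale leaders with per-node invalidation reach the control plane on the fifth and last attempt, five stale leaders exhaust the budget, and invalidating all nodes reaches the control plane on the second attempt regardless of how many stale leaders there are.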


claude added 4 commits June 30, 2026 10:10

fix: document routing_update invariant on NoShardsAvailable fix

599835a


osyniakov requested a review from a team as a code owner June 30, 2026 12:13

osyniakov mentioned this pull request Jun 30, 2026

fix: drop stale routing entries on NoShardsAvailable failures osyniakov/quickwit#2

Closed

osyniakov added 2 commits June 30, 2026 14:54

Merge branch 'main' into claude/fix-issue-6324-mHady

ea81dd7

Merge branch 'main' into claude/fix-issue-6324-mHady

6d5fbd0

chatgpt-codex-connector Bot reviewed Jul 1, 2026

osyniakov wants to merge 6 commits into quickwit-oss:main from osyniakov:claude/fix-issue-6324-mHady
