fix log backup state query failure handling #6950

Open
RidRisR wants to merge 2 commits into pingcap:release-1.x from RidRisR:codex/log-backup-query-failure

@RidRisR RidRisR commented Jun 15, 2026

What changed

This updates log backup tracker state queries to distinguish critical PD etcd query failures from confirmed missing metadata:

  • Treat info key query failures as unknown state instead of task-not-found.
  • Start a 10-minute critical query failure countdown for info query failures and PD etcd client creation failures.
  • Report BackupFailed / LogBackupStateQueryFailed only once per continuous failure window, and retry reporting if the status update itself fails.
  • Treat pause key query failures as partial state: keep usable info/checkpoint data, skip kernel state sync, and avoid marking the backup failed.
  • Preserve checkpoint updates when pause state is unknown.
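The countdown and report-once behavior in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `failureWindow`, `observe`, `minutes`, and `errQuery` are hypothetical names, and the only value taken from the description is the 10-minute window.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// criticalQueryWindow is the 10-minute countdown named in the description.
const criticalQueryWindow = 10 * time.Minute

// errQuery stands in for a PD etcd query or client creation failure (hypothetical).
var errQuery = errors.New("pd etcd query failed")

// failureWindow is a hypothetical helper, not the type used in the PR.
type failureWindow struct {
	firstFailure time.Time // zero while no continuous failure is in progress
	reported     bool      // true once the failure has been reported successfully
}

// observe records one query outcome at time now. report is invoked only after
// the window has elapsed, succeeds at most once per continuous failure window,
// and is retried on the next failed observation if it returns an error.
func (w *failureWindow) observe(now time.Time, queryErr error, report func() error) {
	if queryErr == nil {
		*w = failureWindow{} // a successful query ends the window
		return
	}
	if w.firstFailure.IsZero() {
		w.firstFailure = now // first failure starts the countdown
	}
	if w.reported || now.Sub(w.firstFailure) < criticalQueryWindow {
		return
	}
	if err := report(); err == nil {
		w.reported = true
	}
}

// minutes returns a fixed base time plus n minutes, for deterministic demos.
func minutes(n int) time.Time {
	return time.Unix(0, 0).Add(time.Duration(n) * time.Minute)
}

func main() {
	w := &failureWindow{}
	reports := 0
	report := func() error { reports++; return nil }
	// One failed query per minute for 13 minutes.
	for m := 0; m <= 12; m++ {
		w.observe(minutes(m), errQuery, report)
	}
	fmt.Println("reports sent:", reports) // prints "reports sent: 1"
}
```

Resetting the whole struct on success means a later outage starts a fresh countdown, which is one way to get "once per continuous failure window" semantics.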

Why

A transient PD/etcd/DNS issue could previously leave InfoExists=false and be interpreted as LogBackupTaskNotFound, incorrectly failing log backup even though task existence was unknown.
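To illustrate the distinction, here is a small sketch that models the info key lookup as three outcomes instead of a single boolean. The names `infoState`, `classifyInfo`, and `errEtcdUnavailable` are assumptions made for this example and do not come from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// infoState is a hypothetical three-way result for the info key lookup.
type infoState int

const (
	infoUnknown infoState = iota // query failed; existence is not confirmed either way
	infoMissing                  // query succeeded and the key is absent
	infoPresent                  // query succeeded and the key exists
)

func (s infoState) String() string {
	switch s {
	case infoMissing:
		return "missing"
	case infoPresent:
		return "present"
	default:
		return "unknown"
	}
}

// errEtcdUnavailable stands in for a transient PD/etcd/DNS failure (hypothetical).
var errEtcdUnavailable = errors.New("pd etcd unavailable")

// classifyInfo maps a lookup result to a state. Only infoMissing would justify
// a task-not-found condition; a query error yields infoUnknown.
func classifyInfo(found bool, queryErr error) infoState {
	if queryErr != nil {
		return infoUnknown
	}
	if !found {
		return infoMissing
	}
	return infoPresent
}

func main() {
	fmt.Println(classifyInfo(false, errEtcdUnavailable)) // prints "unknown"
	fmt.Println(classifyInfo(false, nil))                // prints "missing"
	fmt.Println(classifyInfo(true, nil))                 // prints "present"
}
```

With a plain boolean, the first and second calls would be indistinguishable, which is the failure mode described above.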

Validation

  • GOCACHE=/tmp/go-cache go test ./pkg/backup/backup -count=1
  • GOCACHE=/tmp/go-cache go test -race ./pkg/backup/backup -count=1

ti-chi-bot Bot commented Jun 15, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

ti-chi-bot Bot commented Jun 15, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sdojjy for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot requested a review from howardlau1999 June 15, 2026 09:50
@ti-chi-bot ti-chi-bot Bot added the size/XXL label Jun 15, 2026
@RidRisR RidRisR marked this pull request as ready for review June 15, 2026 10:01

ti-chi-bot Bot commented Jun 15, 2026

@RidRisR: The following tests failed. Say /retest to rerun all failed tests, or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-e2e-kind-scale-simultaneously | 02407b6 | link | false | /test pull-e2e-kind-scale-simultaneously |
| pull-e2e-kind-tngm | 02407b6 | link | false | /test pull-e2e-kind-tngm |
| pull-e2e-kind-dmcluster | 02407b6 | link | false | /test pull-e2e-kind-dmcluster |
| pull-e2e-kind-basic | 02407b6 | link | false | /test pull-e2e-kind-basic |
| pull-e2e-kind-tidbcluster | 02407b6 | link | false | /test pull-e2e-kind-tidbcluster |
| pull-e2e-kind-br | 02407b6 | link | false | /test pull-e2e-kind-br |
| pull-e2e-kind-across-kubernetes | 02407b6 | link | false | /test pull-e2e-kind-across-kubernetes |
| pull-e2e-kind-serial | 02407b6 | link | false | /test pull-e2e-kind-serial |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
