raft: expose controller leader_for gauge on public metrics #30826

Open
travisdowns wants to merge 2 commits into redpanda-data:dev from travisdowns:td-CORE-16394-expose-controller-leader

Conversation

@travisdowns

Member

The raft leader_for 0/1 leadership gauge is only registered on the internal
metrics endpoint (vectorized_raft_leader_for). There is no public-metrics
signal for which node is the controller leader, so external consumers
(dashboards, alerts) that only scrape /public_metrics cannot identify the
controller leader.

This exposes the gauge on the public endpoint as redpanda_raft_leader_for,
restricted to the controller group so we don't add a per-partition public
series for every raft group. The registration lives in
consensus::setup_public_metrics() next to where the internal leader_for
gauge is set up, using a consensus-owned public metrics handle and the
redpanda_-prefixed public label names. A value of 1 on a node means that node
is the controller leader; 0 otherwise.
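
As a rough illustration of how an external consumer might use the new series, the sketch below reads the gauge out of a Prometheus text-format scrape of /public_metrics and reports which nodes claim controller leadership. The helper names (`gauge_value`, `controller_leaders`), the node names, and the label names in the sample scrape are assumptions made for illustration; the exact public label set is defined by the PR's code, not by this sketch.

```python
import re
from typing import Dict, List, Optional

# Matches one Prometheus text-format sample line: name, optional {labels}, value.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')


def gauge_value(exposition: str,
                metric: str = "redpanda_raft_leader_for") -> Optional[float]:
    """Sum every sample of `metric` in a scrape; None if the metric is absent."""
    total = None
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if m and m.group(1) == metric:
            total = (total or 0.0) + float(m.group(3))
    return total


def controller_leaders(scrapes: Dict[str, str]) -> List[str]:
    """Return the nodes whose scrape reports the gauge as 1."""
    return sorted(n for n, text in scrapes.items() if gauge_value(text) == 1.0)


# Hypothetical scrape bodies; the label names here are assumed, not confirmed.
LABELS = ('{redpanda_namespace="redpanda",redpanda_topic="controller",'
          'redpanda_partition="0"}')
scrapes = {
    "node-a": "# TYPE redpanda_raft_leader_for gauge\n"
              "redpanda_raft_leader_for" + LABELS + " 1.000000\n",
    "node-b": "redpanda_raft_leader_for" + LABELS + " 0.000000\n",
    "node-c": "",  # e.g. a node that does not report the series
}
print(controller_leaders(scrapes))
```

Summing samples rather than matching a specific label set keeps the sketch independent of the label names, which is reasonable only because the public series is restricted to the controller group.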

The ducktape test cluster_metrics_reported_only_by_leader_test is extended to
assert the controller leader_for gauge reads 1 on the leader and 0 on every
other running node, on both the internal and public endpoints, across the
restart, failover, and no-quorum transitions the test already drives.
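
The invariant described above can be sketched as a small standalone check. This is not the ducktape test code; `leader_for_violations`, the node names, and the endpoint keys are hypothetical, and treating the no-quorum case as "every running node reads 0" is an assumption drawn from the wording "0 on every other running node".

```python
from typing import Dict, List, Optional


def leader_for_violations(readings: Dict[str, Dict[str, float]],
                          leader: Optional[str]) -> List[str]:
    """Return a message for every reading that breaks the invariant.

    readings maps node -> {endpoint name -> gauge value} for running nodes.
    leader is the expected controller leader, or None when there is no
    quorum (assumed here to mean every node should read 0).
    """
    problems = []
    for node, by_endpoint in sorted(readings.items()):
        expected = 1.0 if node == leader else 0.0
        for endpoint, value in sorted(by_endpoint.items()):
            if value != expected:
                problems.append(
                    f"{node} {endpoint}: expected {expected:g}, got {value:g}")
    return problems


# Hypothetical readings: node-b still reports leadership on the public endpoint.
readings = {
    "node-a": {"internal": 1.0, "public": 1.0},
    "node-b": {"internal": 0.0, "public": 1.0},
}
print(leader_for_violations(readings, leader="node-a"))
```

Checking both endpoints in the same pass mirrors the description's requirement that the internal and public gauges agree at each transition.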

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

Improvements

  • The controller leader is now observable on the public metrics endpoint via the new redpanda_raft_leader_for gauge (1 on the controller leader, 0 otherwise).

The raft leader_for 0/1 leadership gauge was only registered on the
internal metrics endpoint. Expose it on the public endpoint as
redpanda_raft_leader_for, restricted to the controller group so external
consumers can identify the controller leader without adding a
per-partition public series for every raft group.
Extend cluster_metrics_reported_only_by_leader_test to check the raft
leader_for gauge for the controller group: it must read 1 on the
controller leader and 0 on every other running node, on both the internal
(vectorized_raft_leader_for) and public (redpanda_raft_leader_for)
endpoints, across the restart, failover, and no-quorum transitions the
test already drives.
Copilot AI review requested due to automatic review settings June 16, 2026 19:09

Copilot AI left a comment

Contributor

Pull request overview

This PR makes the controller leader externally observable via /public_metrics by exposing the existing Raft leader_for gauge on the public metrics handle (as redpanda_raft_leader_for), limited to the controller Raft group to avoid high-cardinality public series.

Changes:

  • Register a public raft.leader_for gauge for the controller NTP only, using metrics::public_metric_groups and redpanda_-prefixed label names.
  • Extend the ducktape cluster metrics test to assert the controller leader_for gauge is 1 on the controller leader and 0 on all other running nodes, for both internal and public endpoints.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| tests/rptest/tests/cluster_metrics_test.py | Adds assertions that the controller raft_leader_for gauge is correct on both /metrics and /public_metrics through restart/failover/no-quorum transitions. |
| src/v/raft/consensus.h | Adds a metrics::public_metric_groups member to allow consensus-owned public metric registration. |
| src/v/raft/consensus.cc | Registers leader_for on the public metrics handle for the controller group only, with public label names. |

@vbotbuildovich

Collaborator

CI test results

test results on build#85870
| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| FLAKY(PASS) | TxAtomicProduceConsumeTest | test_basic_tx_consumer_transform_produce | {"with_failures": true} | integration | https://buildkite.com/redpanda/redpanda/builds/85870#019ed1eb-21db-43f9-8183-8d8b0fe56fad | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0046, p0=1.0000, reject_threshold=0.0100, adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TxAtomicProduceConsumeTest&test_method=test_basic_tx_consumer_transform_produce |

Comment thread src/v/raft/consensus.cc
}

// Public metrics carry redpanda_-prefixed label names, unlike the internal
// (bare) labels used by setup_metrics.

Member

cool story Opus

Comment thread src/v/raft/consensus.cc
[this] { return is_elected_leader(); },
sm::description("Indicates if this node is the controller leader"),
labels)
.aggregate({sm::shard_label})});

Member

Remove, doesn't aggregate.

@StephanDollberg StephanDollberg left a comment

Member

:(
