Skip to content

fix(access-control-service): include port in computing unit pod URI and use Envoy Gateway for distributed CUs#5629

Open
aicam wants to merge 6 commits into
apache:mainfrom
aicam:fix/cu-pod-uri-missing-port
Open

fix(access-control-service): include port in computing unit pod URI and use Envoy Gateway for distributed CUs#5629
aicam wants to merge 6 commits into
apache:mainfrom
aicam:fix/cu-pod-uri-missing-port

Conversation

@aicam

@aicam aicam commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

Make the in-cluster address of a computing unit come from a single source of truth — the URI recorded when its pod is created — and ensure that URI is complete (includes the port). This lets the gateway route a user to a computing unit located anywhere it can reach (in the local cluster, another cluster, or an external host), instead of being limited to a reconstructed in-cluster address. See #5630.

Two related changes:

1. Include the port in the generated pod URI (computing-unit-managing-service)

KubernetesClient.generatePodURI builds the address stored as the computing unit's uri (via setUri in ComputingUnitManagingResource) and returned to clients as nodeAddresses. The pod's container listens on KubernetesConfig.computeUnitPortNumber (declared with withContainerPort(...) in the same file), but the generated URI omitted the port, so the persisted address was not directly connectable. The port is now appended:

s"...svc.cluster.local:${KubernetesConfig.computeUnitPortNumber}"

2. Route using the recorded URI (access-control-service)

AccessControlResource rebuilt the computing unit's address from KubernetesConfig on every authorization request, duplicating the construction logic in generatePodURI and pinning every CU to the local cluster. It now reads the URI recorded for the unit and returns it as the Host for the gateway to route to. If no URI has been recorded, the unit is not routable and the request is refused with 403 (no in-cluster fallback, per review).

Routing flow

The access-control service is the gateway's external authorizer; the Host it returns is the upstream Envoy forwards the (upgraded) connection to. Because that host comes from the unit's recorded URI, the same gateway can reach computing units in different locations:

flowchart LR
    FE["Frontend<br/>(/wsapi?cuid=N)"] --> GW["Envoy Gateway"]
    GW -. "ext-auth: authorize + get Host" .-> ACS["access-control-service"]
    ACS -- "read recorded uri for CU N" --> DB[("workflow_computing_unit")]
    ACS -- "Host = recorded uri<br/>(or 403 if none)" --> GW
    GW == "dynamic forward proxy<br/>to returned Host" ==> R{Where the CU lives}
    R --> CU1["In-cluster CU pod<br/>computing-unit-N...svc.cluster.local:port"]
    R --> CU2["CU in another cluster"]
    R --> CU3["External / remote CU host:port"]
Loading

Any related issues, documentation, discussions?

How was this PR tested?

On live deployment.
Screenshot from 2026-06-13 13-31-00

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)

`KubernetesClient.generatePodURI` builds the in-cluster address that is
stored as the computing unit's `uri` (via `setUri` in
`ComputingUnitManagingResource` and returned to clients as
`nodeAddresses`). The pod's container listens on
`KubernetesConfig.computeUnitPortNumber` (set via `withContainerPort`),
but the generated URI omitted the port, so the stored address pointed at
the default port instead of the one the computing unit actually serves
on. Append the configured port so the persisted URI is directly
connectable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added fix platform Non-amber Scala service paths labels Jun 11, 2026
@codecov-commenter

codecov-commenter commented Jun 11, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 66.66667% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.95%. Comparing base (891d2ad) to head (b3f713f).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...exera/service/resource/AccessControlResource.scala 72.72% 0 Missing and 3 partials ⚠️
.../apache/texera/service/util/KubernetesClient.scala 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##               main    #5629   +/-   ##
=========================================
  Coverage     52.95%   52.95%           
  Complexity     2627     2627           
=========================================
  Files          1090     1090           
  Lines         42210    42217    +7     
  Branches       4534     4538    +4     
=========================================
+ Hits          22353    22357    +4     
  Misses        18546    18546           
- Partials       1311     1314    +3     
Flag Coverage Δ *Carryforward flag
access-control-service 70.44% <72.72%> (-0.48%) ⬇️
agent-service 34.36% <ø> (ø) Carriedforward from 891d2ad
amber 53.11% <ø> (ø) Carriedforward from 891d2ad
computing-unit-managing-service 1.65% <0.00%> (ø)
config-service 56.71% <ø> (ø)
file-service 57.06% <ø> (ø)
frontend 47.93% <ø> (ø) Carriedforward from 891d2ad
pyamber 89.77% <ø> (ø) Carriedforward from 891d2ad
python 90.73% <ø> (ø) Carriedforward from 891d2ad
workflow-compiling-service 58.69% <ø> (ø)

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

The access-control service rebuilt the computing unit's in-cluster
address from `KubernetesConfig` on every authorization request, which
duplicates the address-construction logic already in
`KubernetesClient.generatePodURI` and can drift from it (e.g. service
name vs. pool-name conventions). Read the URI persisted for the unit
(written by the managing service when the pod is created) and route to
it directly, so the routing target comes from a single source of truth.
Fall back to the previously constructed in-cluster address when no URI
has been recorded for the unit yet.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@aicam aicam changed the title fix: include port in computing unit pod URI fix: route computing units to their recorded pod URI (incl. port) Jun 11, 2026
@aicam aicam requested a review from bobbai00 June 11, 2026 21:21
@aicam aicam changed the title fix: route computing units to their recorded pod URI (incl. port) fix(access-control-service): include port in computing unit pod URI and use Envoy Gateway for distributed CUs Jun 11, 2026

@bobbai00 bobbai00 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM. Please include a diagram of frontend able to connect to CUs at different places through the envoy gateway

Per review, drop the in-cluster address fallback in the access-control
service. A computing unit is routed only to the URI recorded for it; if
no URI has been recorded the unit is not routable, so the authorization
request is refused (403) instead of falling back to a reconstructed
in-cluster address. Also drops the now-unused KubernetesConfig import.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@aicam aicam requested a review from bobbai00 June 13, 2026 20:32

@bobbai00 bobbai00 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bobbai00

Copy link
Copy Markdown
Contributor

@aicam please add test cases to raise the test coverage of your changes

Add coverage for the dynamic routing logic in AccessControlResource:
record a URI on the existing test computing unit and assert the rewritten
Host header carries it, and add two computing units (no URI, blank URI)
that the user can access but which are refused with FORBIDDEN.

This also fixes the existing OK-path tests, which previously failed under
the refuse-when-no-URI behavior because the test unit had no recorded URI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@aicam

aicam commented Jun 15, 2026

Copy link
Copy Markdown
Contributor Author

Added test coverage for the dynamic routing logic in AccessControlResourceSpec (commit e103ad9):

  • The existing test computing unit now has a recorded URI, and the "user has access" test asserts the rewritten Host header carries exactly that URI (covers the route-to-recorded-URI branch).
  • Two new units the user can access are refused with FORBIDDEN: one with no URI recorded, and one whose URI is blank/whitespace (covers the refuse-when-no-URI branch).

This also fixes the existing OK-path tests, which were failing under the new refuse-when-no-URI behavior because the test unit had no recorded URI. All 12 tests in the spec pass locally.

# Conflicts:
#	access-control-service/src/main/scala/org/apache/texera/service/resource/AccessControlResource.scala
@aicam aicam enabled auto-merge June 15, 2026 21:59
@aicam aicam added this pull request to the merge queue Jun 15, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fix platform Non-amber Scala service paths

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(access-control-service): route computing units by their recorded URI to enable out-of-cluster CU distribution

3 participants