Skip to content

refactor(bigquery-jdbc): optimize thread management and unify discovery logic in DatabaseMetaData methods#13560

Open
keshavdandeva wants to merge 10 commits into
mainfrom
jdbc/refactor-metadata-methods
Open

refactor(bigquery-jdbc): optimize thread management and unify discovery logic in DatabaseMetaData methods#13560
keshavdandeva wants to merge 10 commits into
mainfrom
jdbc/refactor-metadata-methods

Conversation

@keshavdandeva

@keshavdandeva keshavdandeva commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

b/520407325
b/520406763

This PR refactors and optimizes the database metadata retrieval methods in the BigQuery JDBC driver. It resolves thread management inefficiencies, eliminates duplicate API blocks, ensures consistent project discovery support, and introduces consistent error propagation across all asynchronous metadata methods.

Key Changes

1. Catalog-Based Routing in getSchemas

  • Synchronous Single-Catalog Path: If a specific catalog (project) is requested, getSchemas now executes completely synchronously on the calling thread, eliminating background executor and queue overhead.
  • Parallel Multi-Catalog Path: If no catalog is specified, the query executor runs parallel scans across all accessible projects (primary, additional, and discovered) in the background.

2. Unified Dataset Discovery & Deduplication

  • Introduced a single, shared helper method fetchMatchingDatasets to serve as the sole entry point for listing and filtering datasets.
  • Deduplicated dataset-listing logic across 7 metadata methods (getSchemas, getTables, getColumns, getProcedures, getProcedureColumns, getFunctions, and getFunctionColumns), ensuring consistent support for project discovery and SQL wildcard matching across all of them.

3. Robust Background Error Propagation & Deduplication

  • Refactored catch blocks across all 7 asynchronous metadata methods to consistently capture unexpected background thread exceptions and write them to the result set queue. This prevents silent empty-result failures, ensuring that any backend or network errors during metadata scans are properly propagated to the client as a SQLException.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors BigQueryDatabaseMetaData.java to clean up unused imports, correct catalog null-checks, consolidate dataset fetching logic into a helper method, and optimize getSchemas with a synchronous single-catalog path and a parallel multi-catalog path. The review feedback highlights several critical improvements: ensuring submitted futures are cancelled in the finally block of getSchemas to prevent resource leaks, wrapping project scans in a try-catch block to make sequential dataset fetching robust against single-project failures, restoring the thread's interrupted status when catching InterruptedException, and replacing an unnecessary Collections.synchronizedList with a plain ArrayList.

@keshavdandeva

Copy link
Copy Markdown
Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors metadata fetching in BigQueryDatabaseMetaData by consolidating dataset fetching into a helper method, introducing a synchronous path for single-catalog queries in getSchemas, and improving error handling. It also adds a null check in BigQueryJsonResultSet when cancelling tasks. The review comments point out three critical issues: a potential deadlock in the synchronous path of getSchemas when the schema count exceeds the bounded queue capacity, a compilation error due to a missing parameter in a BigQueryJsonResultSet.of call, and swallowed InterruptedException along with silent failures in the new fetchMatchingDatasets helper.

@keshavdandeva keshavdandeva marked this pull request as ready for review June 26, 2026 17:47
@keshavdandeva keshavdandeva requested review from a team as code owners June 26, 2026 17:47
private List<Dataset> fetchMatchingDatasets(
String catalog, String schemaPattern, Pattern schemaRegex) throws SQLException {
List<String> projects =
(catalog != null) ? Collections.singletonList(catalog) : getAccessibleCatalogNames();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

catalog="" is a special case, I think in BQ it should return an empty list since in BQ all datasets belong to some project.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is handled in all metadata methods already. We check if ((catalog != null && catalog.isEmpty()) and return empty resultset. So, any catalog with "" would not reach here

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a side note, while confirming this, I found determineEffectiveCatalogAndSchema method that is used to help the connection property isFilterTablesOnDefaultDataset is not exactly optimized/working as it should. Will refactor it in separate PR

Thread.currentThread().interrupt();
break;
} catch (Exception e) {
if (catalog != null) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why special case for this one? I don't see anything about it in spec, so 2 scenarios in my mind:

  • Service account has some access to projectA, but no BQ access to list datasets..
  • call getSchemas(null, null) -> returns empty result, no exception
  • call getSchemas(projectA, null) -> throws exception..

So I think we should unify the behavior.
Spec is saying only "SQLException - if a database access error occurs").. So I think we can throw exception unless we were able to successfully fetch all data.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, so the reason I did it like this was for the case when projectDiscovery is enabled. Basically:

  • Targeted queries (catalog != null): Fail-fast and throw a SQLException on failure.
  • Discovery queries (catalog == null): Resilient/best-effort. We skip individual project failures (e.g. permission/billing issues on sandbox/test projects) with a warning log, and return results from other accessible projects. This prevents a single inactive project from breaking the entire metadata call when using user credentials (ADC).

To unify this, which approach would you prefer we take?

  1. Fully Unified (Fail-Fast): Always throw SQLException if any project fails during the scan, regardless of whether it was a targeted or discovery query. (as you suggested above)
  2. Conditional Hybrid (Project Discovery Only): Only use the best-effort model (log warning and skip) if EnableProjectDiscovery=true If discovery is disabled, any failure on the primary catalog or manually configured AdditionalProjects will throw a SQLException

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants