Merged
77 changes: 52 additions & 25 deletions README.md
@@ -1,15 +1,64 @@
# Loading XML files into a relational database

`xml2db` is a Python package which allows parsing and loading XML files into a relational database. It handles complex
`xml2db` is a Python package that parses and loads XML files into a relational database. It handles complex
XML files which cannot be denormalized to flat tables, and works out of the box, without any custom mapping rules.

It can be used within an [Extract, Load, Transform](https://docs.getdbt.com/terms/elt) data pipeline pattern as it
allows loading XML files into a relational data model which is very close from the source data, yet easy to work with.
It fits naturally into an [Extract, Load, Transform](https://docs.getdbt.com/terms/elt) pipeline: it
loads XML files into a relational data model that stays close to the source data while remaining easy to query as
flat database tables. The raw data can then be transformed using [DBT](https://www.getdbt.com/), SQL views, or stored procedures to produce
more user-friendly tables.

Starting from an XSD schema which represents a given XML structure, `xml2db` builds a data model, i.e. a set of database
tables linked to each other by foreign key relationships. Then, it allows parsing and loading XML files into the
database, and getting them back from the database into XML format if needed.

This package uses `sqlalchemy` to interact with the database, so it should work with different database backends.
Automated integration tests run against PostgreSQL, MySQL, MS SQL Server and DuckDB. You may have to install additional
packages to connect to your database (e.g. `psycopg2` or `psycopg` for PostgreSQL, `pymysql` or `mysqlclient` for
MySQL, `pyodbc` for MS SQL Server, or `duckdb-engine` for DuckDB).

**Please read the [package documentation website](https://cre-dev.github.io/xml2db) for all the details!**

## Installation

The package can be installed, preferably in a virtual environment, using `pip`:

``` bash
pip install xml2db
```

## CLI

After installation, `xml2db` is available as a command-line tool with three subcommands.

Explore your XSD schema and configure the data model interactively in a browser:

```bash
xml2db serve path/to/schema.xsd
```

This opens a page with an Entity Relationship Diagram, source/target tree views, DDL output, and a live YAML config
editor with autocomplete.

Import an XML file directly from the command line:

```bash
xml2db import file.xml schema.xsd \
--connection-string "postgresql+psycopg2://user:pw@host/db" \
--config model_config.yml
```

Render the ERD, trees, or DDL to stdout or a file without starting a server:

```bash
xml2db render schema.xsd --format erd
xml2db render schema.xsd --format ddl --db-type postgresql
```

See the [CLI reference](https://cre-dev.github.io/xml2db/cli/) for all options.

## Python API

Loading XML files into a relational database with `xml2db` can be as simple as:

```python
@@ -28,28 +77,6 @@ document = data_model.parse_xml(
document.insert_into_target_tables()
```

The data model created by `xml2db` will be close to the XSD schema. However, `xml2db` will perform a few systematic
simplifications aimed at limiting the complexity of the resulting data model and the storage footprint. The resulting
data model can be configured, but the above code will work out of the box, with reasonable defaults.

The raw data loaded into the database can then be processed if need be, using for instance [DBT](https://www.getdbt.com/),
SQL views or stored procedures aimed at extracting, correcting and formatting the data into more user-friendly tables.

This package uses `sqlalchemy` to interact with the database, so it should work with different database backends.
Automated integration tests run against PostgreSQL, MySQL, MS SQL Server and DuckDB. You may have to install additional
packages to connect to your database (e.g. `psycopg2` or `psycopg` for PostgreSQL, `pymysql` or `mysqlclient` for
MySQL, `pyodbc` for MS SQL Server, or `duckdb-engine` for DuckDB).

**Please read the [package documentation website](https://cre-dev.github.io/xml2db) for all the details!**

## Installation

The package can be installed, preferably in a virtual environment, using `pip`:

``` bash
pip install xml2db
```

## Testing

Running the tests requires installing additional development dependencies, after cloning the repo, with:
70 changes: 1 addition & 69 deletions docs/api/overview.md
@@ -38,7 +38,7 @@ for lower level steps. It can be useful for advanced use cases, for instance:

* transforming the data in intermediate steps,
* adding logging,
* limiting concurrent access to the database within a multiprocess setup, etc.
* limiting concurrent access to the database in a [multiprocessing context](../how_it_works.md#multiprocessing), etc.

For those scenarios you can easily reimplement
[`Document.insert_into_target_tables`](document.md/#xml2db.document.Document.insert_into_target_tables) to suit your
@@ -59,74 +59,6 @@ flowchart TB
end
```

### Multiprocessing example

XML parsing is CPU-bound and scales well across processes. Loading into the
database, however, must be coordinated to avoid conflicts on shared tables.
The right level of synchronisation depends on the backend:

* **DuckDB (file-based)**: only one active writer is allowed at a time, so
all database I/O must be serialised.
* **PostgreSQL, MS SQL Server, …**: concurrent writes to *different* temp
tables are safe (each process gets a unique temp-table prefix), but the final
merge into the shared target tables should be serialised.

The simplest approach (and the one shown below) is to serialise the entire
database phase with a `multiprocessing.Lock`, keeping only the parsing step
parallel. This works correctly for all backends.

```python
import multiprocessing
from xml2db import DataModel


def load_one_file(xml_path, xsd_path, connection_string, lock):
    # Each process creates its own DataModel with a unique temp_prefix.
    model = DataModel(
        xsd_file=xsd_path,
        connection_string=connection_string,
    )
    # XML parsing is CPU-bound and runs in parallel across all processes.
    doc = model.parse_xml(xml_path)

    # Serialise all database I/O across processes.
    with lock:
        doc.insert_into_target_tables()
        model.engine.dispose()


if __name__ == "__main__":
    xsd_path = "schema.xsd"
    connection_string = "duckdb:///data.duckdb"
    xml_files = ["file1.xml", "file2.xml", "file3.xml"]

    lock = multiprocessing.Lock()
    processes = [
        multiprocessing.Process(
            target=load_one_file,
            args=(xml_path, xsd_path, connection_string, lock),
        )
        for xml_path in xml_files
    ]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
        if p.exitcode != 0:
            raise RuntimeError(f"Worker failed with exit code {p.exitcode}")
```

!!! Note
    For backends that support concurrent writers, you can increase throughput
    by splitting
    [`Document.insert_into_target_tables`](document.md/#xml2db.document.Document.insert_into_target_tables)
    into separate calls to
    [`Document.insert_into_temp_tables`](document.md/#xml2db.document.Document.insert_into_temp_tables)
    (run concurrently, since each process has a unique temp-table prefix, so
    there are no collisions) and
    [`Document.merge_into_target_tables`](document.md/#xml2db.document.Document.merge_into_target_tables)
    (serialised via lock).
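
A minimal sketch of that split, for backends that support concurrent writers. It assumes both `Document` methods can be called without arguments, and it leaves out the cleanup of temp tables after a failure. The helper name `load_in_two_phases` is hypothetical and not part of `xml2db`:

```python
def load_in_two_phases(doc, merge_lock):
    """Load one parsed document, serialising only the final merge.

    `doc` is assumed to behave like an xml2db `Document`; `merge_lock` is
    any lock shared by the workers, e.g. a `multiprocessing.Lock`.
    """
    # Temp tables carry a per-process prefix, so this phase may overlap
    # with the same phase running in other workers.
    doc.insert_into_temp_tables()

    # Target tables are shared, so only one worker merges at a time.
    with merge_lock:
        doc.merge_into_target_tables()
```

In a worker such as `load_one_file`, this call would take the place of the `with lock:` block around `doc.insert_into_target_tables()`. For file-based DuckDB, keep the whole database phase under the lock instead.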

## *Advanced use:* get data from the database back to XML

The flow chart below presents data conversions used to get back data from the database into XML, showing the functions
116 changes: 116 additions & 0 deletions docs/cli.md
@@ -0,0 +1,116 @@
---
title: "CLI usage"
description: "Reference for the xml2db command-line interface: import XML files, render ERDs and DDL, and launch the interactive browser explorer."
---

# CLI usage

The `xml2db` CLI provides three subcommands: `import`, `render`, and `serve`.

## xml2db import

Parse an XML file and load it into a database.

```
xml2db import XML_FILE XSD_FILE --connection-string DSN [options]
```

**Positional arguments:**

| Argument | Description |
|---|---|
| `XML_FILE` | Path to the XML file to import |
| `XSD_FILE` | Path to the XSD schema file |

**Options:**

| Option | Description |
|---|---|
| `--connection-string DSN`, `-d DSN` | SQLAlchemy connection string (required) |
| `--config FILE`, `-c FILE` | YAML model config file |
| `--db-schema SCHEMA` | Database schema to use |
| `--metadata KEY=VALUE`, `-m KEY=VALUE` | Metadata values for `metadata_columns` (repeatable) |
| `--short-name NAME` | Data model short name (default: `DocumentRoot`) |
| `--no-iterparse` | Use the recursive parser instead of iterparse (higher memory usage) |
| `--recover` | Attempt to parse malformed XML |
| `--validate` | Validate the XML against the schema before importing |

**Example:**

```bash
xml2db import file.xml schema.xsd \
--connection-string "postgresql+psycopg2://user:pw@host/db" \
--config model_config.yml \
--metadata source=file.xml
```

On success, the command prints the number of rows inserted and the number of rows that already existed (deduplicated), along with per-phase timings.
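
Since `xml2db import` takes a single `XML_FILE`, loading a whole directory means calling it once per file. The sketch below builds the argument list from the options documented above and runs it with `subprocess`. The helper names are hypothetical, and it assumes the command exits with a non-zero status on failure, which `check=True` turns into an exception:

```python
import subprocess
from pathlib import Path


def build_import_command(xml_path, xsd_path, connection_string,
                         config_path=None, metadata=None):
    """Return the argv list for one `xml2db import` call."""
    cmd = [
        "xml2db", "import", str(xml_path), str(xsd_path),
        "--connection-string", connection_string,
    ]
    if config_path is not None:
        cmd += ["--config", str(config_path)]
    # --metadata is repeatable: one KEY=VALUE pair per occurrence.
    for key, value in (metadata or {}).items():
        cmd += ["--metadata", f"{key}={value}"]
    return cmd


def import_directory(xml_dir, xsd_path, connection_string, config_path=None):
    """Import every .xml file found in xml_dir, one CLI call per file."""
    for xml_path in sorted(Path(xml_dir).glob("*.xml")):
        cmd = build_import_command(
            xml_path, xsd_path, connection_string, config_path=config_path
        )
        # Assumption: a failed import returns a non-zero exit code.
        subprocess.run(cmd, check=True)
```

Any key passed through `metadata` has to match a column declared under `metadata_columns` in the model config.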

## xml2db render

Print an ERD, source/target tree, or DDL to stdout or a file, without starting a server.

```
xml2db render XSD_FILE [options]
```

**Positional arguments:**

| Argument | Description |
|---|---|
| `XSD_FILE` | Path to the XSD schema file |

**Options:**

| Option | Description |
|---|---|
| `--config FILE`, `-c FILE` | YAML model config file |
| `--db-names` | Use physical database identifiers in the ERD instead of logical names |
| `--db-type BACKEND` | Database backend for DDL output (`postgresql`, `mssql`, `mysql`, ...) |
| `--format FORMAT`, `-f FORMAT` | Output format: `erd` (default), `target-tree`, `source-tree`, or `ddl` |
| `--output FILE`, `-o FILE` | Write output to a file instead of stdout |
| `--short-name NAME` | Data model short name (default: `DocumentRoot`) |

**Examples:**

```bash
xml2db render schema.xsd --format erd
xml2db render schema.xsd --format target-tree
xml2db render schema.xsd --format source-tree
xml2db render schema.xsd --format ddl --db-type postgresql
xml2db render schema.xsd --format erd --output diagram.md
```

## xml2db serve

Launch an interactive schema explorer in the browser.

```
xml2db serve XSD_FILE [options]
```

The explorer shows four tabs: ERD, target tree, source tree, and DDL. The left panel is a YAML config editor with autocomplete for table names, field names, and all config options. Edits trigger an automatic rebuild. The **Save** button writes the config back to disk.

**Positional arguments:**

| Argument | Description |
|---|---|
| `XSD_FILE` | Path to the XSD schema file |

**Options:**

| Option | Description |
|---|---|
| `--config FILE`, `-c FILE` | YAML model config file to load on startup; Save writes it back to this path (default: `model_config.yml`) |
| `--db-type BACKEND` | Database backend for the DDL tab (`postgresql`, `mssql`, `mysql`, ...) |
| `--no-browser` | Do not open the browser automatically |
| `--port PORT`, `-p PORT` | HTTP port (default: `8765`) |
| `--short-name NAME` | Data model short name (default: `DocumentRoot`) |

**Example:**

```bash
xml2db serve schema.xsd --config model_config.yml --db-type postgresql
```

See [Getting started](getting_started.md) for a walkthrough of the explorer.
16 changes: 8 additions & 8 deletions docs/configuring.md
@@ -80,6 +80,9 @@ model_config = load_config("model_config.yml")

The following options can be passed as top-level keys of the model configuration `dict`:

* `as_columnstore` (`bool`): for MS SQL Server, create clustered columnstore indexes on all tables. This can also be set
at the table level for each table. However, for `n-n` relationship tables, this option is the only way to configure the
clustered columnstore indexes. The default value is `False` (disabled).
* `document_tree_hook` (`Callable`): sets a hook function which can modify the data extracted from the XML. It gives direct
access to the underlying tree data structure just before it is extracted to be loaded to the database. This can be used,
for instance, to prune or modify some parts of the document tree before loading it into the database. The document tree
@@ -89,24 +92,21 @@ the declarative [`"transform": "skip"`](#skipping-fields) option is simpler.
similar to `document_tree_hook`, but it is called as soon as a node is completed, not waiting for the entire parsing to
finish. It is especially useful if you intend to filter out some nodes and reduce memory footprint while parsing. For
straightforward field exclusion, see [`"transform": "skip"`](#skipping-fields).
* `row_numbers` (`bool`): adds `xml2db_row_number` columns either to `n-n` relationships tables, or directly to data tables when
deduplication of rows is opted out. This allows recording the original order of elements in the source XML, which is not
always respected otherwise. It was implemented primarily for round-trip tests, but could serve other purposes. The
default value is `False` (disabled).
* `as_columnstore` (`bool`): for MS SQL Server, create clustered columnstore indexes on all tables. This can be also set up at
the table level for each table. However, for `n-n` relationships tables, this option is the only way to configure the
clustered columnstore indexes. The default value is `False` (disabled).
* `metadata_columns` (`list`): a list of extra columns that you want to add to the root table of your model. This is
useful for instance to add the name of the file which has been parsed, or a timestamp, etc. Columns should be specified
as dicts, the only required keys are `name` and `type` (a SQLAlchemy type object); other keys will be passed directly
as keyword arguments to `sqlalchemy.Column`. Actual values need to be passed to
[`DataModel.parse_xml`](api/data_model.md#xml2db.model.DataModel.parse_xml) for each
parsed document, as a `dict`, using the `metadata` argument.
* `transform` (`false` or `"auto"`): set to `false` to disable all automatic field transformations globally: no joining of multi-value columns, no elevation of child tables, no collapsing of choice groups. The default `"auto"` applies all of these where applicable. Per-field `transform` and per-table `choice_transform` still override the global setting.
* `record_hash_column_name`: the column name to use to store records hash data (defaults to `xml2db_record_hash`).
* `record_hash_constructor`: a function used to build a hash, with a signature similar to `hashlib` constructor
functions (defaults to `hashlib.sha1`).
* `record_hash_size`: the byte size of the record hash (defaults to 20, which is the size of a `sha-1` hash).
* `row_numbers` (`bool`): adds `xml2db_row_number` columns either to `n-n` relationship tables, or directly to data tables when
row deduplication is disabled. This records the original order of elements in the source XML, which is not
always preserved otherwise. It was implemented primarily for round-trip tests, but could serve other purposes. The
default value is `False` (disabled).
* `transform` (`false` or `"auto"`): set to `false` to disable all automatic field transformations globally: no joining of multi-value columns, no elevation of child tables, no collapsing of choice groups. The default `"auto"` applies all of these where applicable. Per-field `transform` and per-table `choice_transform` still override the global setting.
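
As an illustration of the top-level keys listed above, here is a sketch of a configuration `dict` that switches the record hash to SHA-256 and enables row numbers. The column name `record_hash` is an arbitrary choice for this example, and the hash size is read from the constructor so the two settings cannot drift apart:

```python
import hashlib

# Any hashlib-style constructor can be used; SHA-256 is only an example.
hash_constructor = hashlib.sha256

model_config = {
    # Column that stores the record hash (default: xml2db_record_hash).
    "record_hash_column_name": "record_hash",
    "record_hash_constructor": hash_constructor,
    # Byte size of the digest: 32 for SHA-256, 20 for the default SHA-1.
    "record_hash_size": hash_constructor().digest_size,
    # Keep the original element order in xml2db_row_number columns.
    "row_numbers": True,
}
```

The resulting `dict` is what gets passed as the model configuration when the data model is built.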

## Fields configuration

7 changes: 4 additions & 3 deletions docs/getting_started.md
@@ -39,7 +40,8 @@ This opens a browser with four tabs:
- **Source tree**: a text tree of the raw XSD structure before simplification
- **DDL**: `CREATE TABLE` statements for the target schema

The left panel is a YAML config editor with autocomplete for table names, field names, and all config options. Edit the config and the diagram updates automatically (with a short debounce). When the config looks right, click **Save** to write it to a file (default: `model_config.yml`).
The left panel is a YAML config editor with autocomplete for table names, field names, and all config options. Edit the config and the diagram updates automatically.
When the config looks right, click **Save** to write it to a file (default: `model_config.yml`).

You can also render these representations directly to stdout or a file without the browser:

@@ -54,7 +55,7 @@ See [Configuring your data model](configuring.md) for a full description of the

## Importing XML files

Once the data model looks right, import an XML file into the database:
Once you are happy with the data model, import an XML file into the database (config is optional):

``` bash
xml2db import file.xml schema.xsd \
@@ -96,7 +97,7 @@ with open("data_model_erd.md", "w") as f:
f.write(data_model.get_entity_rel_diagram())
```

The diagram uses [Mermaid](https://mermaid.js.org/syntax/entityRelationshipDiagram.html). PyCharm and GitHub both render Mermaid natively.
The diagram uses [Mermaid](https://mermaid.js.org/syntax/entityRelationshipDiagram.html). Your IDE should be able to render a Mermaid preview.

``` py title="Write source and target trees to files" linenums="1"
with open("source_tree.txt", "w") as f: