cre-dev · martinv13 · Jun 26, 2026 · Jun 25, 2026 · Jun 25, 2026 · Jun 25, 2026
diff --git a/README.md b/README.md
@@ -1,15 +1,64 @@
 # Loading XML files into a relational database
 
-`xml2db` is a Python package which allows parsing and loading XML files into a relational database. It handles complex 
+`xml2db` is a Python package that parses and loads XML files into a relational database. It handles complex 
 XML files which cannot be denormalized to flat tables, and works out of the box, without any custom mapping rules.
 
-It can be used within an [Extract, Load, Transform](https://docs.getdbt.com/terms/elt) data pipeline pattern as it 
-allows loading XML files into a relational data model which is very close from the source data, yet easy to work with.
+It fits naturally into an [Extract, Load, Transform](https://docs.getdbt.com/terms/elt) pipeline: it 
+loads XML files into a relational data model that stays close to the source data while remaining easy to query as 
+flat database tables. The raw data can then be transformed using [DBT](https://www.getdbt.com/), SQL views, or stored procedures to produce
+more user-friendly tables.
 
 Starting from an XSD schema which represents a given XML structure, `xml2db` builds a data model, i.e. a set of database 
 tables linked to each other by foreign keys relationships. Then, it allows parsing and loading XML files into the 
 database, and getting them back from the database into XML format if needed.
 
+This package uses `sqlalchemy` to interact with the database, so it should work with different database backends. 
+Automated integration tests run against PostgreSQL, MySQL, MS SQL Server and DuckDB. You may have to install additional 
+packages to connect to your database (e.g. `psycopg2` or `psycopg` for PostgreSQL, `pymysql` or `mysqlclient` for
+MySQL, `pyodbc` for MS SQL Server, or `duckdb-engine` for DuckDB).
+
+**Please read the [package documentation website](https://cre-dev.github.io/xml2db) for all the details!**
+
+## Installation
+
+The package can be installed, preferably in a virtual environment, using `pip`:
+
+``` bash
+pip install xml2db
+```
+
+## CLI
+
+After installation, `xml2db` is available as a command-line tool with three subcommands.
+
+Explore your XSD schema and configure the data model interactively in a browser:
+
+```bash
+xml2db serve path/to/schema.xsd
+```
+
+This opens a page with an Entity Relationship Diagram, source/target tree views, DDL output, and a live YAML config 
+editor with autocomplete.
+
+Import an XML file directly from the command line:
+
+```bash
+xml2db import file.xml schema.xsd \
+    --connection-string "postgresql+psycopg2://user:pw@host/db" \
+    --config model_config.yml
+```
+
+Render the ERD, trees, or DDL to stdout or a file without starting a server:
+
+```bash
+xml2db render schema.xsd --format erd
+xml2db render schema.xsd --format ddl --db-type postgresql
+```
+
+See the [CLI reference](https://cre-dev.github.io/xml2db/cli/) for all options.
+
+## Python API
+
 Loading XML files into a relational database with `xml2db` can be as simple as:
 
 ```python
@@ -28,28 +77,6 @@ document = data_model.parse_xml(
 document.insert_into_target_tables()
 ```
 
-The data model created by `xml2db` will be close to the XSD schema. However, `xml2db` will perform a few systematic 
-simplifications aimed at limiting the complexity of the resulting data model and the storage footprint. The resulting 
-data model can be configured, but the above code will work out of the box, with reasonable defaults.
-
-The raw data loaded into the database can then be processed if need be, using for instance [DBT](https://www.getdbt.com/),
-SQL views or stored procedures aimed at extracting, correcting and formatting the data into more user-friendly tables.
-
-This package uses `sqlalchemy` to interact with the database, so it should work with different database backends. 
-Automated integration tests run against PostgreSQL, MySQL, MS SQL Server and DuckDB. You may have to install additional 
-packages to connect to your database (e.g. `psycopg2` or `psycopg` for PostgreSQL, `pymysql` or `mysqlclient` for
-MySQL, `pyodbc` for MS SQL Server, or `duckdb-engine` for DuckDB).
-
-**Please read the [package documentation website](https://cre-dev.github.io/xml2db) for all the details!**
-
-## Installation
-
-The package can be installed, preferably in a virtual environment, using `pip`:
-
-``` bash
-pip install xml2db
-```
-
 ## Testing
 
 Running the tests requires installing additional development dependencies, after cloning the repo, with:

diff --git a/docs/api/overview.md b/docs/api/overview.md
@@ -38,7 +38,7 @@ for lower level steps. It can be useful for advanced use cases, for instance:
 
 * transforming the data in intermediate steps,
 * adding logging,
-* limiting concurrent access to the database within a multiprocess setup, etc.
+* limiting concurrent access to the database in a [multiprocessing context](../how_it_works.md#multiprocessing), etc.
 
 For those scenarios you can easily reimplement 
 [`Document.insert_into_target_tables`](document.md/#xml2db.document.Document.insert_into_target_tables) to suit your 
@@ -59,74 +59,6 @@ flowchart TB
     end
 ```
 
-### Multiprocessing example
-
-XML parsing is CPU-bound and scales well across processes. Loading into the
-database, however, must be coordinated to avoid conflicts on shared tables.
-The right level of synchronisation depends on the backend:
-
-* **DuckDB (file-based)**: only one active writer is allowed at a time, so
-  all database I/O must be serialised.
-* **PostgreSQL, MS SQL Server, …**: concurrent writes to *different* temp
-  tables are safe (each process gets a unique temp-table prefix), but the final
-  merge into the shared target tables should be serialised.
-
-The simplest approach (and the one shown below) is to serialise the entire
-database phase with a `multiprocessing.Lock`, keeping only the parsing step
-parallel. This works correctly for all backends.
-
-```python
-import multiprocessing
-from xml2db import DataModel
-
-
-def load_one_file(xml_path, xsd_path, connection_string, lock):
-    # Each process creates its own DataModel with a unique temp_prefix.
-    model = DataModel(
-        xsd_file=xsd_path,
-        connection_string=connection_string,
-    )
-    # XML parsing is CPU-bound and runs in parallel across all processes.
-    doc = model.parse_xml(xml_path)
-
-    # Serialise all database I/O across processes.
-    with lock:
-        doc.insert_into_target_tables()
-        model.engine.dispose()
-
-
-if __name__ == "__main__":
-    xsd_path = "schema.xsd"
-    connection_string = "duckdb:///data.duckdb"
-    xml_files = ["file1.xml", "file2.xml", "file3.xml"]
-
-    lock = multiprocessing.Lock()
-    processes = [
-        multiprocessing.Process(
-            target=load_one_file,
-            args=(xml_path, xsd_path, connection_string, lock),
-        )
-        for xml_path in xml_files
-    ]
-    for p in processes:
-        p.start()
-    for p in processes:
-        p.join()
-        if p.exitcode != 0:
-            raise RuntimeError(f"Worker failed with exit code {p.exitcode}")
-```
-
-!!! Note
-    For backends that support concurrent writers, you can increase throughput
-    by splitting
-    [`Document.insert_into_target_tables`](document.md/#xml2db.document.Document.insert_into_target_tables)
-    into separate calls to
-    [`Document.insert_into_temp_tables`](document.md/#xml2db.document.Document.insert_into_temp_tables)
-    (run concurrently, since each process has a unique temp-table prefix, so
-    there are no collisions) and
-    [`Document.merge_into_target_tables`](document.md/#xml2db.document.Document.merge_into_target_tables)
-    (serialised via lock).
-
 ## *Advanced use:* get data from the database back to XML
 
 The flow chart below presents data conversions used to get back data from the database into XML, showing the functions 

diff --git a/docs/cli.md b/docs/cli.md
@@ -0,0 +1,116 @@
+---
+title: "CLI usage"
+description: "Reference for the xml2db command-line interface: import XML files, render ERDs and DDL, and launch the interactive browser explorer."
+---
+
+# CLI usage
+
+The `xml2db` CLI provides three subcommands: `import`, `render`, and `serve`.
+
+## xml2db import
+
+Parse an XML file and load it into a database.
+
+```
+xml2db import XML_FILE XSD_FILE --connection-string DSN [options]
+```
+
+**Positional arguments:**
+
+| Argument | Description |
+|---|---|
+| `XML_FILE` | Path to the XML file to import |
+| `XSD_FILE` | Path to the XSD schema file |
+
+**Options:**
+
+| Option | Description |
+|---|---|
+| `--connection-string DSN`, `-d DSN` | SQLAlchemy connection string (required) |
+| `--config FILE`, `-c FILE` | YAML model config file |
+| `--db-schema SCHEMA` | Database schema to use |
+| `--metadata KEY=VALUE`, `-m KEY=VALUE` | Metadata values for `metadata_columns` (repeatable) |
+| `--short-name NAME` | Data model short name (default: `DocumentRoot`) |
+| `--no-iterparse` | Use the recursive parser instead of iterparse (higher memory usage) |
+| `--recover` | Attempt to parse malformed XML |
+| `--validate` | Validate the XML against the schema before importing |
+
+**Example:**
+
+```bash
+xml2db import file.xml schema.xsd \
+    --connection-string "postgresql+psycopg2://user:pw@host/db" \
+    --config model_config.yml \
+    --metadata source=file.xml
+```
+
+On success, the command prints the number of rows inserted and the number of rows that already existed (deduplicated), along with per-phase timings.
+
+## xml2db render
+
+Print an ERD, source/target tree, or DDL to stdout or a file, without starting a server.
+
+```
+xml2db render XSD_FILE [options]
+```
+
+**Positional arguments:**
+
+| Argument | Description |
+|---|---|
+| `XSD_FILE` | Path to the XSD schema file |
+
+**Options:**
+
+| Option | Description |
+|---|---|
+| `--config FILE`, `-c FILE` | YAML model config file |
+| `--db-names` | Use physical database identifiers in the ERD instead of logical names |
+| `--db-type BACKEND` | Database backend for DDL output (`postgresql`, `mssql`, `mysql`, ...) |
+| `--format FORMAT`, `-f FORMAT` | Output format: `erd` (default), `target-tree`, `source-tree`, or `ddl` |
+| `--output FILE`, `-o FILE` | Write output to a file instead of stdout |
+| `--short-name NAME` | Data model short name (default: `DocumentRoot`) |
+
+**Examples:**
+
+```bash
+xml2db render schema.xsd --format erd
+xml2db render schema.xsd --format target-tree
+xml2db render schema.xsd --format source-tree
+xml2db render schema.xsd --format ddl --db-type postgresql
+xml2db render schema.xsd --format erd --output diagram.md
+```
+
+## xml2db serve
+
+Launch an interactive schema explorer in the browser.
+
+```
+xml2db serve XSD_FILE [options]
+```
+
+The explorer shows four tabs: ERD, target tree, source tree, and DDL. The left panel is a YAML config editor with autocomplete for table names, field names, and all config options. Edits trigger an automatic rebuild. The **Save** button writes the config back to disk.
+
+**Positional arguments:**
+
+| Argument | Description |
+|---|---|
+| `XSD_FILE` | Path to the XSD schema file |
+
+**Options:**
+
+| Option | Description |
+|---|---|
+| `--config FILE`, `-c FILE` | YAML model config file to load on startup; Save writes it back to this path (default: `model_config.yml`) |
+| `--db-type BACKEND` | Database backend for the DDL tab (`postgresql`, `mssql`, `mysql`, ...) |
+| `--no-browser` | Do not open the browser automatically |
+| `--port PORT`, `-p PORT` | HTTP port (default: `8765`) |
+| `--short-name NAME` | Data model short name (default: `DocumentRoot`) |
+
+**Example:**
+
+```bash
+xml2db serve schema.xsd --config model_config.yml --db-type postgresql
+```
+
+See [Getting started](getting_started.md) for a walkthrough of the explorer.
diff --git a/docs/configuring.md b/docs/configuring.md
@@ -80,6 +80,9 @@ model_config = load_config("model_config.yml")
 
 The following options can be passed as top-level keys of the model configuration `dict`:
 
+* `as_columnstore` (`bool`): for MS SQL Server, create clustered columnstore indexes on all tables. This can also be set
+at the table level for each table. However, for `n-n` relationship tables, this option is the only way to configure
+clustered columnstore indexes. The default value is `False` (disabled).
 * `document_tree_hook` (`Callable`): sets a hook function which can modify the data extracted from the XML. It gives direct
 access to the underlying tree data structure just before it is extracted to be loaded to the database. This can be used,
 for instance, to prune or modify some parts of the document tree before loading it into the database. The document tree
@@ -89,24 +92,21 @@ the declarative [`"transform": "skip"`](#skipping-fields) option is simpler.
 similar with `document_tree_hook`, but it is called as soon as a node is completed, not waiting for the entire parsing to
 finish. It is especially useful if you intend to filter out some nodes and reduce memory footprint while parsing. For
 straightforward field exclusion, see [`"transform": "skip"`](#skipping-fields).
-* `row_numbers` (`bool`): adds `xml2db_row_number` columns either to `n-n` relationships tables, or directly to data tables when 
-deduplication of rows is opted out. This allows recording the original order of elements in the source XML, which is not
-always respected otherwise. It was implemented primarily for round-trip tests, but could serve other purposes. The 
-default value is `False` (disabled).
-* `as_columnstore` (`bool`): for MS SQL Server, create clustered columnstore indexes on all tables. This can be also set up at
-the table level for each table. However, for `n-n` relationships tables, this option is the only way to configure the
-clustered columnstore indexes. The default value is `False` (disabled).
 * `metadata_columns` (`list`): a list of extra columns that you want to add to the root table of your model. This is
 useful for instance to add the name of the file which has been parsed, or a timestamp, etc. Columns should be specified
 as dicts, the only required keys are `name` and `type` (a SQLAlchemy type object); other keys will be passed directly
 as keyword arguments to `sqlalchemy.Column`. Actual values need to be passed to 
 [`DataModel.parse_xml`](api/data_model.md#xml2db.model.DataModel.parse_xml) for each 
 parsed documents, as a `dict`, using the `metadata` argument.
-* `transform` (`false` or `"auto"`): set to `false` to disable all automatic field transformations globally: no joining of multi-value columns, no elevation of child tables, no collapsing of choice groups. The default `"auto"` applies all of these where applicable. Per-field `transform` and per-table `choice_transform` still override the global setting.
 * `record_hash_column_name`: the column name to use to store records hash data (defaults to `xml2db_record_hash`).
 * `record_hash_constructor`: a function used to build a hash, with a signature similar to `hashlib` constructor 
 functions (defaults to `hashlib.sha1`).
 * `record_hash_size`: the byte size of the record hash (defaults to 20, which is the size of a `sha-1` hash).
+* `row_numbers` (`bool`): adds `xml2db_row_number` columns either to `n-n` relationship tables, or directly to data tables when
+row deduplication is disabled. This records the original order of elements in the source XML, which is otherwise not
+always preserved. It was implemented primarily for round-trip tests, but could serve other purposes. The
+default value is `False` (disabled).
+* `transform` (`false` or `"auto"`): set to `false` to disable all automatic field transformations globally: no joining of multi-value columns, no elevation of child tables, no collapsing of choice groups. The default `"auto"` applies all of these where applicable. Per-field `transform` and per-table `choice_transform` still override the global setting.
 
 ## Fields configuration
 

diff --git a/docs/getting_started.md b/docs/getting_started.md
@@ -39,7 +39,8 @@ This opens a browser with four tabs:
 - **Source tree**: a text tree of the raw XSD structure before simplification
 - **DDL**: `CREATE TABLE` statements for the target schema
 
-The left panel is a YAML config editor with autocomplete for table names, field names, and all config options. Edit the config and the diagram updates automatically (with a short debounce). When the config looks right, click **Save** to write it to a file (default: `model_config.yml`).
+The left panel is a YAML config editor with autocomplete for table names, field names, and all config options. Edit the config and the diagram updates automatically. 
+When the config looks right, click **Save** to write it to a file (default: `model_config.yml`).
 
 You can also render these representations directly to stdout or a file without the browser:
 
@@ -54,7 +55,7 @@ See [Configuring your data model](configuring.md) for a full description of the
 
 ## Importing XML files
 
-Once the data model looks right, import an XML file into the database:
+Once you are happy with the data model, import an XML file into the database (the `--config` option can be omitted):
 
 ``` bash
 xml2db import file.xml schema.xsd \
@@ -96,7 +97,7 @@ with open("data_model_erd.md", "w") as f:
     f.write(data_model.get_entity_rel_diagram())
 ```
 
-The diagram uses [Mermaid](https://mermaid.js.org/syntax/entityRelationshipDiagram.html). PyCharm and GitHub both render Mermaid natively.
+The diagram uses [Mermaid](https://mermaid.js.org/syntax/entityRelationshipDiagram.html). Your IDE should be able to render a Mermaid preview.
 
 ``` py title="Write source and target trees to files" linenums="1"
 with open("source_tree.txt", "w") as f: