[BUG MODEL]: mistral-ocr-latest structured annotation (strict json_schema) loops on number fields, truncating the JSON output #582

Description

@adamlarkem-wq

Model

mistral-medium-latest

Request Payload


POST /v1/ocr (via client.ocr.process), model mistral-ocr-latest (also reproduced on mistral-ocr-4-0), structured document annotation with strict: true and number fields:

{
  "model": "mistral-ocr-latest",
  "document": {
    "type": "document_url",
    "document_url": "data:application/pdf;base64,<PDF_BASE64>"
  },
  "include_image_base64": true,
  "include_blocks": true,
  "document_annotation_prompt": "Extract the invoice fields into the schema. Numbers as plain decimals.",
  "document_annotation_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "invoice",
      "strict": true,
      "schema": {
        "type": "object",
        "additionalProperties": false,
        "required": ["company", "invoice_number", "items"],
        "$defs": {
          "Item": {
            "type": "object",
            "additionalProperties": false,
            "properties": {
              "product_code":     {"anyOf": [{"type": "string"}, {"type": "null"}]},
              "description":       {"anyOf": [{"type": "string"}, {"type": "null"}]},
              "quantity":          {"anyOf": [{"type": "number"}, {"type": "null"}]},
              "unit_price":        {"anyOf": [{"type": "number"}, {"type": "null"}]},
              "line_total_price":  {"anyOf": [{"type": "number"}, {"type": "null"}]}
            }
          }
        },
        "properties": {
          "company":         {"type": "string"},
          "invoice_number":  {"type": "string"},
          "items":           {"type": "array", "items": {"$ref": "#/$defs/Item"}},
          "subtotal_amount": {"anyOf": [{"type": "number"}, {"type": "null"}]},
          "tax_amount":      {"anyOf": [{"type": "number"}, {"type": "null"}]},
          "total_amount":    {"anyOf": [{"type": "number"}, {"type": "null"}]}
        }
      }
    }
  }
}

Input: a single-page PDF invoice with decimal line amounts (attached: sample.pdf).
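For anyone reproducing this, the request body above can be assembled from a local PDF along these lines. This is a minimal sketch: `build_ocr_request` is a hypothetical helper, not part of the SDK, and the field names are simply copied from the payload shown above. The `strict` argument is exposed so the strict and non-strict runs can be compared with otherwise identical requests.

```python
import base64


def build_ocr_request(pdf_bytes: bytes, schema: dict,
                      model: str = "mistral-ocr-latest",
                      strict: bool = True) -> dict:
    """Build the /v1/ocr request body from raw PDF bytes (hypothetical helper)."""
    # The PDF is sent inline as a base64 data URL, as in the payload above.
    data_url = "data:application/pdf;base64," + base64.b64encode(pdf_bytes).decode("ascii")
    return {
        "model": model,
        "document": {"type": "document_url", "document_url": data_url},
        "include_image_base64": True,
        "include_blocks": True,
        "document_annotation_prompt": (
            "Extract the invoice fields into the schema. Numbers as plain decimals."
        ),
        "document_annotation_format": {
            "type": "json_schema",
            "json_schema": {"name": "invoice", "strict": strict, "schema": schema},
        },
    }
```

Calling it once with `strict=True` and once with `strict=False` on the same bytes and schema gives two bodies that differ only in that flag.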


Output

The returned document_annotation is truncated, invalid JSON: a number field is emitted as the correct value, then expands into its full float64 decimal representation followed by an endless run of digits, never closing the value (company/invoice anonymized; the runaway line_total_price is verbatim):

{
  "model": "mistral-ocr-latest",
  "document_annotation": "{\"company\": \"ACME ELECTRICAL\", \"invoice_number\": \"PO-000000\", \"items\": [{\"product_code\": \"99999\", \"description\": \"LIGHT 1200MM\", \"line_total_price\": 1487.500000000000255795384873613816452026367187500000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001402960000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000",
  "usage_info": {
    "pages_processed": 1,
    "doc_size_bytes": 45532
  }
}

Intended value was 1487.5. Output = 1487.5 + its float64 binary→decimal tail (...0255795384873613816...) + an unbounded digit run → the value and the rest of the document are never closed.
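On the client side, the failure can be told apart from other parse errors roughly as follows. This is a sketch only: `check_annotation` and the `MAX_NUMBER_CHARS` threshold are hypothetical names chosen for illustration, and the threshold is an assumption about what a plain-decimal invoice amount could reasonably need.

```python
import json
import re

# Assumed threshold: no legitimate invoice amount needs a literal this long.
MAX_NUMBER_CHARS = 40
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def check_annotation(raw: str):
    """Return (parsed, None) if raw is valid JSON, else (None, reason)."""
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as exc:
        # Look for the longest numeric literal in the broken output.
        longest = max(_NUMBER.findall(raw), key=len, default="")
        if len(longest) > MAX_NUMBER_CHARS:
            return None, (
                f"runaway number literal ({len(longest)} chars), "
                f"starts with {longest[:12]!r}"
            )
        return None, f"invalid JSON at char {exc.pos}: {exc.msg}"
```

Applied to the `document_annotation` string above, this reports a runaway literal beginning with `1487.5` rather than a generic decode error.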


Expected Behavior

Each number field is emitted as a finite decimal and the response is well-formed JSON conforming to the schema (e.g. "line_total_price": 1487.5).


Additional Context

Environment: mistralai Python SDK, endpoint ocr.process document annotations. Models mistral-ocr-latest (= mistral-ocr-2505) and mistral-ocr-4-0 both affected.

Reproduction conditions:

  • Reproduces only with strict: true. The same schema / model / prompt / PDF does not degenerate when strict is off, so this appears to be a constrained-decoding interaction with the unbounded JSON number grammar.
  • Requires number (float) fields in the schema and a PDF with decimal values.
  • Intermittent and document-dependent: ~70% of runs on one specific invoice, ~0% on other invoices, ~10% across a set of 20. With the same document and the same request, re-running eventually triggers it.
  • include_blocks / include_image_base64 have no effect.
  • The truncation always lands inside a number field.

How we found it: while extracting invoices at scale, we saw some documents intermittently return unparseable JSON. Re-running the exact same request on the same PDF reproduced it ~70% of the time on the worst document.

Reproduction note: the attached sample.pdf triggers it in ~70% of runs, so please re-run a few times if the first run is clean. Happy to share more sample documents privately.
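Because the failure is intermittent, a per-document failure rate is more useful than a single run. A rough way to measure it is sketched below: `failure_rate` is a hypothetical helper, and `fetch_annotation` stands for any zero-argument callable that issues the same request and returns the raw `document_annotation` string (for example, a small wrapper around `client.ocr.process`).

```python
import json
from typing import Callable


def failure_rate(fetch_annotation: Callable[[], str], runs: int = 20) -> float:
    """Fraction of runs whose annotation string is not parseable JSON."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    failures = 0
    for _ in range(runs):
        try:
            json.loads(fetch_annotation())
        except json.JSONDecodeError:
            # Truncated or runaway output ends up here.
            failures += 1
    return failures / runs
```

Running this once with strict enabled and once with it disabled, on the same document, is one way to check the strict-only claim above.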


Suggested Solutions

No response
