Solving Hierarchical List Parsing in Legal Documents (Without LLM Guessing)

When working with legal or regulatory documents (such as EU legislation), one of the deceptively hard problems is correctly modeling hierarchical lists. These documents are full of nested structures like:

  • numbered paragraphs (1., 2.),
  • lettered items ((a), (b)),
  • roman numerals ((i), (ii), (iii)),

often mixed with free-text paragraphs, definitions, and exceptions.

At first glance, this looks simple. In practice, it’s one of the main sources of downstream errors in search, retrieval, and AI-assisted answering.

The Core Problem

HTML representations of legal texts (e.g. EUR-Lex) are structurally inconsistent:

  • nesting depth is not reliable,
  • list items are often rendered using generic <div> grids,
  • numbering may reset visually without resetting the DOM hierarchy,
  • multiple paragraphs can belong to the same logical list item.

If you naïvely chunk text or rely on DOM depth alone, you end up with:

  • definitions split across chunks,
  • list items grouped incorrectly,
  • or worst of all: unrelated provisions merged together.

Once this happens, downstream agents or LLMs are forced to guess structure — which leads to hallucinations, missing conditions, or incorrect legal interpretations.

The Design Goal

The goal was not to “understand” the document using an LLM.

The goal was to:

  • encode the document’s logical structure deterministically at index time, so that:
    • list hierarchy is explicit,
    • grouping is stable,
    • and retrieval can be purely mechanical.

In other words: make the data correct so the AI doesn’t have to be clever.

The Key Concepts Introduced

1. Marker Classification with Explicit Levels

Each list marker is classified into a marker level:

  • Level 1 → numeric (1., 2.)
  • Level 2 → letter ((a), (b))
  • Level 3 → roman ((i), (ii))

Crucially, roman numerals are detected before single letters — otherwise (i) is easily misclassified as a letter.

2. Block Keys as the Source of Truth

Instead of relying on paragraph IDs or HTML structure, each logical list block gets a block key:

  • a stable hash representing “these chunks belong together”.

Rules:

  • Level-1 markers define containers (e.g. “The following definitions apply”).
  • Level-2 markers define semantic blocks (e.g. a single legal definition).
  • Level-3 markers inherit the block key from their parent Level-2 item.

This allows:

  • one definition = one block,
  • regardless of how many paragraphs or list items it spans.

3. Deterministic Ordering

Each chunk inside a block is assigned an ordinal_in_block.

This guarantees:

  • correct reconstruction order,
  • no reliance on token proximity or embedding similarity,
  • reproducible stitching every time.

Why This Matters

Once the indexer produces:

  • correct block keys,
  • correct marker levels,
  • stable ordering,

everything downstream becomes simpler:

  • Retrieval can fetch entire legal definitions with a single query.
  • Agents no longer mix conditions from different provisions.
  • Answers become complete, precise, and traceable.
  • Citations map cleanly to the exact legal paragraphs involved.

Most importantly: LLMs are no longer asked to infer document structure.

They only summarize or explain what is already correctly grouped.

Validation via the Index (Not Logs)

Instead of adding more debug logging, correctness was verified directly at the index level using structured queries:

  • “Show me everything with this block key.”
  • “Are all (i)–(iii) items in the same block?”
  • “Do (a) and (b) definitions stay separated?”

If the index answers those questions correctly, the pipeline is correct — no runtime heuristics required.

Takeaway

Hierarchical list parsing is not an LLM problem.

It’s a data modeling problem.

Once structure is captured deterministically during ingestion:

  • search becomes reliable,
  • answers become accurate,
  • and AI systems stop behaving “randomly” on legal texts.

The biggest win wasn’t better prompting —
it was making the index structurally honest.

That’s all folks!

Cheers!
Gašper Rupnik

{End.}

Leave a comment

Website Powered by WordPress.com.

Up ↑