table – RaspeR87's Blog

A Practical Guide to Handling Large Legal Tables in RAG Pipelines

When working with legal documents, especially EU legislation published on EUR-Lex, you quickly run into a hard problem: tables.

Not small tables.
Not friendly tables.
But multi-page tables with hundreds of rows, buried inside 300+ page PDFs and translated into 20+ languages.

If you are building a Retrieval-Augmented Generation (RAG) system, naïvely embedding these tables almost always fails. You end up with embeddings that contain nothing more than:

“Table 1”

…and none of the actual data users are searching for.

This post describes a production-grade approach to handling large legal tables in a RAG pipeline, based on real issues encountered while indexing EU regulations (e.g. Regulation (EC) No 1333/2008).

The Core Problem

Let’s start with a real example from EUR-Lex:

ANNEX III
PART 6
Table 1 — Definitions of groups of food additives

The table itself contains hundreds of rows like:

E 170 — Calcium carbonate
E 260 — Acetic acid
E 261 — Potassium acetates
…

What goes wrong in many pipelines

The table heading (“Table 1”) is detected as a section.
The actual <table> element is ignored or stored separately.
Embeddings are generated from the heading text only.

Result:

Embedding text length: 7
Embedding content: "Table 1"

The data is visible on the rendered page, but none of it reaches the embedding.
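To make the failure mode concrete, here is a minimal sketch using only Python's standard-library HTML parser. The HTML fragment, the `TableText` class and the `extract_rows` helper are hypothetical illustrations, not the markup EUR-Lex actually serves or the code used in the pipeline. It contrasts the heading-only text with text that also carries the table rows.

```python
from html.parser import HTMLParser

# Hypothetical fragment; real EUR-Lex markup is considerably more complex.
HTML = """
<p class="heading">Table 1</p>
<table>
  <tr><td>E 170</td><td>Calcium carbonate</td></tr>
  <tr><td>E 260</td><td>Acetic acid</td></tr>
</table>
"""


class TableText(HTMLParser):
    """Collects the cell text of every <tr> as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append([cell.strip() for cell in self._row])
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data


def extract_rows(html):
    """Return all table rows in the fragment as lists of cell strings."""
    parser = TableText()
    parser.feed(html)
    parser.close()
    return parser.rows


# What the broken pipeline embeds: the heading and nothing else.
heading_only = "Table 1"

# What it should embed: the heading plus the flattened rows.
rows = extract_rows(HTML)
with_rows = heading_only + "\n" + "\n".join(" | ".join(r) for r in rows)

print(len(heading_only))
print(with_rows)
```

The only difference between the two strings is whether the `<table>` element was read at all, which is exactly the step that gets skipped when the heading is treated as the whole section.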

Design Goals

We defined a few non-negotiable goals:

The table must be searchable: queries like “E170 calcium carbonate” must hit the table.
IDs must be stable and human-readable: ANNEX_III_PART_6_TABLE_1 is better than _TBL0.
Structured data must be preserved: we want JSON rows for precise answering, not just text.
Embeddings must stay within model input limits: some tables have hundreds of rows, so they cannot go into a single embedding.
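The goals above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline's actual code: `stable_id` and `chunk_rows` are hypothetical helper names, the row keys `e_number` and `name` are invented for the example, and a character budget (`max_chars`) stands in for whatever token limit the embedding model really imposes.

```python
import json
import re


def stable_id(*parts):
    """Join heading parts into an ID such as ANNEX_III_PART_6_TABLE_1."""
    tokens = []
    for part in parts:
        tokens += re.findall(r"[A-Za-z0-9]+", part)
    return "_".join(token.upper() for token in tokens)


def chunk_rows(table_id, rows, max_chars=1000):
    """Group rows into chunks whose row text stays within max_chars.

    max_chars is an assumed stand-in for a real token limit. Each chunk
    keeps both the text to embed and the original structured rows.
    """
    groups, current, size = [], [], 0
    for row in rows:
        line = f'{row["e_number"]} - {row["name"]}'
        # Start a new chunk when adding this line would exceed the budget.
        if current and size + len(line) + 1 > max_chars:
            groups.append(current)
            current, size = [], 0
        current.append((line, row))
        size += len(line) + 1  # +1 for the newline separator
    if current:
        groups.append(current)
    return [
        {
            "id": f"{table_id}_CHUNK_{i}",
            # Prefix every chunk with the table ID so it carries context.
            "text": table_id + "\n" + "\n".join(line for line, _ in group),
            "rows": [row for _, row in group],
        }
        for i, group in enumerate(groups, start=1)
    ]


table_id = stable_id("ANNEX III", "PART 6", "Table 1")
rows = [
    {"e_number": "E 170", "name": "Calcium carbonate"},
    {"e_number": "E 260", "name": "Acetic acid"},
    {"e_number": "E 261", "name": "Potassium acetates"},
]
chunks = chunk_rows(table_id, rows, max_chars=50)
print(json.dumps(chunks, indent=2))
```

Each chunk carries a readable ID derived from the headings, the text that would be embedded, and the JSON rows that can be used for precise answers after retrieval. Note that the table writes “E 170” while a user may type “E170”, so some normalization of E-numbers on the query or index side is likely needed as well; that step is not shown here.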

