Methodology

How docx-corpus classifies documents

Two-dimensional taxonomy. Each document is labeled independently along two dimensions: document_type (form) and topic (subject). With 10 document types and 9 topics, 90 combinations are possible.

Document types · 10

ID	Label	Covers
legal	Legal	Contracts, NDAs, terms, regulations
forms	Forms & Applications	Applications, surveys, ballots
reports	Reports & Analysis	Annual reports, research papers, case studies
policies	Policies & Procedures	Privacy policies, handbooks, SOPs
educational	Educational	Syllabi, lesson plans, theses
correspondence	Correspondence	Letters, memos, press releases
technical	Technical Docs	Manuals, API docs, specifications
administrative	Administrative	Meeting minutes, agendas
creative	Creative & Marketing	Brochures, pitch decks
reference	Reference & Catalogs	Catalogs, directories, FAQs

Topics · 9

government · education · healthcare · finance · legal_judicial · technology · environment · nonprofit · general
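
As a rough illustration, the taxonomy can be written down as two plain lists whose cross product gives the 90 combinations. The IDs come from the lists above; the constant and function names (DOCUMENT_TYPES, TOPICS, is_valid_label) are hypothetical and not taken from the docx-corpus code.

```python
from itertools import product

# IDs copied from the tables above; the variable names are illustrative only.
DOCUMENT_TYPES = [
    "legal", "forms", "reports", "policies", "educational",
    "correspondence", "technical", "administrative", "creative", "reference",
]
TOPICS = [
    "government", "education", "healthcare", "finance", "legal_judicial",
    "technology", "environment", "nonprofit", "general",
]

# The two dimensions are independent, so every pair is a valid combination.
COMBINATIONS = list(product(DOCUMENT_TYPES, TOPICS))


def is_valid_label(document_type: str, topic: str) -> bool:
    """Check that both halves of a label belong to the taxonomy."""
    return document_type in DOCUMENT_TYPES and topic in TOPICS


print(len(COMBINATIONS))  # 10 types x 9 topics
```

Note that "legal" is a document type while "legal_judicial" is a topic, so a court-published contract template could plausibly be labeled (legal, legal_judicial) without the two dimensions colliding.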

How labels are produced

We follow the FineWeb-Edu pattern: sample a small, stratified set of documents across the five most common languages, label each with an LLM, then train a lightweight multilingual text classifier on those labels and apply it at scale.
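
The sampling step might look roughly like the sketch below. It assumes each document is a dict with a "language" field and that the sample is stratified by language only, with an equal quota per language; the actual strata, quotas, and field names used by docx-corpus are not stated here, so treat all of them as assumptions.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(docs, per_language, top_k=5, seed=0):
    """Draw up to `per_language` documents from each of the `top_k`
    most common languages. Hypothetical helper, not the project's code."""
    counts = Counter(d["language"] for d in docs)
    top_languages = [lang for lang, _ in counts.most_common(top_k)]

    # Group documents by language, keeping only the selected languages.
    by_language = defaultdict(list)
    for d in docs:
        if d["language"] in top_languages:
            by_language[d["language"]].append(d)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = []
    for lang in top_languages:
        pool = by_language[lang]
        sample.extend(rng.sample(pool, min(per_language, len(pool))))
    return sample
```

The resulting sample is what would be sent to the LLM for labeling; an equal quota per language keeps the smaller languages from being swamped by the largest one.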

The classifier is a fine-tuned XLM-RoBERTa model, trained on roughly 3,500 LLM-labeled examples. Two independent classifiers are trained, one per dimension: one for document_type and one for topic. Code is open: see the classification scripts on GitHub.
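
To make "one classifier per dimension" concrete, here is a minimal sketch of how the raw outputs (logits) of two separate models could be turned into a label and a confidence for each dimension. It assumes the confidence is the softmax probability of the winning class, which is a common convention but is not confirmed here; the function and field names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_label(logits, labels):
    """Return the highest-probability label and its probability."""
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]


def classify(type_logits, topic_logits, type_labels, topic_labels):
    """Combine the outputs of two independent classifiers into one record.
    Neither prediction influences the other."""
    doc_type, type_conf = pick_label(type_logits, type_labels)
    topic, topic_conf = pick_label(topic_logits, topic_labels)
    return {
        "document_type": doc_type,
        "document_type_confidence": type_conf,
        "topic": topic,
        "topic_confidence": topic_conf,
    }
```

Because the two heads are independent, a document can be confidently typed but ambiguously topical, or the reverse, which is why each dimension carries its own confidence in this sketch.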

Known limitations

Labeler ceiling. Labels come from Claude Haiku 4.5, so the classifier's accuracy is bounded above by the labeler's. No human-annotated test set has been published.
Trained on five languages, applied to 76. The labeled sample is drawn from 5 languages, but the classifier runs over 76. Accuracy on languages outside the sample has not been measured.
Topic skew. Government (33%) and education (25%) dominate. This reflects what is published on .gov and .edu sites, not a sampling choice.
Hard cases get one label. A document that spans several types still receives a single label. Filter by confidence to remove ambiguous cases.
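
Filtering by confidence, as suggested in the last point, could be sketched as follows. The field names (document_type_confidence, topic_confidence) and the 0.8 threshold are assumptions made for illustration; check the dataset's actual column names and choose a threshold that suits your use case.

```python
def filter_confident(records, min_confidence=0.8):
    """Keep only records whose labels on both dimensions meet the threshold.
    Field names and the default threshold are hypothetical."""
    return [
        r for r in records
        if r["document_type_confidence"] >= min_confidence
        and r["topic_confidence"] >= min_confidence
    ]


records = [
    {"id": 1, "document_type_confidence": 0.95, "topic_confidence": 0.90},
    {"id": 2, "document_type_confidence": 0.95, "topic_confidence": 0.50},
    {"id": 3, "document_type_confidence": 0.80, "topic_confidence": 0.80},
]
kept = filter_confident(records)
```

Requiring both confidences to pass is the stricter choice; if only one dimension matters for your analysis, filtering on that field alone keeps more documents.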
