Methodology
How docx-corpus classifies documents
Two-dimensional taxonomy. Each document is labeled independently along two dimensions: document_type (form) and topic (subject). With 10 document types and 9 topics, 90 combinations are possible.
Document types · 10
| ID | Label | Covers |
|---|---|---|
| legal | Legal | Contracts, NDAs, terms, regulations |
| forms | Forms & Applications | Applications, surveys, ballots |
| reports | Reports & Analysis | Annual reports, research papers, case studies |
| policies | Policies & Procedures | Privacy policies, handbooks, SOPs |
| educational | Educational | Syllabi, lesson plans, theses |
| correspondence | Correspondence | Letters, memos, press releases |
| technical | Technical Docs | Manuals, API docs, specifications |
| administrative | Administrative | Meeting minutes, agendas |
| creative | Creative & Marketing | Brochures, pitch decks |
| reference | Reference & Catalogs | Catalogs, directories, FAQs |
Topics · 9
government · education · healthcare · finance · legal_judicial · technology · environment · nonprofit · general
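As a minimal sketch of the label space described above, the two dimensions can be written as plain lists and combined with a Cartesian product. The IDs are copied from the table and topic list; the constant and function names (`DOCUMENT_TYPES`, `TOPICS`, `LABEL_SPACE`, `is_valid_label`) are illustrative and not part of the docx-corpus code.

```python
from itertools import product

# IDs taken from the "Document types" table and the "Topics" list.
DOCUMENT_TYPES = [
    "legal", "forms", "reports", "policies", "educational",
    "correspondence", "technical", "administrative", "creative", "reference",
]
TOPICS = [
    "government", "education", "healthcare", "finance", "legal_judicial",
    "technology", "environment", "nonprofit", "general",
]

# Each document gets one value per dimension, so the full label space is the
# Cartesian product of the two lists.
LABEL_SPACE = list(product(DOCUMENT_TYPES, TOPICS))


def is_valid_label(document_type: str, topic: str) -> bool:
    """Return True if both values belong to their respective dimension."""
    return document_type in DOCUMENT_TYPES and topic in TOPICS


print(len(LABEL_SPACE))  # 10 * 9 = 90
```

Because the dimensions are independent, a pair such as ("legal", "finance") is as valid as ("legal", "legal_judicial"); nothing ties a document type to a particular topic.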
How labels are produced
We follow the FineWeb-Edu pattern: sample a small, stratified set of documents across the five most common languages, label each one with an LLM, then train a lightweight multilingual text classifier on those labels and apply it at scale.
The classifier is a fine-tuned XLM-RoBERTa model, trained on roughly 3,500 LLM-labeled examples. We train two independent classifiers, one per dimension (document_type and topic). Code is open: see the classification scripts on GitHub.
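The sampling step can be illustrated with a short sketch. It assumes stratification by language only and a hypothetical record layout with a `language` field; the actual strata and quotas are defined in the project's classification scripts, and the function below (`stratified_sample`) is not taken from them.

```python
import random
from collections import defaultdict


def stratified_sample(docs, key, per_stratum, seed=0):
    """Draw up to `per_stratum` documents from each group defined by `key`.

    `docs` is a list of dicts; `key` names the field to stratify on.
    The field name and the quota are assumptions made for this example.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[key]].append(doc)

    sample = []
    for stratum in sorted(groups):
        members = groups[stratum]
        # Small strata contribute everything they have.
        k = min(per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Toy corpus: three well-represented languages and one rare one.
corpus = [{"id": f"{lang}-{i}", "language": lang}
          for lang in ("en", "es", "fr") for i in range(10)]
corpus.append({"id": "de-0", "language": "de"})

to_label = stratified_sample(corpus, key="language", per_stratum=3)
print(len(to_label))  # 3 + 3 + 3 + 1 = 10
```

The sampled documents would then go to the LLM labeler, and the resulting labels would serve as training data for the two classifiers.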
Known limitations
- Labeler ceiling. The classifiers learn from labels produced by Claude Haiku 4.5, so its accuracy is their upper bound. No human-annotated test set has been published.
- Trained on 5 languages, applied to 76. The training sample is drawn from 5 languages, but the classifier runs over 76. Accuracy on languages outside the sample has not been measured.
- Topic skew. Government (33%) and education (25%) dominate. Reflects what's published on .gov and .edu, not a sampling choice.
- Hard cases get one label. Documents spanning types still pick a single label. Filter by confidence to remove ambiguous cases.
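
Filtering by confidence, as suggested in the last limitation, might look like the sketch below. The field names `document_type_confidence` and `topic_confidence`, the one-score-per-dimension layout, and the 0.8 threshold are all assumptions made for illustration; check the published dataset schema for the real column names and pick a threshold that suits your use case.

```python
def filter_confident(
    records,
    threshold=0.8,
    fields=("document_type_confidence", "topic_confidence"),
):
    """Keep records whose confidence meets `threshold` on every listed field.

    The field names and default threshold are hypothetical placeholders.
    """
    return [
        record for record in records
        if all(record[field] >= threshold for field in fields)
    ]


# Toy records using the assumed layout.
records = [
    {"id": "a", "document_type": "legal", "topic": "finance",
     "document_type_confidence": 0.95, "topic_confidence": 0.91},
    {"id": "b", "document_type": "reports", "topic": "general",
     "document_type_confidence": 0.55, "topic_confidence": 0.88},
    {"id": "c", "document_type": "forms", "topic": "government",
     "document_type_confidence": 0.90, "topic_confidence": 0.42},
]

kept = filter_confident(records)
print([r["id"] for r in kept])  # ['a']
```

Requiring both scores to clear the threshold is the stricter choice; passing a single field in `fields` filters on one dimension only.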