Methodology

How docx-corpus classifies documents

Two-dimensional taxonomy. Each document gets two independent labels: document_type (its form) and topic (its subject). With 10 document types and 9 topics, 90 combinations are possible.

Document types:

ID              Label                  Covers
legal           Legal                  Contracts, NDAs, terms, regulations
forms           Forms & Applications   Applications, surveys, ballots
reports         Reports & Analysis     Annual reports, research papers, case studies
policies        Policies & Procedures  Privacy policies, handbooks, SOPs
educational     Educational            Syllabi, lesson plans, theses
correspondence  Correspondence         Letters, memos, press releases
technical       Technical Docs         Manuals, API docs, specifications
administrative  Administrative         Meeting minutes, agendas
creative        Creative & Marketing   Brochures, pitch decks
reference       Reference & Catalogs   Catalogs, directories, FAQs

Topics: government · education · healthcare · finance · legal_judicial · technology · environment · nonprofit · general
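
As a rough illustration of the label space, the sketch below enumerates every (document_type, topic) pair from the IDs listed above. The variable names are illustrative and are not taken from the docx-corpus code.

```python
from itertools import product

# IDs copied from the document type and topic lists above.
DOCUMENT_TYPES = [
    "legal", "forms", "reports", "policies", "educational",
    "correspondence", "technical", "administrative", "creative", "reference",
]
TOPICS = [
    "government", "education", "healthcare", "finance", "legal_judicial",
    "technology", "environment", "nonprofit", "general",
]

# Each document gets one label per dimension, so the full label space is the
# Cartesian product of the two lists.
COMBINATIONS = list(product(DOCUMENT_TYPES, TOPICS))

print(len(COMBINATIONS))  # 10 types x 9 topics = 90
```

Because the two dimensions are independent, a legal contract about banking and a legal contract about hospital billing share a document_type but differ in topic.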

We follow the FineWeb-Edu pattern: sample a small, stratified set of documents across the five most common languages, label each with an LLM, then train a lightweight multilingual text classifier on those labels and apply it at scale.
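
The sampling step might look something like the sketch below. It assumes one plausible reading of "stratified", namely an equal quota per language; the actual scripts may stratify differently. The `language` field, the `per_language` quota, and the toy corpus are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(docs, per_language, seed=0):
    """Draw up to `per_language` documents from each language bucket."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group documents by their (hypothetical) `language` field.
    by_language = defaultdict(list)
    for doc in docs:
        by_language[doc["language"]].append(doc)

    sample = []
    for language in sorted(by_language):
        bucket = by_language[language]
        # Small buckets contribute everything they have.
        k = min(per_language, len(bucket))
        sample.extend(rng.sample(bucket, k))
    return sample


# Toy corpus with an uneven language mix (made-up data).
languages = ["en"] * 50 + ["es"] * 20 + ["fr"] * 5
docs = [{"id": i, "language": lang} for i, lang in enumerate(languages)]

sample = stratified_sample(docs, per_language=10)
```

The sampled documents would then be sent to the LLM labeler, and the resulting labels become the classifier's training set.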

Each classifier is a fine-tuned XLM-RoBERTa model trained on roughly 3,500 LLM-labeled examples. Two classifiers are trained independently, one per dimension (document_type and topic). Code is open: see the classification scripts on GitHub.
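
To make "one classifier per dimension" concrete, here is a minimal sketch of how a single set of LLM-labeled records could be split into two separate training sets that share the same texts. The record layout and the helper `build_dimension_dataset` are assumptions for illustration, not the project's actual code, and the fine-tuning step itself is left out.

```python
def build_dimension_dataset(records, dimension):
    """Return (texts, label_ids, label2id) for one taxonomy dimension."""
    # Sort the label names so the integer IDs are stable across runs.
    labels = sorted({r[dimension] for r in records})
    label2id = {label: i for i, label in enumerate(labels)}
    texts = [r["text"] for r in records]
    label_ids = [label2id[r[dimension]] for r in records]
    return texts, label_ids, label2id


# Made-up LLM-labeled records; the field names are hypothetical.
records = [
    {"text": "This agreement is made between ...",
     "document_type": "legal", "topic": "finance"},
    {"text": "Course syllabus, fall term ...",
     "document_type": "educational", "topic": "education"},
    {"text": "Minutes of the council meeting ...",
     "document_type": "administrative", "topic": "government"},
]

# Same texts, two different targets: one dataset per classifier.
type_texts, type_ids, type_label2id = build_dimension_dataset(records, "document_type")
topic_texts, topic_ids, topic_label2id = build_dimension_dataset(records, "topic")
```

Training the two models separately means a mistake on one dimension does not force a mistake on the other, at the cost of running two forward passes per document.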

Limitations

  1. Labeler ceiling. The classifiers learn from labels produced by Claude Haiku 4.5, so the labeler's accuracy is their upper bound. No human-annotated test set has been published.
  2. Trained on 5 languages, applied to 76. The labeled sample is drawn from 5 languages, but the classifier runs over 76. Accuracy on languages outside the sample has not been measured.
  3. Topic skew. Government (33%) and education (25%) dominate. This reflects what is published on .gov and .edu domains, not a sampling choice.
  4. Hard cases get one label. A document that spans several types still receives a single label. Filter by confidence to remove ambiguous cases.
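
Filtering by confidence, as suggested in the last item, could be sketched as follows. The field names `document_type_confidence` and `topic_confidence` and the 0.8 threshold are assumptions made for this example; check the dataset's actual schema and pick a threshold that suits your use case.

```python
def filter_confident(rows, min_confidence=0.8):
    """Keep rows whose labels on both dimensions meet the threshold."""
    return [
        row for row in rows
        if row["document_type_confidence"] >= min_confidence
        and row["topic_confidence"] >= min_confidence
    ]


# Made-up rows; the column names are hypothetical.
rows = [
    {"id": "a", "document_type": "legal", "document_type_confidence": 0.95,
     "topic": "finance", "topic_confidence": 0.90},
    {"id": "b", "document_type": "reports", "document_type_confidence": 0.55,
     "topic": "government", "topic_confidence": 0.92},
    {"id": "c", "document_type": "forms", "document_type_confidence": 0.88,
     "topic": "general", "topic_confidence": 0.40},
]

kept = filter_confident(rows, min_confidence=0.8)
print([row["id"] for row in kept])  # only "a" clears 0.8 on both dimensions
```

A stricter threshold removes more ambiguous documents but also shrinks the corpus, and may shift the topic mix if some topics are systematically harder to classify.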