Validation, confidence, and what's known to be weak
For researchers deciding whether docx-corpus is fit for their task.
Every candidate file must be a real, openable DOCX before entering the corpus. We verify the ZIP container is well-formed and that the OOXML parts a Word document is required to have are present. Files that fail any check are recorded as failed (URL only, no stored bytes) so subsequent re-runs don't re-fetch them.
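The check described above can be sketched roughly as follows. This is a minimal illustration, not the corpus's actual validator: the function name `looks_like_docx` and the exact list in `REQUIRED_PARTS` are assumptions, since the text does not say which OOXML parts are checked.

```python
import io
import zipfile
import zlib

# ASSUMPTION: the corpus's real checklist is not stated in the text.
# These are the parts a Word-produced .docx normally contains.
REQUIRED_PARTS = ("[Content_Types].xml", "_rels/.rels", "word/document.xml")


def looks_like_docx(data: bytes) -> bool:
    """Return True if data is a readable ZIP that holds the assumed required parts."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            # testzip() reads every member and returns the name of the
            # first one with a bad CRC or header, or None if all are fine.
            if zf.testzip() is not None:
                return False
            names = set(zf.namelist())
    except (zipfile.BadZipFile, zlib.error, EOFError, OSError):
        # Not a ZIP at all, or too damaged to read.
        return False
    return all(part in names for part in REQUIRED_PARTS)
```

A file that fails this kind of check would be recorded by URL only, with no bytes stored.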
Current failed count: 117,862.
Content-addressed by SHA-256 of the raw .docx bytes. Two URLs pointing at byte-identical files collapse to one storage record. Current dedup outcome: 241,993 duplicate records pointing at 1,101,537 unique-content uploads.
Exact dedup only. Near-duplicates (same template with different filled-in values, or re-saved with metadata-only changes) are not detected.
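As a rough sketch of how content addressing behaves, assuming an in-memory store (the names `content_key` and `dedupe` are hypothetical and not part of docx-corpus):

```python
import hashlib


def content_key(data: bytes) -> str:
    """SHA-256 hex digest of the raw file bytes, used as the storage key."""
    return hashlib.sha256(data).hexdigest()


def dedupe(uploads):
    """Collapse byte-identical uploads.

    uploads: iterable of (url, raw_bytes) pairs.
    Returns (store, url_to_key, duplicate_count), where store maps each
    digest to one copy of the bytes.
    """
    store = {}
    url_to_key = {}
    duplicates = 0
    for url, data in uploads:
        key = content_key(data)
        if key in store:
            duplicates += 1      # same bytes already stored under this key
        else:
            store[key] = data    # first time this content is seen
        url_to_key[url] = key
    return store, url_to_key, duplicates
```

Because the key is a hash of the raw bytes, a file that differs by even one byte (for example, a re-save that only touches metadata) gets a different key and is stored as a separate document.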
Half the corpus has a confidence score of 0.9 or higher. For label-noise-sensitive tasks, filter on min_confidence=0.7 (~76% of corpus) or 0.8 (~65%). The score is an uncalibrated softmax output; use it as a relative ranking, not as a probability of correctness.
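A minimal sketch of applying such a threshold on the client side, assuming each record is a dict with a `confidence` field (the field name and both helper functions are hypothetical; only `min_confidence` comes from the text):

```python
def filter_by_confidence(records, min_confidence=0.7):
    """Keep records whose score meets the threshold (inclusive)."""
    return [r for r in records if r["confidence"] >= min_confidence]


def rank_by_confidence(records):
    """Order records from most to least confident.

    The score is uncalibrated, so only the ordering is meaningful,
    not the absolute value.
    """
    return sorted(records, key=lambda r: r["confidence"], reverse=True)
```

A stricter threshold trades corpus size for lower expected label noise, so it may be worth comparing downstream results at both 0.7 and 0.8.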
- Common Crawl bias. Government and education domains over-publish .docx; intranets and login-walled content are absent.
- Topic skew. Government (33%) + education (25%) = 58% of classified documents.
- English-heavy. English is 33% of classified documents. Less widely used languages are present, but each makes up a small share.
- No human evaluation set. Classification accuracy is bounded by the LLM labeler.
- Near-duplicates. Same template with different fields remain as distinct documents.
- Lang detection on short docs. Lingua is reliable on full text, less so on short/symbol-heavy content (the unknown bucket).
- Word counts cover extracted text only. Text in images is not counted.
- License covers metadata, not originals. Document copyright stays with the author.