Quality & limits

Validation, confidence, and what's known to be weak

For researchers deciding whether docx-corpus is fit for their task.

Every candidate file must be a real, openable DOCX before entering the corpus. We verify the ZIP container is well-formed and that the OOXML parts a Word document is required to have are present. Files that fail any check are recorded as failed (URL only, no stored bytes) so subsequent re-runs don't re-fetch them.
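The two checks described above can be sketched with the standard library. This is a minimal illustration, not the corpus's actual validator: the `REQUIRED_PARTS` list and the `validate_docx` name are assumptions, since this section does not say which OOXML parts are checked.

```python
import io
import zipfile

# Assumption: a plausible minimum set of parts for a Word document.
# The real required-parts list used by the corpus is not stated here.
REQUIRED_PARTS = ("[Content_Types].xml", "_rels/.rels", "word/document.xml")


def validate_docx(data: bytes) -> tuple[bool, str]:
    """Return (ok, reason) for the raw bytes of a candidate file."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            # testzip() reads every member and returns the first corrupt one.
            bad = zf.testzip()
            if bad is not None:
                return False, f"corrupt member: {bad}"
            names = set(zf.namelist())
    except zipfile.BadZipFile:
        return False, "not a well-formed ZIP container"
    missing = [part for part in REQUIRED_PARTS if part not in names]
    if missing:
        return False, "missing parts: " + ", ".join(missing)
    return True, "ok"
```

A caller following the policy above would store only the URL when `ok` is false and skip that URL on later runs.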

Current failed count: 117,862.

Documents are content-addressed by the SHA-256 of the raw .docx bytes, so two URLs pointing at byte-identical files collapse to one storage record. Current dedup outcome: 241,993 duplicate records pointing at 1,101,537 unique-content uploads.

Exact dedup only. Near-duplicates (same template with different filled-in values, or re-saved with metadata-only changes) are not detected.
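A minimal sketch of exact, content-addressed dedup follows. The `ContentStore` class and its fields are hypothetical names for illustration; only the use of SHA-256 over the raw bytes comes from the text above.

```python
import hashlib


class ContentStore:
    """Hypothetical in-memory store keyed by SHA-256 of the raw bytes."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}         # digest -> one stored copy
        self.url_to_digest: dict[str, str] = {}   # every URL, duplicates included

    def add(self, url: str, data: bytes) -> bool:
        """Record the URL; return True if the content is new, False if duplicate."""
        digest = hashlib.sha256(data).hexdigest()
        self.url_to_digest[url] = digest
        if digest in self.blobs:
            return False
        self.blobs[digest] = data
        return True
```

Because the key is a hash of the exact bytes, a file that differs by a single byte (for example a re-save that changes only metadata) gets a different digest and is stored as a separate document.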

Confidence    Share of corpus
0.9-1.0       49.8%
0.8-0.9       14.8%
0.7-0.8       10.1%
0.6-0.7        9.1%
0.5-0.6        8.8%
0.4-0.5        5.4%
0.3-0.4        1.8%
below 0.3      0.2%

Half the corpus scores 0.9 or higher. For label-noise-sensitive tasks, filter on min_confidence=0.7 (~75% of corpus) or 0.8 (~65%). The score is an uncalibrated softmax output; use it as a relative ranking, not as a probability of correctness.
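To see how much of the corpus a given cutoff keeps, the binned shares can be summed from the top down. The sketch below copies the bin values from the distribution above; the function names and the `confidence` record field are assumptions for illustration, not a documented schema.

```python
# Binned shares (percent of corpus), keyed by each bin's lower edge.
# The "below 0.3" bin is keyed at 0.0.
BIN_SHARE = {
    0.9: 49.8, 0.8: 14.8, 0.7: 10.1, 0.6: 9.1,
    0.5: 8.8, 0.4: 5.4, 0.3: 1.8, 0.0: 0.2,
}


def share_at_or_above(threshold: float) -> float:
    """Percent of the corpus kept when filtering at a bin-edge threshold."""
    kept = sum(share for edge, share in BIN_SHARE.items() if edge >= threshold)
    return round(kept, 1)


def filter_by_confidence(records: list[dict], min_confidence: float) -> list[dict]:
    """Keep records whose (hypothetical) 'confidence' field meets the cutoff."""
    return [r for r in records if r["confidence"] >= min_confidence]
```

Since the score is only a relative ranking, the cutoff is best chosen by how much data the task can afford to drop, not by reading the threshold as an accuracy level.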

  1. Common Crawl bias. Government and education domains over-publish .docx; intranets and login-walled content are absent.
  2. Topic skew. Government (33%) + education (25%) = 58% of classified documents.
  3. English-heavy. English is 33% of classified documents. Other languages are present, each with a small share.
  4. No human evaluation set. Classification accuracy is bounded by the LLM labeler.
  5. Near-duplicates. Documents built from the same template with different field values remain distinct.
  6. Language detection on short documents. Lingua is reliable on full text, less so on short or symbol-heavy content (the unknown bucket).
  7. Word counts cover extracted text only. Text in images is not counted.
  8. License covers metadata, not originals. Document copyright stays with the author.