Quality & limits

Validation, confidence, and what's known to be weak

For researchers deciding whether docx-corpus is fit for their task.

Validation gates

Every candidate file must be a real, openable DOCX before entering the corpus. We verify the ZIP container is well-formed and that the OOXML parts a Word document is required to have are present. Files that fail any check are recorded as failed (URL only, no stored bytes) so subsequent re-runs don't re-fetch them.
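As a rough illustration of the gate described above, the following Python sketch checks that a byte string is a readable ZIP archive and that it contains a small set of parts. The function name `is_valid_docx` and the exact part list in `REQUIRED_PARTS` are assumptions made for this example; the checks the corpus actually runs may be stricter or differ in detail.

```python
import io
import zipfile
import zlib

# Assumed minimal part list for a Word document; the corpus's real list may differ.
REQUIRED_PARTS = ("[Content_Types].xml", "_rels/.rels", "word/document.xml")


def is_valid_docx(data: bytes) -> bool:
    """Return True if data is a readable ZIP holding every assumed required part."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            # testzip() reads every member and returns the first one whose
            # CRC check fails, or None if all members are intact.
            if zf.testzip() is not None:
                return False
            names = set(zf.namelist())
    except (zipfile.BadZipFile, zlib.error, EOFError, OSError):
        # Not a ZIP at all, or truncated / corrupt compressed data.
        return False
    return all(part in names for part in REQUIRED_PARTS)
```

A caller could record the URL of any file for which this returns False and skip it on later runs, which mirrors the failed-record behaviour described above.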

Current failed count: 117,862.

Deduplication

Documents are content-addressed by the SHA-256 of the raw .docx bytes, so two URLs pointing at byte-identical files collapse to one storage record. Current dedup outcome: 241,993 duplicate records, each pointing at one of 1,101,537 unique-content uploads.

Exact dedup only. Near-duplicates (same template with different filled-in values, or re-saved with metadata-only changes) are not detected.
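A minimal sketch of exact, content-addressed deduplication, assuming records arrive as (URL, bytes) pairs. The names `content_key` and `dedup` and the in-memory dictionaries are hypothetical stand-ins for whatever storage layer the corpus uses.

```python
import hashlib


def content_key(data: bytes) -> str:
    """SHA-256 hex digest of the raw file bytes, used as the storage key."""
    return hashlib.sha256(data).hexdigest()


def dedup(records):
    """Collapse byte-identical files.

    records: iterable of (url, data) pairs.
    Returns (store, url_to_key, duplicates) where store maps each digest to
    one copy of the bytes and duplicates counts records whose content was
    already stored.
    """
    store = {}
    url_to_key = {}
    duplicates = 0
    for url, data in records:
        key = content_key(data)
        if key in store:
            duplicates += 1  # same bytes seen before: keep only the pointer
        else:
            store[key] = data
        url_to_key[url] = key
    return store, url_to_key, duplicates
```

Because the key is a hash of the exact bytes, a file that differs by even one byte gets a different key, which is why near-duplicates pass through this step undetected.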

Confidence distribution · n=736,242

Confidence   Share
0.9-1.0      49.8%
0.8-0.9      14.8%
0.7-0.8      10.1%
0.6-0.7       9.1%
0.5-0.6       8.8%
0.4-0.5       5.4%
0.3-0.4       1.8%
below 0.3     0.2%

Half of the scored documents are at 0.9+. For label-noise-sensitive tasks, filter on min_confidence=0.7 (~75% of scored documents) or 0.8 (~65%). The score is an uncalibrated softmax output; use it as a relative ranking, not a probability of correctness.
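To see what a given threshold keeps, the bin shares from the distribution above can be summed from the top down. The sketch below does that and also shows a simple filter; the `confidence` field name and the list-of-dicts record shape are assumptions for illustration, not the dataset's documented schema.

```python
# (lower bound of bin, share of scored documents in percent), from the table above.
BINS = [
    (0.9, 49.8), (0.8, 14.8), (0.7, 10.1), (0.6, 9.1),
    (0.5, 8.8), (0.4, 5.4), (0.3, 1.8), (0.0, 0.2),
]


def share_at_or_above(min_confidence: float) -> float:
    """Percent of scored documents whose bin starts at or above the threshold."""
    total = sum(share for lower, share in BINS if lower >= min_confidence)
    return round(total, 1)


def filter_docs(docs, min_confidence: float):
    """Keep records whose (assumed) 'confidence' field meets the threshold."""
    return [d for d in docs if d["confidence"] >= min_confidence]


print(share_at_or_above(0.8))  # 64.6
print(share_at_or_above(0.7))  # 74.7
```

Since the score is uncalibrated, a threshold chosen this way trades corpus size against expected label noise; it does not guarantee any particular accuracy among the kept documents.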

Known limitations

Common Crawl bias. Government and education domains over-publish .docx; intranets and login-walled content are absent.
Topic skew. Government (33%) + education (25%) = 58% of classified.
English-heavy. English is 33% of classified. Other languages are present, but each holds a small share.
No human evaluation set. Classification accuracy is bounded by the LLM labeler.
Near-duplicates. Documents built from the same template with different field values remain distinct documents.
Language detection on short docs. Lingua is reliable on full text, less so on short or symbol-heavy content (the unknown bucket).
Word counts cover extracted text only. Text in images is not counted.
License covers metadata, not originals. Document copyright stays with the author.
