What's in docx-corpus
Schema, coverage, and access methods. Counts come from api.docxcorp.us/stats and direct database queries.
| Bucket | Count |
|---|---|
| Total uploaded | 1,101,537 |
| Classified (type + topic + language) | 736,242 |
| Extracted, awaiting classification | 267,539 |
| Uploaded, awaiting extraction | 93,440 |
| Extracted with empty text | 4,316 |
| Duplicate (cross-crawl) | 241,993 |
| Failed (WARC or invalid .docx) | 117,862 |
The classified subset is what the browser, API, and HuggingFace dataset expose by default. The two backlog buckets (awaiting extraction and awaiting classification) move into the classified set as extraction and classification catch up. Failed, duplicate, and empty-text rows stay in the database for accounting and do not reach the classified set under the current pipeline.
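The counts in the table can be sanity-checked in a few lines of Python. This sketch assumes that "Total uploaded" is the sum of the classified, backlog, and empty-text buckets, with duplicates and failures counted separately; the table does not say so explicitly, but the figures are consistent with that reading. The dictionary keys are illustrative names, not API fields.

```python
# Counts copied from the table above; key names are illustrative, not API fields.
buckets = {
    "classified": 736_242,
    "awaiting_classification": 267_539,
    "awaiting_extraction": 93_440,
    "empty_text": 4_316,
}
total_uploaded = 1_101_537

# Assumption: these four buckets partition the uploaded set
# (duplicates and failures are tracked outside it).
assert sum(buckets.values()) == total_uploaded

# Share of uploaded documents already classified, and share still in the backlog.
classified_share = buckets["classified"] / total_uploaded
backlog_share = (
    buckets["awaiting_classification"] + buckets["awaiting_extraction"]
) / total_uploaded

print(f"classified: {classified_share:.1%}")
print(f"backlog:    {backlog_share:.1%}")
```

If the assertion ever fails against fresh numbers from the stats endpoint, the partition assumption above is the first thing to revisit.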
| Field | Type | Description |
|---|---|---|
| id | string | SHA-256 of raw .docx bytes; also the R2 storage key |
| filename | string | Original filename from source URL |
| type | enum (10) | Form/structure label, see classification |
| topic | enum (9) | Subject domain label |
| language | ISO 639-1 | Detected by lingua, 76 distinct values |
| word_count | int | Median 566, mean 2,795 |
| confidence | float | min(type_conf, topic_conf), see quality |
| url | string | https://docxcorp.us/documents/{id}.docx |
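As a reading aid, one row of the schema above could be modelled as a small Python dataclass. This is a sketch rather than the project's own code: the class name `Document`, the helper `combined_confidence`, and every sample value are hypothetical. Only the field names and the min(type_conf, topic_conf) rule come from the table, and treating `id` as a hex digest is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """One classified record, mirroring the fields in the schema table."""
    id: str          # SHA-256 of the raw .docx bytes (hex encoding assumed)
    filename: str    # original filename from the source URL
    type: str        # one of 10 form/structure labels
    topic: str       # one of 9 subject-domain labels
    language: str    # ISO 639-1 code
    word_count: int
    confidence: float
    url: str


def combined_confidence(type_conf: float, topic_conf: float) -> float:
    """Record-level confidence is the weaker of the two classifier scores."""
    return min(type_conf, topic_conf)


# Placeholder values only; the label strings are not real corpus labels.
sample_id = "0" * 64
doc = Document(
    id=sample_id,
    filename="example.docx",
    type="placeholder-type",
    topic="placeholder-topic",
    language="en",
    word_count=566,
    confidence=combined_confidence(0.9, 0.6),
    url=f"https://docxcorp.us/documents/{sample_id}.docx",
)
print(doc.confidence)
```

Because the stored value is the minimum of the two scores, a threshold on `confidence` holds for the type label and the topic label at once.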
Extracted text is available at https://docxcorp.us/extracted/{id}.txt. Responses for both the raw .docx and the extracted text carry the header X-Robots-Tag: noindex.
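The two URL patterns and the content-addressed `id` can be sketched as follows. The function names are hypothetical, and the hex encoding of the SHA-256 digest is an assumption; the schema only says the id is the SHA-256 of the raw .docx bytes. The block makes no network requests.

```python
import hashlib

BASE_URL = "https://docxcorp.us"


def document_id(raw_docx: bytes) -> str:
    """SHA-256 of the raw .docx bytes (hex encoding is an assumption)."""
    return hashlib.sha256(raw_docx).hexdigest()


def document_url(doc_id: str) -> str:
    """URL of the raw .docx, following the pattern in the schema table."""
    return f"{BASE_URL}/documents/{doc_id}.docx"


def extracted_text_url(doc_id: str) -> str:
    """URL of the extracted plain text for the same document."""
    return f"{BASE_URL}/extracted/{doc_id}.txt"


# Placeholder bytes stand in for a real file read with open(path, "rb").
doc_id = document_id(b"placeholder bytes, not a real .docx")
print(document_url(doc_id))
print(extracted_text_url(doc_id))
```

Since the id is derived from the file contents, hashing a local copy gives a way to check whether that exact file is already in the corpus, provided the encoding assumption holds.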
See /download for code examples.
Dataset metadata: ODC-BY 1.0. Pipeline source: MIT. Individual document copyright remains with the original author. Takedown: [email protected].
docx-corpus (2026). Open corpus of classified Word documents from the public web. https://docxcorp.us. Built by SuperDoc.