Dataset card

What's in docx-corpus

Schema, coverage, and access methods. Counts come from api.docxcorp.us/stats and direct database queries.

Bucket                                Count
Total uploaded                        1,101,537
Classified (type + topic + language)  736,242
Extracted, awaiting classification    267,539
Uploaded, awaiting extraction         93,440
Extracted with empty text             4,316
Duplicate (cross-crawl)               241,993
Failed (WARC or invalid .docx)        117,862

The classified subset is what the browser, API, and HuggingFace dataset expose by default. The pipeline-backlog buckets move into the classified set as extraction and classification catch up. Failed, duplicate, and empty-text rows stay in the database for accounting and do not reach the classified set under the current pipeline.
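As a quick sanity check on the bucket table, the sketch below adds up the four pipeline-state buckets and compares the result with the uploaded total. The counts are copied from the table; the dictionary keys are illustrative names, not field names from the database. The arithmetic suggests that the duplicate and failed counts sit outside the uploaded total, which is an inference from the numbers rather than something the card states.

```python
# Counts copied from the bucket table; key names are illustrative only.
buckets = {
    "classified": 736_242,
    "extracted_awaiting_classification": 267_539,
    "uploaded_awaiting_extraction": 93_440,
    "extracted_empty_text": 4_316,
}
total_uploaded = 1_101_537

# The four pipeline-state buckets add up to the uploaded total.
assert sum(buckets.values()) == total_uploaded

# Share of uploaded documents that are exposed by default.
classified_share = buckets["classified"] / total_uploaded
print(f"classified share: {classified_share:.1%}")  # classified share: 66.8%

# Rows that could still move into the classified set.
backlog = (
    buckets["extracted_awaiting_classification"]
    + buckets["uploaded_awaiting_extraction"]
)
print(f"backlog: {backlog:,}")  # backlog: 360,979
```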

Field       Type        Description
id          string      SHA-256 of raw .docx bytes; also the R2 storage key
filename    string      Original filename from source URL
type        enum (10)   Form/structure label, see classification
topic       enum (9)    Subject domain label
language    ISO 639-1   Detected by lingua, 76 distinct values
word_count  int         Median 566, mean 2,795
confidence  float       min(type_conf, topic_conf), see quality
url         string      https://docxcorp.us/documents/{id}.docx
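The schema can be mirrored as a small record type. The sketch below is one possible Python representation, not the project's own code: the class name, helper names, and sample values are hypothetical, and the type and topic strings are placeholders because the card does not list the enum values. It also shows the confidence rule from the table, where the record-level score is the lower of the two classifier scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentRecord:
    """One row of the schema; field names follow the table above."""
    id: str
    filename: str
    type: str        # one of 10 form/structure labels (values not listed here)
    topic: str       # one of 9 subject domain labels (values not listed here)
    language: str    # ISO 639-1 code
    word_count: int
    confidence: float
    url: str


def combined_confidence(type_conf: float, topic_conf: float) -> float:
    """Record-level confidence is the weaker of the two label scores."""
    return min(type_conf, topic_conf)


def filter_confident(records, threshold):
    """Keep records whose combined confidence meets the threshold."""
    return [r for r in records if r.confidence >= threshold]


# Hypothetical sample row; every value here is a placeholder.
sample = DocumentRecord(
    id="0" * 64,
    filename="example.docx",
    type="placeholder-type",
    topic="placeholder-topic",
    language="en",
    word_count=566,
    confidence=combined_confidence(0.9, 0.6),
    url="https://docxcorp.us/documents/" + "0" * 64 + ".docx",
)
```

Taking the minimum means a document only scores high when both the type and the topic label are confident, so a threshold on confidence filters on the weaker label.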

Extracted text available at https://docxcorp.us/extracted/{id}.txt. Raw .docx and extracted text both return X-Robots-Tag: noindex.
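Because id is the SHA-256 of the raw .docx bytes, a downloaded file can be checked against its id, and both URLs follow from the id alone. The sketch below assumes the id is the lowercase hex digest, which the card does not state; the function names are illustrative.

```python
import hashlib

# URL templates as given in this card.
DOCUMENT_URL = "https://docxcorp.us/documents/{id}.docx"
EXTRACTED_URL = "https://docxcorp.us/extracted/{id}.txt"


def document_id(raw_bytes: bytes) -> str:
    """SHA-256 of the raw bytes, hex-encoded (the encoding is an assumption)."""
    return hashlib.sha256(raw_bytes).hexdigest()


def document_url(doc_id: str) -> str:
    """URL of the raw .docx for a given id."""
    return DOCUMENT_URL.format(id=doc_id)


def extracted_url(doc_id: str) -> str:
    """URL of the extracted plain text for a given id."""
    return EXTRACTED_URL.format(id=doc_id)


def matches_id(raw_bytes: bytes, doc_id: str) -> bool:
    """True if the bytes hash to the expected id."""
    return document_id(raw_bytes) == doc_id
```

A typical use would be to call matches_id on the bytes of a downloaded file before trusting that it corresponds to the metadata row with that id.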

HuggingFace: superdoc-dev/docx-corpus · Parquet, bulk download
REST API: api.docxcorp.us · faceted queries
Manifest: api.docxcorp.us/manifest · URL list for wget/curl
Per document: docxcorp.us/documents/{id}.docx

See /download for code examples.
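For the manifest route, a minimal sketch of turning a URL list back into document ids. It assumes the manifest is plain text with one document URL per line and that ids are 64 lowercase hex characters; neither detail is confirmed by this card, and the function name is hypothetical. The code works on text that has already been fetched, so it makes no network calls.

```python
import re

# Matches per-document URLs of the form given in this card.
# The 64-hex-character id format is an assumption based on SHA-256.
DOC_URL_RE = re.compile(
    r"^https://docxcorp\.us/documents/([0-9a-f]{64})\.docx$"
)


def ids_from_manifest(manifest_text: str) -> list[str]:
    """Return the ids of all well-formed document URLs, in file order."""
    ids = []
    for line in manifest_text.splitlines():
        match = DOC_URL_RE.match(line.strip())
        if match:
            ids.append(match.group(1))
    return ids


# Hypothetical manifest content for illustration.
example_manifest = "\n".join([
    "https://docxcorp.us/documents/" + "a" * 64 + ".docx",
    "",
    "not a url",
    "https://docxcorp.us/documents/" + "b" * 64 + ".docx",
])
example_ids = ids_from_manifest(example_manifest)
```

Lines that do not match the pattern are skipped rather than raising, so blank lines or comments in the list would not stop processing.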

Dataset metadata: ODC-BY 1.0. Pipeline source: MIT. Individual document copyright remains with the original author. Takedown: [email protected].

docx-corpus (2026). Open corpus of classified Word documents from the public web. https://docxcorp.us. Built by SuperDoc.
