The largest open DOCX dataset and document corpus.
Real .docx files from the public web, validated, deduplicated, and labeled by type, topic, and language.
Built by SuperDoc.
docx-corpus is the largest open corpus of classified Word documents on the public web. It contains 736,000+ real .docx files collected from Common Crawl, validated, deduplicated, and labeled by document type and topic. It is the first large-scale research dataset for DOCX, the world's most-used document creation format, which has historically been absent from document AI work built on scanned images and PDFs.
Each document has a content hash, file size, detected language, and a type and topic label assigned by a fine-tuned XLM-RoBERTa text classifier. The corpus spans 46+ languages and uses a 10-type, 9-topic taxonomy. It is available as a HuggingFace dataset, through a REST API at api.docxcorp.us, and as downloadable manifest files. The source code is on GitHub. The dataset is licensed under ODC-BY and the source code under MIT.
How to download
Use the filters above to select a subset, then click Download manifest to get a text file with one URL per line.
# Download all files in the manifest
wget -i manifest.txt -P ./corpus/
# Or with curl
xargs -n 1 curl -O < manifest.txt
# Or fetch a filtered manifest directly via the API, then download from it as above
curl "https://api.docxcorp.us/manifest?type=legal&lang=en&min_confidence=0.8" -o manifest.txt
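The commands above can be wrapped into a small script for repeated use. The following is a minimal sketch, not part of docx-corpus: the function names `manifest_url`, `clean_manifest`, and `download_manifest` are hypothetical, and the only query parameters assumed are the three that appear in the example URL (`type`, `lang`, `min_confidence`). It also assumes filter values need no URL encoding and that your `xargs` supports `-P` for parallel jobs (GNU and BSD versions do).

```shell
#!/usr/bin/env bash
# Sketch only: these helper names are hypothetical, not part of docx-corpus.

# Build a manifest URL from the three filters used in the example above.
# Assumes the values contain no characters that need URL encoding.
manifest_url() {
  local doc_type="$1" lang="$2" min_conf="$3"
  printf 'https://api.docxcorp.us/manifest?type=%s&lang=%s&min_confidence=%s\n' \
    "$doc_type" "$lang" "$min_conf"
}

# Drop blank lines and repeated URLs from a manifest, preserving order.
clean_manifest() {
  awk 'NF && !seen[$0]++' "$1"
}

# Download every URL in a manifest into a directory, four at a time.
# --fail makes curl exit non-zero on HTTP errors instead of saving error pages;
# --retry 3 retries transient failures; --remote-name keeps the remote file name.
download_manifest() {
  local manifest="$1" dest="$2"
  mkdir -p "$dest"
  clean_manifest "$manifest" |
    (cd "$dest" && xargs -n 1 -P 4 \
      curl --fail --location --silent --show-error --retry 3 --remote-name)
}
```

A typical session would fetch a manifest with `curl "$(manifest_url legal en 0.8)" -o manifest.txt` and then run `download_manifest manifest.txt ./corpus`. Cleaning the manifest first avoids fetching the same URL twice if several manifests have been concatenated.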