Open Research Dataset

The largest open DOCX dataset and document corpus.

Real .docx files from the public web, validated, deduplicated, and labeled by type, topic, and language.

Built by SuperDoc.

docx-corpus is the largest open corpus of classified Word documents on the public web. It contains 736,000+ real .docx files collected from Common Crawl, validated, deduplicated, and labeled by document type and topic. It is the first large-scale research dataset for DOCX, the world's most-used document creation format, which has historically been absent from document AI work built on scanned images and PDFs.

Each document has a content hash, file size, detected language, and type and topic labels assigned by a fine-tuned XLM-RoBERTa text classifier. The corpus spans 46+ languages and uses a taxonomy of 10 document types and 9 topics. It is available as a Hugging Face dataset, through a REST API at api.docxcorp.us, and as downloadable manifest files. Built by SuperDoc, with source code on GitHub. The dataset is licensed under ODC-BY; the source code is MIT-licensed.

How to download

Use the filters above to select a subset, then click Download manifest to get a text file with one URL per line.

# Download all files in the manifest
wget -i manifest.txt -P ./corpus/

# Or with curl
xargs -n 1 curl -O < manifest.txt

# Or fetch the manifest itself from the API, with filters as query parameters
curl "https://api.docxcorp.us/manifest?type=legal&lang=en&min_confidence=0.8" -o manifest.txt
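
The commands above can be combined into a small script. What follows is a minimal sketch, assuming the /manifest endpoint returns the same one-URL-per-line text file as the Download manifest button and that the filter values need no URL encoding. The function names manifest_url and download_subset are illustrative and not part of docx-corpus.

```shell
#!/bin/sh
# Sketch: build a manifest URL from filter values, then download its files.
# The endpoint and the type/lang/min_confidence parameters are taken from the
# curl example above; the function names are hypothetical helpers.

API_BASE="https://api.docxcorp.us"

# Print the manifest URL for a document type, language, and minimum confidence.
# Assumes plain values (e.g. "legal", "en", "0.8") that need no URL encoding.
manifest_url() {
  printf '%s/manifest?type=%s&lang=%s&min_confidence=%s\n' \
    "$API_BASE" "$1" "$2" "$3"
}

# Fetch the manifest at URL $1 and download every listed file into
# directory $2 (default: ./corpus).
download_subset() {
  out_dir="${2:-./corpus}"
  mkdir -p "$out_dir" || return 1
  # --fail makes curl exit non-zero on an HTTP error instead of saving the error body.
  curl --fail --silent --show-error "$1" -o "$out_dir/manifest.txt" || return 1
  # --no-clobber skips files that already exist, so an interrupted run can be restarted.
  wget --no-clobber -i "$out_dir/manifest.txt" -P "$out_dir"
}

# Usage (performs network requests):
#   download_subset "$(manifest_url legal en 0.8)" ./corpus-legal-en
```

Keeping the URL construction separate from the download step makes it easy to print and inspect the manifest URL before fetching anything.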