Open Research Dataset

The largest open DOCX dataset and document corpus.

Name: docx-corpus
Creator: SuperDoc
License: https://opendatacommons.org/licenses/by/1-0/

Real .docx files from the public web, validated, deduplicated, and labeled by type, topic, and language.

Built by SuperDoc.


About the dataset

docx-corpus is the largest open corpus of classified Word documents on the public web. It contains 736,000+ real .docx files collected from Common Crawl, validated, deduplicated, and labeled by document type and topic. It is the first large-scale research dataset for DOCX, the world's most-used document creation format, which has historically been absent from document AI work built on scanned images and PDFs.

Each document record includes a content hash, file size, detected language, and type and topic labels assigned by a fine-tuned XLM-RoBERTa text classifier. The corpus spans 46+ languages and uses a taxonomy of 10 document types and 9 topics. It is available as a Hugging Face dataset, through a REST API at api.docxcorp.us, and as downloadable manifest files. The source code is on GitHub. The dataset is licensed under ODC-BY; the source code is licensed under MIT.
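The validation and deduplication steps described above can be approximated locally on files you have already downloaded. The following is a minimal sketch, not the corpus's own pipeline. It assumes SHA-256 as the content hash (this page does not say which hash the corpus uses), and the function names is_docx and dedupe_dir are hypothetical. The validity check relies on the fact that a .docx file is a ZIP archive containing word/document.xml.

```shell
#!/usr/bin/env bash
# Sketch only: local validation and de-duplication of downloaded .docx files.
# Assumption: SHA-256 stands in for the corpus's content hash, which is not
# specified on this page. Requires unzip and sha256sum (GNU coreutils).

# Return 0 if the file is a ZIP archive that lists word/document.xml.
is_docx() {
  unzip -l "$1" 2>/dev/null | grep -q 'word/document.xml'
}

# Print one path per unique content hash among the .docx files in a directory.
# sha256sum prints a 64-character hash plus a two-character separator, so the
# path starts at column 67. File names containing backslashes or newlines are
# not handled.
dedupe_dir() {
  local dir="$1"
  sha256sum "$dir"/*.docx | sort | awk '!seen[$1]++ { print substr($0, 67) }'
}
```

For example, after a download into ./corpus/, running dedupe_dir ./corpus lists one representative file per distinct content, and is_docx can be applied to each path to skip files that are not valid Word documents.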


How to download

Select a subset with the type, topic, language, and minimum-confidence filters on the dataset page, then click Download manifest to get a text file with one URL per line.

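# Fetch a manifest directly from the API instead of using the Download manifest button
curl "https://api.docxcorp.us/manifest?type=legal&lang=en&min_confidence=0.8" -o manifest.txt

# Download all files in the manifest into ./corpus/
wget -i manifest.txt -P ./corpus/

# Or with curl (saves into the current directory; -L follows redirects)
xargs -n 1 curl -L -O < manifest.txt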
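The commands above can be wrapped in a small script so that a subset is selected by arguments rather than by editing the URL. This is a sketch under stated assumptions: the endpoint and the type, lang, and min_confidence parameters are taken from the example above, while the function names manifest_url and fetch_subset, the default confidence of 0.8, and the default output directory are hypothetical choices made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: build a manifest URL from filter values, then download the files.
# The endpoint and query parameters come from the example on this page; the
# function names and defaults are illustrative assumptions.

API_BASE="https://api.docxcorp.us"

# Print the manifest URL for a type, a language, and a minimum confidence.
# Values are inserted as-is and are not URL-encoded.
manifest_url() {
  local type="$1" lang="$2" min_conf="${3:-0.8}"
  printf '%s/manifest?type=%s&lang=%s&min_confidence=%s\n' \
    "$API_BASE" "$type" "$lang" "$min_conf"
}

# Fetch the manifest, then download every listed file into an output directory.
fetch_subset() {
  local outdir="${4:-./corpus}"
  mkdir -p "$outdir"
  # --fail makes curl return an error on HTTP error responses instead of
  # saving the error body as the manifest.
  curl --fail --silent --show-error -L \
    "$(manifest_url "$1" "$2" "$3")" -o "$outdir/manifest.txt" || return 1
  # -c resumes partially downloaded files if the script is run again.
  wget -c -i "$outdir/manifest.txt" -P "$outdir"
}
```

Calling fetch_subset legal en 0.8 would request the same manifest as the curl example and store the files under ./corpus/. Keeping URL construction in its own function makes it easy to check the request before any download starts.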
