docx-corpus
docx-corpus (docxcorp.us) is the largest open-source corpus of classified Word documents on the public web. It contains 736,000+ real .docx files scraped from Common Crawl, then validated, deduplicated, and classified along two axes: document type (10 categories: legal, forms, educational, administrative, policies, correspondence, reports, reference, technical, creative) and topic (9 categories: government, education, healthcare, general, legal/judicial, finance, environment, nonprofit, technology). The corpus covers 46+ languages, identified by automatic language detection. Document AI research has previously relied on scanned images and PDFs; DOCX, the world's most-used document creation format, had no large-scale research dataset, and docx-corpus fills this gap. It is available as a HuggingFace dataset, through a REST API at api.docxcorp.us, and as downloadable manifest files. The project is built by SuperDoc (superdoc.dev), an open-source document engine for native .docx rendering. The pipeline and source code are at github.com/superdoc-dev/docx-corpus, released under the MIT license.
Open Research Dataset

The largest open corpus of classified Word documents

Real .docx files from the public web, classified into 10 document types and 9 topics across 46+ languages.

Built by SuperDoc — DOCX editing and tooling.

How to download

Use the filters on docxcorp.us to select a subset, then click Download manifest to get a text file with one URL per line.

# Download all files in the manifest
wget -i manifest.txt -P ./corpus/

# Or with curl
xargs -n 1 curl -O < manifest.txt

# Or fetch directly via the API
curl "https://api.docxcorp.us/manifest?type=legal&lang=en&min_confidence=0.8" -o manifest.txt
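
The commands above can be combined into one small script. The following is a minimal sketch, not part of the project itself: the function names `manifest_url` and `fetch_corpus`, the default paths, and the choice of four parallel downloads are all illustrative assumptions. It relies only on the three query parameters shown in the API example (`type`, `lang`, `min_confidence`) and on the manifest containing one URL per line.

```shell
#!/usr/bin/env bash
# Sketch only: function names, defaults, and parallelism are assumptions.

# Build a manifest URL from the query parameters used in the API example.
manifest_url() {
  local type="$1" lang="$2" min_conf="$3"
  printf 'https://api.docxcorp.us/manifest?type=%s&lang=%s&min_confidence=%s\n' \
    "$type" "$lang" "$min_conf"
}

# Fetch a manifest, then download every URL it lists into a directory.
fetch_corpus() {
  local url="$1" dest="${2:-./corpus}" manifest="${3:-manifest.txt}"
  mkdir -p "$dest"
  # -f: fail on HTTP errors, -sS: quiet but report errors, -L: follow redirects
  curl -fsSL "$url" -o "$manifest" || return 1
  # Nothing to do if the filter matched no documents.
  [ -s "$manifest" ] || return 0
  # Four downloads at a time (arbitrary); -nc skips files already on disk,
  # so an interrupted run can be restarted.
  xargs -n 1 -P 4 wget -q -nc -P "$dest" < "$manifest"
}
```

For example, `fetch_corpus "$(manifest_url legal en 0.8)" ./corpus` would request the same manifest as the curl command above and save the listed files under `./corpus/`. Keeping the URL construction in its own function makes it easy to check the query string before starting a large download.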