docx-corpus
Open Dataset

The largest classified corpus of Word documents

.docx files from the public web, classified into document types and topics across dozens of languages.

Built by 🦋 SuperDoc

How to download

Use the filters above to select a subset, then click Download manifest to get a text file with one URL per line.

# Download all files in the manifest
wget -i manifest.txt -P ./corpus/

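# Or with curl (one URL per invocation; files land in the current directory)
xargs -n 1 curl -O < manifest.txt

# Or fetch a filtered manifest directly from the API, then download as above
curl "https://api.docxcorp.us/manifest?type=legal&lang=en&min_confidence=0.8" -o manifest.txt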
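
The commands above can be combined into one small script. The following is a minimal sketch, not an official client: the helper names manifest_url and fetch_corpus are hypothetical, and it assumes the API accepts only the three query parameters shown in the example (type, lang, min_confidence) and returns a plain-text manifest with one URL per line.

```shell
#!/usr/bin/env bash
# Sketch: build a manifest URL from filter values, then download the files.
# The endpoint and the parameter names (type, lang, min_confidence) are taken
# from the curl example above; everything else here is an assumption.

API="https://api.docxcorp.us/manifest"

# manifest_url TYPE LANG MIN_CONFIDENCE
# Prints the manifest query URL for the given filters.
manifest_url() {
  printf '%s?type=%s&lang=%s&min_confidence=%s\n' "$API" "$1" "$2" "$3"
}

# fetch_corpus TYPE LANG MIN_CONFIDENCE [OUTDIR]
# Saves the manifest into OUTDIR (default ./corpus), then downloads every
# URL listed in it.
fetch_corpus() {
  local outdir="${4:-./corpus}"
  mkdir -p "$outdir"
  # -f makes curl fail on HTTP errors instead of saving an error page.
  curl -fsSL "$(manifest_url "$1" "$2" "$3")" -o "$outdir/manifest.txt" || return 1
  # xargs -P 4 runs four downloads at once; wget -c resumes partial files
  # if the script is rerun after an interruption.
  xargs -n 1 -P 4 wget -q -c -P "$outdir" < "$outdir/manifest.txt"
}

# Example call (performs network requests):
#   fetch_corpus legal en 0.8 ./corpus
```

Keeping the URL construction in its own function makes it easy to check a filter combination before downloading anything, and the parallelism of four is an arbitrary choice that can be lowered if the server throttles requests.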