.docx files from the public web, classified into document types and topics across dozens of languages.
Built by 🦋 SuperDoc
Use the filters above to select a subset, then click Download manifest to get a text file with one URL per line.
# Download all files in the manifest
wget -i manifest.txt -P ./corpus/
# Or with curl (saves each file into the current directory)
xargs -n 1 curl -O < manifest.txt
# Or fetch directly via the API
curl "https://api.docxcorp.us/manifest?type=legal&lang=en&min_confidence=0.8" -o manifest.txt
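The commands above can be combined into a small script. The following is a minimal sketch, not an official client: the helper names `manifest_url`, `clean_manifest`, and `fetch_corpus` are hypothetical, and the only query parameters it uses (`type`, `lang`, `min_confidence`) are the ones shown in the example URL above. It assumes `wget`, `awk`, and an `xargs` that supports `-P` are installed.

```shell
#!/usr/bin/env bash
# Sketch: build a manifest URL from filter values, then download its files.
# Helper names are illustrative; query parameters are copied from the example above.

manifest_url() {
  # Usage: manifest_url TYPE LANG MIN_CONFIDENCE
  printf 'https://api.docxcorp.us/manifest?type=%s&lang=%s&min_confidence=%s\n' "$1" "$2" "$3"
}

clean_manifest() {
  # Usage: clean_manifest MANIFEST_FILE
  # Drop blank lines and duplicate URLs, keeping first-seen order.
  awk 'NF && !seen[$0]++' "$1"
}

fetch_corpus() {
  # Usage: fetch_corpus MANIFEST_FILE OUT_DIR
  # Download every URL in the manifest, four at a time, into OUT_DIR.
  mkdir -p "$2"
  clean_manifest "$1" | xargs -n 1 -P 4 wget -q -P "$2"
}
```

With these functions sourced, `curl "$(manifest_url legal en 0.8)" -o manifest.txt` followed by `fetch_corpus manifest.txt ./corpus` would fetch the filtered manifest and then its files. The parallelism of 4 is an arbitrary choice; lower it if the hosts serving the files rate-limit requests.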
| Document | Type | Topic | Lang | Confidence |
|---|---|---|---|---|