Two ways to pull docx-corpus
HuggingFace for the metadata Parquet. R2 manifest for bulk .docx files. No signup, no API key.
| You want… | Use |
|---|---|
| The full dataset row by row (id, filename, type, topic, language, word_count, confidence, url) | HuggingFace |
| Raw .docx files in bulk, filtered by type/topic/language | Manifest + R2 |
| Just the extracted plain text | Extracted text |
| One specific document | docxcorp.us/documents/{id}.docx |
The full classified set as a single Parquet dataset. One row per document, with the metadata you'd use for training or filtering. The url column points at the raw .docx on R2 if you want to pair metadata with the original file.
```python
from datasets import load_dataset

ds = load_dataset("superdoc-dev/docx-corpus", split="train")
print(ds[0])
# {'id': '...', 'filename': '...', 'type': 'legal',
#  'topic': 'government', 'language': 'en',
#  'word_count': 1234, 'confidence': 0.94,
#  'url': 'https://docxcorp.us/documents/{id}.docx'}
```
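To narrow the dataset before pulling any files, you can filter on the metadata columns. This is a minimal sketch, assuming the column names shown in the sample row above (`type`, `language`, `confidence`); `is_match` and `select_subset` are illustrative helper names, not part of the dataset or the `datasets` library.

```python
def is_match(row, doc_type="legal", language="en", min_confidence=0.8):
    """Return True if a metadata row passes all three filters."""
    return (
        row["type"] == doc_type
        and row["language"] == language
        and row["confidence"] >= min_confidence
    )


def select_subset(ds, **filters):
    """Apply is_match to a loaded dataset via its filter() method."""
    return ds.filter(lambda row: is_match(row, **filters))
```

With the dataset loaded as `ds`, `select_subset(ds, doc_type="legal", min_confidence=0.9)` would return only the rows that pass.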
License: ODC-BY 1.0. Cite as superdoc-dev/docx-corpus.
Files live on Cloudflare R2 at docxcorp.us/documents/{id}.docx, content-addressed by SHA-256. The manifest endpoint returns a newline-delimited list of those URLs, filtered by type/topic/language/confidence. Pipe it into wget or curl for bulk transfer.
```bash
# Manifest of English legal documents at 0.8+ confidence
curl "https://api.docxcorp.us/manifest?type=legal&lang=en&min_confidence=0.8" -o manifest.txt

# Bulk download (parallel 4)
xargs -n 1 -P 4 -a manifest.txt wget -q -P ./corpus/

# Full classified set (~736K URLs, ~73 MB text file)
curl "https://api.docxcorp.us/manifest" -o all.txt
```
Manifest filter params: type, topic, lang (ISO 639-1), min_confidence (0.0 to 1.0). Capped at 2M URLs per response.
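If you build manifest URLs from a script, a small helper keeps the query string consistent. This is a sketch that uses only the four documented parameters; the function name `manifest_url` and the client-side range check on `min_confidence` are illustrative additions, not something the API is known to require.

```python
from urllib.parse import urlencode

MANIFEST_URL = "https://api.docxcorp.us/manifest"


def manifest_url(doc_type=None, topic=None, lang=None, min_confidence=None):
    """Build a manifest URL, leaving out any filter that was not given."""
    if min_confidence is not None and not 0.0 <= min_confidence <= 1.0:
        raise ValueError("min_confidence must be between 0.0 and 1.0")
    params = {
        "type": doc_type,
        "topic": topic,
        "lang": lang,  # ISO 639-1 code, e.g. "en"
        "min_confidence": min_confidence,
    }
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{MANIFEST_URL}?{query}" if query else MANIFEST_URL
```

Calling `manifest_url(doc_type="legal", lang="en", min_confidence=0.8)` produces the same URL as the first curl example above.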
If you already have the HuggingFace dataset and just want files for a subset you've filtered in Python, skip the manifest and use the url column directly.
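A minimal sketch of that route, using only the standard library: it takes rows that each carry a `url` field and saves every file under its `{id}.docx` name. The helper names and the `corpus` output directory are illustrative. `matches_id` additionally assumes that the id is the lowercase hex SHA-256 of the file's bytes, which "content-addressed by SHA-256" suggests but does not spell out, so treat a mismatch as a prompt to check that assumption rather than as proof of corruption.

```python
import hashlib
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve


def local_path(url, out_dir="corpus"):
    """Map a document URL to a local path, keeping the {id}.docx filename."""
    return os.path.join(out_dir, os.path.basename(urlparse(url).path))


def download_rows(rows, out_dir="corpus"):
    """Download the file behind each row's url; skip files already on disk."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for row in rows:
        dest = local_path(row["url"], out_dir)
        if not os.path.exists(dest):
            urlretrieve(row["url"], dest)
        paths.append(dest)
    return paths


def matches_id(path, chunk_size=1 << 20):
    """Assumption: the filename stem is the file's SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    stem = os.path.splitext(os.path.basename(path))[0]
    return stem.lower() == digest.hexdigest()
```

Because existing files are skipped, an interrupted run can be restarted without fetching the same documents again.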
Plain-text extraction lives alongside the raw files at docxcorp.us/extracted/{id}.txt. Useful when you only need the text and don't want to parse OOXML yourself.
```python
import requests

doc_id = "..."  # a document id, e.g. from the dataset's id column
text = requests.get(f"https://docxcorp.us/extracted/{doc_id}.txt").text
```

Both raw .docx and extracted text return X-Robots-Tag: noindex so they don't compete with the corpus pages in search.
R2 fronting is unmetered for normal research workloads. For full-corpus pulls, prefer the HuggingFace Parquet (one download) over hitting R2 several hundred thousand times. For scheduled pipelines that fetch heavily, email [email protected] first so we can plan capacity.
For interactive browsing (clicking through types/topics/filters in your browser), use the homepage explorer, which is backed by the same data.