Access

Two ways to pull docx-corpus

HuggingFace for the metadata Parquet. R2 manifest for bulk .docx files. No signup, no API key.

Pick by use case

You want…	Use
The full dataset row by row (id, filename, type, topic, language, word_count, confidence, url)	HuggingFace
Raw .docx files in bulk, filtered by type/topic/language	Manifest + R2
Just the extracted plain text	Extracted text
One specific document	`docxcorp.us/documents/{id}.docx`

HuggingFace

The full classified set as a single Parquet dataset. One row per document, with the metadata you'd use for training or filtering. The url column points at the raw .docx on R2 if you want to pair metadata with the original file.

from datasets import load_dataset

ds = load_dataset("superdoc-dev/docx-corpus", split="train")
print(ds[0])
# {'id': '...', 'filename': '...', 'type': 'legal',
#  'topic': 'government', 'language': 'en',
#  'word_count': 1234, 'confidence': 0.94,
#  'url': 'https://docxcorp.us/documents/{id}.docx'}

License: ODC-BY 1.0. Cite as superdoc-dev/docx-corpus.
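
To narrow the dataset by its metadata columns, a row predicate is one option. The sketch below assumes the column names shown in the sample row above (type, language, confidence); the function name is_wanted and its default values are illustrative, not part of the dataset.

```python
def is_wanted(row, doc_type="legal", language="en", min_confidence=0.8):
    """Return True if a metadata row matches the requested type, language, and confidence."""
    return (
        row["type"] == doc_type
        and row["language"] == language
        and row["confidence"] >= min_confidence
    )

# Possible usage with the dataset loaded above (needs network access):
#   subset = ds.filter(is_wanted)
#   urls = subset["url"]
```

Keeping the predicate as a plain function on a dict means it can be checked against a handful of sample rows before running it over the full set.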

Manifest + R2

Files live on Cloudflare R2 at docxcorp.us/documents/{id}.docx, content-addressed by SHA-256. The manifest endpoint returns a newline-delimited list of those URLs, filtered by type/topic/language/confidence. Pipe it into wget or curl for bulk transfer.

# Manifest of English legal documents at 0.8+ confidence
curl "https://api.docxcorp.us/manifest?type=legal&lang=en&min_confidence=0.8" -o manifest.txt

# Bulk download, 4 wget processes in parallel (reads the manifest from stdin, so it works with both GNU and BSD xargs)
xargs -n 1 -P 4 wget -q -P ./corpus/ < manifest.txt

# Full classified set (~736K URLs, ~73 MB text file)
curl "https://api.docxcorp.us/manifest" -o all.txt

Manifest filter params: type, topic, lang (ISO 639-1), min_confidence (0.0 to 1.0). Capped at 2M URLs per response.
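
If you would rather build the manifest request in Python than by hand, something like the following may help. It is a minimal sketch that uses only the endpoint and the four filter params listed above; the helper names manifest_url and parse_manifest are hypothetical, and the validation rules are an assumption based on the stated ranges.

```python
from urllib.parse import urlencode

MANIFEST_ENDPOINT = "https://api.docxcorp.us/manifest"
ALLOWED_PARAMS = {"type", "topic", "lang", "min_confidence"}


def manifest_url(**filters):
    """Build a manifest URL; filters left as None are omitted."""
    unknown = set(filters) - ALLOWED_PARAMS
    if unknown:
        raise ValueError(f"unsupported filter(s): {sorted(unknown)}")
    params = {k: v for k, v in filters.items() if v is not None}
    conf = params.get("min_confidence")
    if conf is not None and not 0.0 <= conf <= 1.0:
        raise ValueError("min_confidence must be between 0.0 and 1.0")
    if not params:
        return MANIFEST_ENDPOINT
    return f"{MANIFEST_ENDPOINT}?{urlencode(params)}"


def parse_manifest(text):
    """Split a newline-delimited manifest into a list of non-empty URLs."""
    return [line.strip() for line in text.splitlines() if line.strip()]
```

With requests installed, the two helpers could be combined as parse_manifest(requests.get(manifest_url(type="legal", lang="en"), timeout=60).text).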

If you already have the HuggingFace dataset and just want files for a subset you've filtered in Python, skip the manifest and use the url column directly.
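
When downloading from the url column yourself, it can be handy to derive a local file name from each URL and to sanity-check the bytes you received. The sketch below relies on a loud assumption: that {id} is the lowercase hex SHA-256 digest of the .docx bytes, which is one reading of "content-addressed by SHA-256" and is not confirmed here. The helper names are hypothetical.

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import urlparse


def local_name(url):
    """Return the file name part of a document URL, e.g. '{id}.docx'."""
    return PurePosixPath(urlparse(url).path).name


def matches_id(data, url):
    """True if the SHA-256 of data equals the id in the URL.

    ASSUMPTION: the id is the hex SHA-256 digest of the file bytes.
    """
    expected = PurePosixPath(urlparse(url).path).stem.lower()
    return hashlib.sha256(data).hexdigest() == expected
```

If the assumption does not hold for your files, drop the check and keep only local_name.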

Extracted text

Plain-text extraction lives alongside the raw files at docxcorp.us/extracted/{id}.txt. Useful when you only need the text and don't want to parse OOXML yourself.

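import requests

doc_id = "..."  # placeholder: an id value from the dataset
response = requests.get(f"https://docxcorp.us/extracted/{doc_id}.txt", timeout=30)
response.raise_for_status()  # fail loudly on 404s and other HTTP errors
text = response.text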

Both raw .docx and extracted text return X-Robots-Tag: noindex so they don't compete with the corpus pages in search.
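
If you are starting from the url column rather than a bare id, the extracted-text URL can be derived from the document URL. This is a small sketch that assumes the two path patterns given above (/documents/{id}.docx and /extracted/{id}.txt) share the same id; the function name extracted_text_url is hypothetical.

```python
from urllib.parse import urlparse, urlunparse


def extracted_text_url(document_url):
    """Map .../documents/{id}.docx to .../extracted/{id}.txt on the same host."""
    parts = urlparse(document_url)
    prefix, suffix = "/documents/", ".docx"
    if not (parts.path.startswith(prefix) and parts.path.endswith(suffix)):
        raise ValueError(f"not a document URL: {document_url}")
    doc_id = parts.path[len(prefix):-len(suffix)]
    return urlunparse(parts._replace(path=f"/extracted/{doc_id}.txt"))
```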

Rate limits

Downloads from R2 are unmetered for normal research workloads. For full-corpus pulls, prefer the HuggingFace Parquet (one download) over hitting R2 several hundred thousand times. For scheduled pipelines that fetch heavily, email [email protected] first so we can plan capacity.

For interactive browsing (clicking through types/topics/filters in your browser), use the homepage explorer, which is backed by the same data.
