Dataset card

What's in docx-corpus

Name: docx-corpus
Creator: SuperDoc
License: https://opendatacommons.org/licenses/by/1-0/

This card covers schema, coverage, and access methods. All figures come from api.docxcorp.us/stats and direct database queries.

Counts · 736,242 classified

Bucket	Count
Total uploaded	1,101,537
Classified (type + topic + language)	736,242
Extracted, awaiting classification	267,539
Uploaded, awaiting extraction	93,440
Extracted with empty text	4,316
Duplicate (cross-crawl)	241,993
Failed (WARC or invalid .docx)	117,862

The classified subset is what the browser, API, and HuggingFace dataset expose by default. The two backlog buckets (awaiting classification and awaiting extraction) move into the classified set as extraction and classification catch up. Failed, duplicate, and empty-text rows stay in the database for accounting and do not reach the classified set under the current pipeline.
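The card does not state how the buckets relate to the total, so the following is a quick arithmetic check rather than a documented guarantee. Using the counts from the table, the four buckets "classified", "awaiting classification", "awaiting extraction", and "empty text" sum exactly to "Total uploaded", which suggests that duplicates and failures are counted separately. The grouping in `uploaded_parts` is an assumption inferred from that sum.

```python
# Bucket counts copied from the table above.
buckets = {
    "classified": 736_242,
    "awaiting_classification": 267_539,
    "awaiting_extraction": 93_440,
    "empty_text": 4_316,
    "duplicate": 241_993,
    "failed": 117_862,
}
total_uploaded = 1_101_537

# Assumption: these four buckets partition "Total uploaded";
# the card itself does not say so explicitly.
uploaded_parts = (
    "classified",
    "awaiting_classification",
    "awaiting_extraction",
    "empty_text",
)
uploaded_sum = sum(buckets[name] for name in uploaded_parts)

# Share of uploaded documents that have been classified so far.
classified_share = buckets["classified"] / total_uploaded

print(uploaded_sum == total_uploaded)
print(f"{classified_share:.1%}")
```

Under that reading, roughly two thirds of uploaded documents are currently in the classified set.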

Per-document schema

Field	Type	Description
id	string	SHA-256 of raw .docx bytes; also the R2 storage key
filename	string	Original filename from source URL
type	enum (10)	Form/structure label, see /classification
topic	enum (9)	Subject domain label
language	ISO 639-1	Detected by lingua, 76 distinct values
word_count	int	Median 566, mean 2,795
confidence	float	`min(type_conf, topic_conf)`, see /quality
url	string	`https://docxcorp.us/documents/{id}.docx`

Extracted text is available at https://docxcorp.us/extracted/{id}.txt. Raw .docx and extracted text responses both carry the header X-Robots-Tag: noindex.
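As a minimal sketch of how the `id`, `url`, and `confidence` fields fit together, the helpers below derive them from the descriptions in the table. The function names are hypothetical, and the choice of lowercase hexadecimal for the SHA-256 digest is an assumption; the card only says the id is the SHA-256 of the raw bytes.

```python
import hashlib

BASE_URL = "https://docxcorp.us"


def doc_id(raw_docx: bytes) -> str:
    # Assumption: the id is the lowercase hex form of the SHA-256 digest.
    return hashlib.sha256(raw_docx).hexdigest()


def document_url(doc_id_value: str) -> str:
    # URL pattern for the raw .docx, as given in the schema table.
    return f"{BASE_URL}/documents/{doc_id_value}.docx"


def extracted_url(doc_id_value: str) -> str:
    # URL pattern for the extracted plain text.
    return f"{BASE_URL}/extracted/{doc_id_value}.txt"


def combined_confidence(type_conf: float, topic_conf: float) -> float:
    # The confidence field is the weaker of the two classifier scores.
    return min(type_conf, topic_conf)
```

Because the id is a content hash, a downloaded file could in principle be checked against its own id, provided the hex-encoding assumption holds.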

Access

HuggingFace: superdoc-dev/docx-corpus · Parquet, bulk download
REST API: api.docxcorp.us · faceted queries
Manifest: api.docxcorp.us/manifest · URL list for wget/curl
Per document: docxcorp.us/documents/{id}.docx

See /download for code examples.
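For readers who want to work with the manifest offline, here is a hedged sketch of extracting document ids from it. The card does not specify the manifest format, so this assumes plain text with one document URL per line and ids that are 64 lowercase hex characters; `ids_from_manifest` is a hypothetical helper, not part of any published client.

```python
import re

# Assumption: each manifest line is a full document URL with a 64-char hex id.
DOC_URL = re.compile(r"^https://docxcorp\.us/documents/([0-9a-f]{64})\.docx$")


def ids_from_manifest(manifest_text: str) -> list[str]:
    """Return document ids found in manifest text, skipping other lines."""
    ids = []
    for line in manifest_text.splitlines():
        match = DOC_URL.match(line.strip())
        if match:
            ids.append(match.group(1))
    return ids


# Usage with a made-up two-line sample (not real corpus ids).
sample = (
    "https://docxcorp.us/documents/" + "a" * 64 + ".docx\n"
    "not a document url\n"
)
sample_ids = ids_from_manifest(sample)
print(len(sample_ids))
```

Lines that do not match the expected pattern are skipped rather than raising, so the helper tolerates blank lines or comments if the real manifest contains them.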

License

Dataset metadata: ODC-BY 1.0. Pipeline source: MIT. Individual document copyright remains with the original author. Takedown: [email protected].

Citing

docx-corpus (2026). Open corpus of classified Word documents from the public web. https://docxcorp.us. Built by SuperDoc.

/classification /quality /download