Tool · Python
extract-cli
Ingest any contract and get structured JSON. Hand extract-cli
any document — yours or a counterparty's foreign paper, in .md /
.txt / .html / .docx /
.pdf — and it returns the parties, dates, term, governing law, a normalized
clause map, a defined-term inventory, and a headline value. Every field carries a
confidence and a source, so you
verify, don't trust. Stdlib-only, single-file, local-first.
It works standalone — and also composes with the contract-ops suite as its open-loop front door: where the rest of the pipeline handles documents it authored from its own templates, extract turns any foreign paper into the suite's canonical, structured vocabulary that review, lint, compare, and the vault consume.
Install with pipx install extract-cli, then run extract demo for a
zero-config tour of a bundled NDA. Point it at your own file with
extract counterparty.docx. The default path makes no network calls;
--llm enrichment is opt-in.
Try it live
Paste a contract (or use the preset) and watch the deterministic tier pull parties, clauses, dates, and governing law — in a sandbox, no install.
What it does
- One tool, five formats. .md/.txt/.html are native; .docx and .pdf work out of the box via stdlib readers, with optional [docx]/[pdf] extras for higher fidelity. HTML is auto-detected even inside a .txt (e.g. SEC EDGAR filings).
- The clause map is the differentiator. A counterparty's "SECTION 7. NON-DISCLOSURE" and your template's "## Confidentiality" are the same clause. extract-cli reuses template-vault-cli's clause-detection cascade (## H2 → bold-numbered → ALL-CAPS) and a built-in canonical alias vocabulary to normalize foreign titles onto the names the rest of the suite speaks. Unmappable clauses are kept with mapped: false, so nothing is silently dropped.
- Two explicit tiers. The deterministic tier (always on, no network) extracts parties, dates, defined terms, the clause map, governing law, and best-effort term/notice/value. The llm tier is opt-in via --llm only and adds renewal mechanics, obligation phrasing, and a clause-map fallback when a document carries no detectable headings.
- Verify, don't trust. Each scalar is a {value, confidence, source} envelope; "not found" is the canonical {value: null, confidence: 0.0, source: "none"}. _meta records which tiers ran and whether the LLM was used.
- A good pipe citizen. JSON on stdout by default; --why, warnings, and errors on stderr. Exit 1 on a low-signal document (scanned/empty) is a finding, not a crash: valid JSON is still emitted.
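As a rough illustration of how a caller might consume the envelope and the exit-code convention described above, here is a minimal Python sketch. The field names governing_law and term, the source label "deterministic", and the 0.8 threshold are assumptions made for the example; only the {value, confidence, source} shape, the canonical "not found" envelope, and the meaning of exit 1 come from the text.

```python
import json

# The canonical "not found" envelope quoted in the list above.
NOT_FOUND = {"value": None, "confidence": 0.0, "source": "none"}


def accept(envelope, min_confidence=0.8):
    """Return the value only if it was found and clears the threshold.

    min_confidence is a hypothetical caller-side policy, not a tool default.
    """
    if envelope == NOT_FOUND or envelope["confidence"] < min_confidence:
        return None
    return envelope["value"]


def interpret(stdout_text, exit_code):
    """Parse stdout as JSON; exit 1 marks a low-signal document, not a crash."""
    doc = json.loads(stdout_text)  # valid JSON is emitted even on exit 1
    low_signal = exit_code == 1
    return doc, low_signal


# Hand-written sample standing in for real output (shape assumed).
sample_stdout = json.dumps({
    "governing_law": {"value": "Delaware", "confidence": 0.93,
                      "source": "deterministic"},
    "term": {"value": None, "confidence": 0.0, "source": "none"},
})

doc, low_signal = interpret(sample_stdout, exit_code=0)
print(accept(doc["governing_law"]), accept(doc["term"]), low_signal)
```

Keeping the threshold on the caller's side fits the "verify, don't trust" stance: the tool reports confidence and the consumer decides what is good enough.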
Quickstart
# Install (zero runtime deps; extras improve .docx/.pdf fidelity)
pipx install extract-cli
extract --version
# Zero-config: extract a bundled NDA and show the narrative
extract demo
# Your own document → structured JSON on stdout
extract counterparty.docx | jq '{parties: [.parties[].name],
  governing_law: .governing_law.value, clauses: [.clauses[].canonical_title]}'
For agents and automation
extract-cli answers the suite's discovery contract: extract --catalog json
returns {name, bin, version, description, commands[], exitCodes} so an
agent learns every command and flag at startup instead of hardcoding them. The output shape is locked
by a JSON Schema (extract schema, committed at
docs/spec/)
with a semver commitment: a backward-incompatible change requires a major version bump. The full agent
contract — output envelope, exit codes, discovery, and failure → recovery — lives in
AGENTS.md.
extract --catalog json # discover commands/flags (call at startup)
extract schema # the output JSON Schema (the data contract)
extract counterparty.pdf | jq -e '.clauses | all(.confidence > 0.7)' # gate on confidence
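For an agent written in Python, the same discovery and gating steps might look like the sketch below. It assumes extract is on PATH for load_catalog and that each clause object carries a numeric confidence, as the jq gate implies; the helper names are hypothetical and the sample data is hand-written, not real tool output.

```python
import json
import subprocess

# Top-level keys the catalog is described as returning.
CATALOG_KEYS = {"name", "bin", "version", "description", "commands", "exitCodes"}


def load_catalog():
    """Run `extract --catalog json` and parse stdout (needs extract on PATH)."""
    proc = subprocess.run(
        ["extract", "--catalog", "json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout)


def missing_catalog_keys(catalog):
    """List expected top-level keys that the catalog does not provide."""
    return sorted(CATALOG_KEYS - set(catalog))


def all_clauses_confident(doc, threshold=0.7):
    """Python equivalent of jq's `.clauses | all(.confidence > 0.7)`."""
    return all(clause["confidence"] > threshold for clause in doc["clauses"])


# Hand-written samples so the sketch runs without the CLI installed.
sample_catalog = {"name": "extract-cli", "bin": "extract", "version": "0.0.0",
                  "description": "sample", "commands": []}
sample_doc = {"clauses": [{"canonical_title": "Confidentiality", "confidence": 0.9},
                          {"canonical_title": "Governing Law", "confidence": 0.6}]}

print(missing_catalog_keys(sample_catalog))
print(all_clauses_confident(sample_doc))
```

Like jq's all, the Python gate passes on an empty clause list, so a caller that also needs at least one clause has to check for that separately.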
The integration contract is the output schema plus the shared canonical
clause vocabulary — extract's canonical_title values are the same
names template-vault-cli detects and
nda-review-cli keys policy on, so a foreign document's clauses line
up with the suite's with no bespoke adapter. See
docs/INTEROP.md.
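To make the shared vocabulary concrete, the following sketch shows one way a heading cascade (## H2, then bold-numbered, then ALL-CAPS) and an alias table could map foreign titles onto canonical names. It is not the suite's implementation: the regular expressions, the ALIASES table, the "first pattern that finds any headings wins" rule, and the choice of None for an unmapped canonical_title are all assumptions for illustration.

```python
import re

# Hypothetical heading patterns, tried in cascade order.
H2 = re.compile(r"^##\s+(.+?)\s*$")
BOLD_NUMBERED = re.compile(r"^\*\*\s*\d+(?:\.\d+)*[.)]?\s+(.+?)\s*\*\*\s*$")
ALL_CAPS = re.compile(
    r"^(?:(?:SECTION|ARTICLE)\s+)?(?:\d+(?:\.\d+)*[.)]?\s+)?"
    r"([A-Z][A-Z \-&/]{2,})\.?\s*$"
)

# Hypothetical alias vocabulary: lowercase foreign title -> canonical name.
ALIASES = {
    "confidentiality": "Confidentiality",
    "non-disclosure": "Confidentiality",
    "governing law": "Governing Law",
    "choice of law": "Governing Law",
}


def detect_headings(text):
    """Return titles from the first pattern in the cascade that matches any line."""
    lines = [line.strip() for line in text.splitlines()]
    for pattern in (H2, BOLD_NUMBERED, ALL_CAPS):
        hits = [m.group(1) for line in lines if (m := pattern.match(line))]
        if hits:
            return hits
    return []


def clause_map(text):
    """Normalize each detected title; unmapped clauses are kept, not dropped."""
    result = []
    for title in detect_headings(text):
        canonical = ALIASES.get(title.strip().lower())
        result.append({"title": title, "canonical_title": canonical,
                       "mapped": canonical is not None})
    return result


foreign = ("SECTION 7. NON-DISCLOSURE\nEach party shall keep it secret.\n"
           "SECTION 8. FORCE MAJEURE\nNeither party is liable.")
own = "## Confidentiality\nEach party shall keep it secret."

print(clause_map(foreign))
print(clause_map(own))
```

Both documents end up with a Confidentiality entry, which is what lets a downstream policy keyed on canonical names apply to either one.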
The opt-in --llm tier reads the suite-wide
~/.config/contract-ops/llm.json, so configuring it once lights up LLM features
across every tool that adopts the same lookup order.
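A tool that wants to honor the same file could start from something like the sketch below. Only the path comes from the text; the function name is hypothetical, the file's keys are not documented here and are treated as opaque, and any further steps in the suite's lookup order are not modeled.

```python
import json
from pathlib import Path

# Suite-wide config location named above.
DEFAULT_LLM_CONFIG = Path.home() / ".config" / "contract-ops" / "llm.json"


def load_llm_config(path=DEFAULT_LLM_CONFIG):
    """Return the parsed config, or None when the file is absent.

    None is this sketch's way of saying "leave the LLM tier off".
    """
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))
    except FileNotFoundError:
        return None


config = load_llm_config()
print("LLM configured" if config is not None else "LLM tier off")
```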
Where it fits in the workflow
extract-cli is the open-loop front door at the very start of the nine-CLI workflow. The closed loop (template-vault → draft → nda-review → compare → docx2pdf → sign) only handles documents authored from its own templates. extract-cli is how foreign paper gets in: it turns a counterparty's contract into the suite's structured vocabulary, which you can then hand to nda-review for a policy verdict or to compare to align it against your canonical template. See the full workflow for the chained commands.
Repo
github.com/DrBaher/extract-cli · MIT licensed · Stdlib-only Python ([docx]/[pdf] extras add higher-fidelity readers) · PyPI-installable.