contract-ops CLI suite

Tool · Python

extract-cli

extract-cli v0.1.17 0 533/wk on npm Latest from PyPI · checked Sun, 07 Jun 2026

Ingest any contract and get structured JSON. Hand extract-cli any document — yours or a counterparty's foreign paper, in .md / .txt / .html / .docx / .pdf — and it returns the parties, dates, term, governing law, a normalized clause map, a defined-term inventory, and a headline value. Every field carries a confidence and a source, so you verify, don't trust. Stdlib-only, single-file, local-first.

It works standalone, and it also composes with the contract-ops suite as its open-loop front door. The rest of the pipeline handles documents it authored from its own templates; extract turns any foreign paper into the suite's canonical, structured vocabulary that review, lint, compare, and the vault consume.

TL;DR — pipx install extract-cli, then extract demo for a zero-config tour on a bundled NDA. Point it at your own file with extract counterparty.docx. Stdlib-only, no network on the default path; --llm enrichment is opt-in.

What it does

  • One tool, five formats. .md/.txt/.html are native; .docx and .pdf work out of the box via stdlib readers, with optional [docx]/[pdf] extras for higher fidelity. HTML is auto-detected even inside a .txt (e.g. SEC EDGAR filings).
  • The clause map is the differentiator. A counterparty's "SECTION 7. NON-DISCLOSURE" and your template's "## Confidentiality" are the same clause. extract-cli reuses template-vault-cli's clause-detection cascade (## H2 → bold-numbered → ALL-CAPS) and a built-in canonical alias vocabulary to normalize foreign titles onto the names the rest of the suite speaks. Unmappable clauses are kept with mapped: false — nothing is silently dropped.
  • Two explicit tiers. The deterministic tier (always on, no network) extracts parties, dates, defined terms, the clause map, governing law, and best-effort term/notice/value. The llm tier is opt-in via --llm only; it adds renewal mechanics, obligation phrasing, and a clause-map fallback when a document carries no detectable headings.
  • Verify, not trust. Each scalar is a {value, confidence, source} envelope; "not found" is the canonical {value: null, confidence: 0.0, source: "none"}. _meta records which tiers ran and whether the LLM was used.
  • A good pipe citizen. JSON on stdout by default; --why, warnings, and errors on stderr. Exit 1 on a low-signal document (scanned/empty) is a finding, not a crash — valid JSON is still emitted.
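
The envelope convention from the list above can be consumed along these lines. This is a minimal sketch, assuming a scalar field arrives as a dict with the value, confidence, and source keys described; the helper name `accept`, the 0.7 default threshold, and the sample values (including the "regex" source label) are illustrative and not part of extract-cli.

```python
from typing import Any, Optional

# The canonical "not found" envelope described above.
NOT_FOUND = {"value": None, "confidence": 0.0, "source": "none"}


def accept(envelope: dict, min_confidence: float = 0.7) -> Optional[Any]:
    """Return the envelope's value if present and confident enough, else None."""
    if envelope.get("value") is None or envelope.get("source") == "none":
        return None  # nothing was extracted
    if envelope.get("confidence", 0.0) < min_confidence:
        return None  # extracted, but too uncertain to use without review
    return envelope["value"]


# Hypothetical sample envelopes; real ones come from the CLI's JSON output.
governing_law = {"value": "Delaware", "confidence": 0.92, "source": "regex"}
term = {"value": "2 years", "confidence": 0.4, "source": "regex"}

print(accept(governing_law))  # → Delaware
print(accept(term))           # → None
print(accept(NOT_FOUND))      # → None
```

Returning None for both "not found" and "low confidence" keeps the caller simple; a stricter caller might distinguish the two and route low-confidence values to manual review.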

Quickstart

install + run
# Install (zero runtime deps; extras improve .docx/.pdf fidelity)
pipx install extract-cli
extract --version

# Zero-config: extract a bundled NDA and show the narrative
extract demo

# Your own document → structured JSON on stdout
extract counterparty.docx | jq '{parties: [.parties[].name],
  governing_law: .governing_law.value, clauses: [.clauses[].canonical_title]}'
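
For callers that would rather stay in Python than pipe through jq, the same projection might look like the sketch below. The field names are taken from the jq filter above; `summarize` is a hypothetical helper, and the sample document is invented purely to show the shape.

```python
import json


def summarize(doc: dict) -> dict:
    """Mirror the jq filter: party names, governing-law value, canonical clause titles."""
    return {
        "parties": [party["name"] for party in doc["parties"]],
        "governing_law": doc["governing_law"]["value"],
        "clauses": [clause["canonical_title"] for clause in doc["clauses"]],
    }


# Invented sample standing in for the CLI's stdout.
sample = json.loads("""
{
  "parties": [{"name": "Acme Corp"}, {"name": "Globex LLC"}],
  "governing_law": {"value": "New York", "confidence": 0.9, "source": "text"},
  "clauses": [{"canonical_title": "Confidentiality", "mapped": true}]
}
""")

summary = summarize(sample)
print(json.dumps(summary, indent=2))
```

In practice the input would come from the CLI, for example via `subprocess.run(["extract", "counterparty.docx"], capture_output=True, text=True)`. Since exit 1 on a low-signal document still emits valid JSON, it is probably better not to pass `check=True`.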

For agents and automation

extract-cli answers the suite's discovery contract: extract --catalog json returns {name, bin, version, description, commands[], exitCodes} so an agent learns every command and flag at startup instead of hardcoding them. The output shape is locked by a JSON Schema (extract schema, committed at docs/spec/) with a semver commitment: a backward-incompatible change requires a major version bump. The full agent contract — output envelope, exit codes, discovery, and failure → recovery — lives in AGENTS.md.
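
A startup check against the discovery contract could be as small as the following. It is a sketch under assumptions: only the six top-level keys named above come from the source; `missing_catalog_keys` is a hypothetical helper, and the sample payload (including `exitCodes` being an object) is invented.

```python
import json

# Top-level keys the catalog is documented to return.
CATALOG_KEYS = ("name", "bin", "version", "description", "commands", "exitCodes")


def missing_catalog_keys(catalog: dict) -> list:
    """List documented top-level keys that are absent from a catalog payload."""
    return [key for key in CATALOG_KEYS if key not in catalog]


# Invented payload standing in for the output of `extract --catalog json`.
catalog = json.loads("""
{
  "name": "extract-cli",
  "bin": "extract",
  "version": "0.1.17",
  "description": "Ingest any contract and get structured JSON.",
  "commands": [],
  "exitCodes": {}
}
""")

print(missing_catalog_keys(catalog))  # → []
print(missing_catalog_keys({"name": "extract-cli"}))
```

An agent could refuse to proceed, or fall back to `extract --help`-style discovery, when the returned list is non-empty.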

recommended agent defaults
extract --catalog json            # discover commands/flags (call at startup)
extract schema                    # the output JSON Schema (the data contract)
extract counterparty.pdf | jq -e '.clauses | all(.confidence > 0.7)'   # gate on confidence
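
The confidence gate on the last line can also be expressed in Python, which makes it easier to report which clauses failed. A minimal sketch, assuming each entry in clauses carries a numeric confidence as the jq filter implies; `low_confidence_clauses` and the sample data are hypothetical.

```python
def low_confidence_clauses(doc: dict, threshold: float = 0.7) -> list:
    """Return titles of clauses whose confidence is not above the threshold.

    An empty list means the gate passes, matching jq's all(.confidence > 0.7),
    which is also true for an empty clause array.
    """
    return [
        clause.get("canonical_title", "<untitled>")
        for clause in doc.get("clauses", [])
        if clause.get("confidence", 0.0) <= threshold
    ]


# Invented sample standing in for the CLI's JSON output.
doc = {
    "clauses": [
        {"canonical_title": "Confidentiality", "confidence": 0.95},
        {"canonical_title": "Term", "confidence": 0.55},
    ]
}

failing = low_confidence_clauses(doc)
print(failing)  # → ['Term']
```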

The integration contract is the output schema plus the shared canonical clause vocabulary — extract's canonical_title values are the same names template-vault-cli detects and nda-review-cli keys policy on, so a foreign document's clauses line up with the suite's with no bespoke adapter. See docs/INTEROP.md. The opt-in --llm tier reads the suite-wide ~/.config/contract-ops/llm.json, so configuring it once lights up LLM features across every tool that adopts the same lookup order.
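
Because both sides share canonical_title, lining a foreign document up against a template reduces to a key lookup. The sketch below assumes clause entries carry canonical_title and mapped as described earlier; the `title` key for the original heading, the `align` helper, and the sample data are hypothetical and may not match the real schema.

```python
def align(foreign_clauses: list, template_titles: set) -> dict:
    """Bucket foreign clauses by how they relate to a template's canonical titles."""
    shared, extra, unmapped = [], [], []
    for clause in foreign_clauses:
        if not clause.get("mapped", False):
            unmapped.append(clause["title"])  # kept, but has no canonical name
        elif clause["canonical_title"] in template_titles:
            shared.append(clause["canonical_title"])
        else:
            extra.append(clause["canonical_title"])
    # Template clauses the foreign document does not cover.
    missing = sorted(template_titles - set(shared))
    return {"shared": shared, "extra": extra, "unmapped": unmapped, "missing": missing}


# Invented sample data for illustration.
foreign = [
    {"title": "SECTION 7. NON-DISCLOSURE", "canonical_title": "Confidentiality", "mapped": True},
    {"title": "SECTION 9. GOVERNING LAW", "canonical_title": "Governing Law", "mapped": True},
    {"title": "SECTION 12. ESCROW", "canonical_title": None, "mapped": False},
]
template = {"Confidentiality", "Term"}

result = align(foreign, template)
print(result)
```

The unmapped bucket mirrors the suite's rule that nothing is silently dropped: clauses without a canonical name stay visible for a human to review.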

Where it fits in the workflow

extract-cli is the open-loop front door at the very start of the nine-CLI workflow. The closed loop (template-vault → draft → nda-review → compare → docx2pdf → sign) only handles documents authored from its own templates. Foreign paper gets in through extract-cli: turn a counterparty's contract into the suite's structured vocabulary, then hand it to nda-review for a policy verdict or to compare to align it against your canonical template. See the full workflow for the chained commands.

Repo

github.com/DrBaher/extract-cli · MIT licensed · Stdlib-only Python ([docx]/[pdf] extras add higher-fidelity readers) · PyPI-installable.