Evaluation-led agentic AI for operations teams

Ship AI automations you can actually measure.

We help professional services and operations teams turn repetitive workflows into measurable agentic AI automations. Every engagement starts with a paid assessment, runs against an evaluation dataset, and ships with a governance-ready baseline you can defend to a partner, a CFO, or an auditor.

Paid assessment first

$5,000 · 10 business days · yours to keep whether or not you continue with us.

Evaluation gates

Domain eval dataset authored before any code. Pass threshold required before real users.

Governance built in

Audit trails, prompt versioning, cost budgets, and human approval gates on every agentic workflow.

Monthly Impact Report

Hours returned, cost per task, quality scores — the numbers, in writing, every month.

acme · eval · northwind-rag-v2

$ run eval --dataset northwind-golden-v3

▸ Loading 47 golden examples…

▸ Running faithfulness scorer…

▸ Running hallucination probe…

✓ faithfulness 0.94 / 0.90 threshold

✓ hallucination 1.1% / 2.0% max

✓ citation_cov 0.97

EVAL GATE PASSED · cleared to ship
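As an illustration of what a gate like the one in the sample run checks, here is a minimal sketch in Python. It is not the actual tooling: `Threshold`, `check_gate`, and the metric keys are hypothetical names chosen for this example. The scores and limits are the ones shown in the sample run, and the 0.90 citation floor is assumed from the sample eval_config.yaml further down the page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    limit: float
    higher_is_better: bool  # True: score must be >= limit; False: score must be <= limit


def check_gate(scores: dict[str, float], thresholds: dict[str, Threshold]) -> dict[str, bool]:
    """Return pass/fail per metric; a missing score counts as a failure."""
    results: dict[str, bool] = {}
    for name, threshold in thresholds.items():
        score = scores.get(name)
        if score is None:
            results[name] = False
        elif threshold.higher_is_better:
            results[name] = score >= threshold.limit
        else:
            results[name] = score <= threshold.limit
    return results


# Limits and scores copied from the sample run (hypothetical metric keys).
thresholds = {
    "faithfulness": Threshold(0.90, higher_is_better=True),
    "hallucination": Threshold(0.02, higher_is_better=False),
    "citation_cov": Threshold(0.90, higher_is_better=True),
}
scores = {"faithfulness": 0.94, "hallucination": 0.011, "citation_cov": 0.97}

results = check_gate(scores, thresholds)
gate_passed = all(results.values())  # the gate passes only if every metric passes
print("EVAL GATE PASSED" if gate_passed else "EVAL GATE FAILED")
```

The design choice worth noting is that a metric with no recorded score fails the gate rather than being skipped, so a scorer that silently did not run cannot clear a pilot to ship.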

Apply for an assessment → · Chat with our AI Companion · Book a discovery call · See the methodology end-to-end

New firm, transparent positioning: we publish our methodology and an end-to-end walkthrough on a sample engagement instead of dressed-up case study metrics. See /trust →

Status quo

What you’re probably doing now

✕ Manual document review and classification
✕ Inconsistent decisions across staff
✕ ChatGPT usage no one can audit
✕ Bottlenecked approvals and status updates
✕ AI pilots that never got to a measurable outcome

Our approach

What working with us looks like

✓ Assessment first: scorecard, opportunity register, ROI model
✓ Eval dataset written before any code, gates on every pilot
✓ RAG with citations, not answer-only black boxes
✓ Approval workflows with an audit trail and a human in the loop
✓ Governance-ready: prompt versioning, cost budgets, loop guards on every agentic workflow
✓ Monthly Impact Report with the real numbers

Our commitments, in numbers

10 days

Assessment turnaround

Fixed scope, no surprises

≥90%

Faithfulness threshold

On every RAG pilot

100%

Pilots with golden evals

Written before any code

30 days

First measurable outcome

Or we say so upfront

Service catalog

Fixed scope. Fixed price. Fixed timeline.

Every pilot ships with an eval dataset and a written acceptance bar — not a vibes-based demo.

Assess

Fixed-scope, no commitment beyond the assessment itself

AI Readiness Assessment

$5,000 · 10 business days

Scorecard across 7 readiness dimensions (data, workflows, risk, talent, governance, infrastructure, sponsorship). Opportunity register with ROI estimates. 1–6 month roadmap you can act on with or without us.

Learn more →

Healthcare Scoping & BAA Kit

$10,000 · 14 business days

BAA-aware scoping for HIPAA-regulated workflows: data-handling profile, allowlisted models, redaction posture, and an implementation-ready compliance pack.

Learn more →

Evaluation & Red-Team Audit

$15,000 · 21 business days

Independent eval of an existing AI system: golden dataset, jailbreak / prompt-injection / PII probes, and a remediation plan with regression gates.

Learn more →

Build

Production pilots with eval gates and written acceptance bars

Voice Intake Pilot

$20,000 · 4 weeks

Structured intake from inbound calls. Transcripts, extracted fields, and auto-created tickets with human review.

Learn more →

Document Intelligence Pilot

$25,000 · 4 weeks

RAG-backed assistant over your own documents. Eval dataset, faithfulness/precision gates, and citations on every answer.

Learn more →

Decision Support Pilot

$30,000 · 5 weeks

Source-backed recommendations and executive briefings from structured + document data. Every output traces back to its source.

Learn more →

Support Automation Pilot

$30,000 · 5 weeks

Tier-1 deflection with operator-assisted routing. Containment, escalation accuracy, and cost-per-ticket reported weekly.

Learn more →

Workflow Automation Pilot

$35,000 · 6 weeks

Multi-step state machine with approval gates and integrations. Replaces a repeating ops workflow end-to-end.

Learn more →

Multi-Agent Workflow Pilot

$60,000 · 8 weeks

Supervisor-routed multi-agent system with eval gates, cost budgets, and observable handoffs. For work a single workflow can't bound.

Learn more →

Operate

Monthly retainers to keep your workflows performing

Ops Retainer — Small

$5,000/mo · Monthly

Light advisory, monitoring, monthly impact report. Right-sized for a single live workflow under steady load.

Learn more →

Ops Retainer — Mid

$10,000/mo · Monthly

SLA-tier support, optimization, quarterly business review materials. For multiple workflows or higher-stakes adoption.

Learn more →

Ops Retainer — Large

$20,000/mo · Monthly

Priority response, multiple workflows, regulated or higher-stakes support with named on-call.

Learn more →

Process

How an engagement works

Assess

10-day paid assessment. Scorecard across 7 readiness dimensions, opportunity register with ROI, 1–6 month roadmap. Yours to keep whether or not you continue with us.

Build & prove

Fixed-scope pilot with an eval dataset authored on day one. We ship a measurable baseline in 30 days behind evaluation gates.

Operate & grow

Ops retainer runs the automation for you. Monthly Impact Report with quality, cost, and hours-returned numbers.

→ See the full 10-step method

Under the hood

We use the same tools we build for clients

Our own delivery system runs on evaluation datasets, Langfuse tracing, and prompt versioning. Every engagement is measured the same way — because we built the measurement layer first.

✓ Eval gates on every pilot before it ships
✓ Langfuse tracing across all LLM and embedding calls
✓ Prompt versioning with per-version cost attribution
✓ Human approval gates on agentic workflows

See our Trust Center →

eval_config.yaml

eval:
  dataset: northwind-golden-v3
  faithfulness_threshold: 0.90
  hallucination_max: 0.02
  human_review_gate: required
  citation_coverage_min: 0.90
governance:
  prompt_versioning: enabled
  loop_max_iterations: 12
  cost_budget_usd: 50
  audit_trail: required
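To make the governance keys in the sample config concrete, here is a minimal Python sketch of a loop guard and cost budget, assuming the limits apply per agent run. `RunGuard`, `GuardTripped`, and `record_step` are hypothetical names for illustration, not the delivery system's API. The defaults mirror `loop_max_iterations: 12` and `cost_budget_usd: 50` from the sample.

```python
class GuardTripped(RuntimeError):
    """Raised when an agent run exceeds a configured governance limit."""


class RunGuard:
    """Tracks iterations and spend for one agent run and enforces both limits."""

    def __init__(self, loop_max_iterations: int = 12, cost_budget_usd: float = 50.0):
        self.loop_max_iterations = loop_max_iterations
        self.cost_budget_usd = cost_budget_usd
        self.iterations = 0
        self.spent_usd = 0.0
        self.audit_trail: list[dict] = []  # one entry per recorded step

    def record_step(self, step_cost_usd: float, note: str = "") -> None:
        """Account for one agent step, log it, then check the limits."""
        self.iterations += 1
        self.spent_usd += step_cost_usd
        self.audit_trail.append(
            {"iteration": self.iterations, "cost_usd": step_cost_usd, "note": note}
        )
        if self.iterations > self.loop_max_iterations:
            raise GuardTripped(f"loop limit of {self.loop_max_iterations} iterations exceeded")
        if self.spent_usd > self.cost_budget_usd:
            raise GuardTripped(f"cost budget of ${self.cost_budget_usd:.2f} exceeded")


# Usage: twelve steps are allowed; a thirteenth trips the loop guard.
guard = RunGuard()
for _ in range(12):
    guard.record_step(step_cost_usd=0.25, note="tool call")
try:
    guard.record_step(step_cost_usd=0.25, note="tool call")
    loop_guard_tripped = False
except GuardTripped:
    loop_guard_tripped = True
```

In this sketch the step is written to the audit trail before the limits are checked, so the step that trips a guard is still recorded.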

Industries we serve

Specialist profiles for regulated industries

Our general delivery profile covers most operations teams. For regulated industries with specific compliance requirements, we offer dedicated profiles with the controls a CISO or compliance lead needs to sign off.

Profile available

Legal services

Contract review, matter summarisation, legal research assistance, and compliance-aware RAG — with citation gates and human-in-the-loop approval on any client-facing output.

See the profile →

Profile available

Healthcare operations

BAA-ready, PHI-aware delivery for healthcare workflows that require HIPAA-aligned data handling, allowlisted models, and documented incident response.

See the profile →

Profile available

Financial services

Stronger audit controls, access governance, and retention policies for finance and receivables workflows — AP/AR automation, decision support, and compliance document processing.

See the profile →

About the firm

Built by practitioners, for operations teams

We publish our methodology, pricing, and acceptance criteria upfront — so you can evaluate how we work before any conversation about money. No six-week sales cycle to get a number.

Our approach comes from direct experience building AI systems in operations-intensive environments, where the question is never “can AI do this?” but “how do we prove it’s working and keep it that way?”

About the firm →

Why buyers choose a specialist new firm

✓ Direct access to the practitioner building your system, not a junior subcontractor behind a partner relationship

✓ Pricing published upfront: no six-week sales cycle to get a number

✓ Written acceptance criteria on every pilot: the accountability that larger generalist firms resist because it limits scope expansion

For consulting firms

New firm sign-up

This platform is for consulting firms that want their own branded workspace, not for potential clients. If you're looking to work with a consulting firm, use the intake form or chat with our Companion above.

No credit card · 14-day free trial · your own subdomain and branding

Create your firm's workspace →

Already have a workspace? Sign in

Next step

Start with a paid AI Readiness Assessment

10 business days. $5,000. A scorecard, a prioritized roadmap, and a clear next step — regardless of whether you continue with us.

Already have a stalled AI pilot? The assessment diagnoses what went wrong and what it would take to ship.

Apply for an assessment → · Chat with our AI Companion · Book a discovery call