2026 - Colabra

Evals are product strategy

A product-led view of AI evals: define what good means for the user, the workflow, the evidence, the launch gate, and the business.

In August 2023, Colabra's AI assistant had the problem every useful LLM product eventually has.

It worked well enough to be interesting. It also made things up.

The fix that mattered most at that stage was simple: never answer with information outside the provided context. That one instruction sounds small. It changed the product promise. The assistant was no longer trying to be impressive. It was trying to be grounded.

That is the moment I started thinking about evals differently.

An eval is a product promise written in measurable form.

It tells the company what the product is allowed to do, what failure costs, and what "good" means in the user's world.

The wrong eval makes the wrong product better

If you evaluate a diligence product by whether the answer sounds polished, you will build a polished liability.

The answer may be fluent. The flagged issue may be unsupported. The citation may point to the wrong clause. The model may miss the most important document. The user may accept the output without opening the source. The product may create speed and false confidence at the same time.

That is the trap.

AI teams often start by asking if the model is good. Product teams need to ask if the workflow is safer, faster, and more useful when the model participates.

Those are different questions.

The eval stack starts with the job

For a diligence workflow, I would evaluate five layers.

First, task success. Did the product answer the request-list question, classify the document, identify the missing file, or produce the issue row the user needed?

Second, evidence quality. Did the citation support the claim? Did the answer distinguish source evidence from inference? Could a reviewer open the document and understand why the AI said what it said?

Third, risk calibration. Did the system catch material issues without calling every standard clause a red flag? A product that treats boilerplate as catastrophic teaches users to ignore it.

Fourth, workflow quality. Did the output fit the way legal, finance, HR, IT, tax, and corp-dev already review a deal? Did it reduce time to first useful issue list? Did it preserve review state and specialist ownership?

Fifth, operating economics. Did the system complete the job at a cost, latency, and failure rate the business can support?

That is the eval. The model score is only one input.

Evals decide what you can safely launch

The launch question in AI should rarely be "is the model ready?"

The better question is "which part of the workflow is ready for this level of automation?"

Document classification may be safe to launch earlier. Request-list mapping may be safe with review. Material risk conclusions need citations, human ownership, and tighter thresholds. Anything that could create legal, financial, privacy, or reputational harm needs a higher bar.

That risk-tiered launch plan is product strategy. It lets the team move without pretending every task has the same failure cost.

The fastest path is often a narrower launch with a real trust model.
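
A risk-tiered launch plan can be written down as data rather than left as a slide. The sketch below is one way to do that, not a prescribed implementation: the task names mirror the examples above, while `LaunchGate`, `min_eval_score`, and every threshold value are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchGate:
    """Hypothetical launch requirements for one workflow task."""
    task: str
    min_eval_score: float        # placeholder threshold, tune per product
    requires_human_review: bool  # must a named reviewer own the output?


# Illustrative tiers only: higher failure cost means a higher bar.
GATES = {
    "document_classification": LaunchGate("document_classification", 0.90, False),
    "request_list_mapping": LaunchGate("request_list_mapping", 0.95, True),
    "material_risk_conclusion": LaunchGate("material_risk_conclusion", 0.99, True),
}


def ready_to_launch(task: str, eval_score: float, has_reviewer: bool) -> bool:
    """Return True only if the task clears the bar for its own tier."""
    gate = GATES[task]
    if eval_score < gate.min_eval_score:
        return False
    if gate.requires_human_review and not has_reviewer:
        return False
    return True
```

With these placeholder numbers, the same eval score of 0.92 would clear document classification and block a material risk conclusion, which is the point of tiering by failure cost.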

The best evals make engineers faster

Weak evals produce arguments. Strong evals produce hill-climbing.

"The answer was bad" is not useful.

"The retrieval step missed the amendment, the citation pointed to the MSA instead of the SOW, and the severity label over-weighted standard vendor language" is useful.

That distinction matters because AI product failures usually sit in layers. Retrieval can fail. Classification can fail. Extraction can fail. The prompt can fail. The model can over-infer. The output format can fail. The UX can fail by making unsupported claims look too final.

Engineers need error categories they can act on. Product needs those same categories because they reveal where the user is losing trust.
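
One lightweight way to get actionable categories is to have reviewers tag each failed case with the layer that broke, then count. The sketch below assumes a hypothetical review log of `(case_id, layer)` pairs; the `FailureLayer` names follow the layers listed above, and the sample cases are invented.

```python
from collections import Counter
from enum import Enum


class FailureLayer(Enum):
    """Layers where an AI workflow can fail, as listed in the text."""
    RETRIEVAL = "retrieval"
    CLASSIFICATION = "classification"
    EXTRACTION = "extraction"
    PROMPT = "prompt"
    OVER_INFERENCE = "over_inference"
    OUTPUT_FORMAT = "output_format"
    UX = "ux"


def failures_by_layer(reviews):
    """Count labeled failures per layer, most frequent first."""
    counts = Counter(layer for _case_id, layer in reviews)
    return counts.most_common()


# Hypothetical review log: one case can fail in more than one layer.
reviews = [
    ("case-1", FailureLayer.RETRIEVAL),
    ("case-1", FailureLayer.EXTRACTION),
    ("case-2", FailureLayer.RETRIEVAL),
    ("case-3", FailureLayer.OVER_INFERENCE),
]

for layer, count in failures_by_layer(reviews):
    print(f"{layer.value}: {count}")
```

The sorted tally gives engineering a place to start and gives product a view of where trust is being lost.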

Adoption is also an eval

A product team can fool itself with model metrics and still lose the customer.

If users do not come back after the first room, the product failed. If specialists reject most AI-generated issues, the product failed. If reviewers accept high-risk conclusions without opening sources, the product may be creating a hidden safety problem. If buyers like the demo and decline the pilot, the proof design failed.

This is why I would track both model behavior and adoption behavior:

citation accuracy
unsupported-claim rate
critical issue recall
false-critical flag rate
specialist override rate
correction rate
time to first useful issue list
repeat usage by workflow
blind acceptance of high-risk outputs
cost per successful task

Those metrics belong together. They describe whether the product is becoming trusted work or just generating text.
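
To make "belong together" concrete, here is a minimal sketch that computes four of the metrics above from one shared record per reviewed output. The `TaskRecord` fields are assumptions about what a review log might capture, not a schema from any real system, and "blind acceptance" is defined here as a high-risk output accepted without the reviewer opening the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    """One reviewed AI output. All field names are hypothetical."""
    citations_checked: int
    citations_correct: int
    claims: int
    unsupported_claims: int
    succeeded: bool      # did the task produce what the user needed?
    cost_usd: float
    high_risk: bool
    accepted: bool       # did the reviewer accept the output?
    source_opened: bool  # did the reviewer open the cited source?


def rate(numerator: float, denominator: float) -> Optional[float]:
    """Safe division: None when there is nothing to measure."""
    return numerator / denominator if denominator else None


def summarize(records):
    """Model-behavior and adoption-behavior metrics from the same log."""
    accepted_high_risk = [r for r in records if r.high_risk and r.accepted]
    return {
        "citation_accuracy": rate(
            sum(r.citations_correct for r in records),
            sum(r.citations_checked for r in records),
        ),
        "unsupported_claim_rate": rate(
            sum(r.unsupported_claims for r in records),
            sum(r.claims for r in records),
        ),
        "blind_acceptance_rate": rate(
            sum(1 for r in accepted_high_risk if not r.source_opened),
            len(accepted_high_risk),
        ),
        "cost_per_successful_task": rate(
            sum(r.cost_usd for r in records),
            sum(1 for r in records if r.succeeded),
        ),
    }
```

Cost is divided by successful tasks only, so failed runs still count against the economics instead of disappearing from the average.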

Evidence calibration is an internal eval

There is another eval that founders do not talk about enough: can the team classify its own customer evidence?

A meeting can validate pain. A pilot can validate workflow. A pricing discussion can validate packaging. Procurement can validate seriousness and also slow everything down. An internal-build conversation can validate the category while threatening the startup. A lost paid proof can reveal the benchmark the product failed to beat.

Those are all useful. They are not the same.

If a team flattens every promising conversation into "traction," it is failing its own eval. The product may look stronger in the update and weaker in reality.

I wrote more about that in "Stop Calling Every Customer Conversation Traction." The principle belongs here too: measurement applies to company narratives as much as to model outputs.

What I would put in the PRD

For any serious AI feature, I would include an eval section before the launch plan.

It should answer:

What user decision does this feature improve?
What source evidence should support the output?
What failure modes are severe enough to block launch?
Which failures can be corrected by a reviewer?
What regression set will survive model upgrades?
What production signals tell us the workflow is earning trust?
What cost and latency thresholds make the product viable?

That section is not paperwork. It is the product.

The eval defines the promise. The product either keeps it or it does not.