AI agent vendor evaluation framework

Seven points every AI agent vendor should be measured against before procurement. Built around independent quality verification (AQO), benchmarked pricing (WLI), and the security and data-sovereignty terms that determine whether a pilot survives legal review.

— the seven points —

1. price transparency
2. quality verification
3. outcome measurement
4. methodology disclosure
5. security and soc 2
6. refund and sla policy
7. data sovereignty

price transparency

A vendor that will not publish a per-unit price, stated in the task’s own unit of work (per resolution, per PR, per document), is not ready for procurement. "Contact sales" tiers, undisclosed pass-through costs, and unbounded annual escalators each disqualify a vendor at this stage. Compare the vendor’s per-unit price against the published WorkForce Labor Index (WLI) rate for the category to see whether you are paying a market rate, a discount, or a premium.
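The comparison can be sketched in a few lines of Python. This is a minimal illustration, not WorkForce tooling: the `MarketRate` shape, the `classify_price` helper, and every figure are hypothetical, and the sketch assumes the published rate is available as a median with a confidence interval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarketRate:
    """Hypothetical shape for a published per-unit rate (not the WLI's actual schema)."""
    median: float
    ci_low: float
    ci_high: float


def classify_price(vendor_price: float, rate: MarketRate) -> str:
    """Label a vendor's per-unit price relative to the published interval."""
    if vendor_price < rate.ci_low:
        return "discount"
    if vendor_price > rate.ci_high:
        return "premium"
    return "market"


# Illustrative numbers only; substitute the published rate for your category.
rate = MarketRate(median=1.20, ci_low=1.00, ci_high=1.45)
for price in (0.80, 1.20, 2.00):
    print(price, classify_price(price, rate))
```

Treating any price inside the interval as "market" is a choice made for this sketch; a buyer could just as well define its own tolerance band around the median.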

→ wli market rates → ai agent cost calculator

quality verification

Vendor-published quality claims are not evidence. Require an independent AQO score from the current eval bank version, or a sealed holdout evaluation run by the buyer during the pilot. If the vendor will not submit to an independent quality measurement, the quality claim should be discarded.

→ aqo definition → free eval

outcome measurement

How will outcomes be measured in production — and who owns the measurement infrastructure? A vendor that measures its own quality and reports it to the buyer holds both ends of the contract. The buyer must own (or independently verify) the measurement pipeline for the duration of the agreement.

→ methodology v1.0

methodology disclosure

Does the vendor publish how the agent works, what data it was trained on, how outputs are evaluated, and what failure modes are known? Black-box pitches should be downweighted regardless of demo quality. A vendor that cannot disclose methodology cannot be benchmarked, cannot be audited, and cannot be switched out without re-discovering all of these properties on the next vendor.

→ workforce methodology

security and soc 2

Require a current SOC 2 Type II report, plus ISO 27001 / ISO 27701 certification where applicable. Ask for the sub-processor list and its change-notification window, the incident response window stated in hours, and support for customer-managed keys, SSO/SAML, RBAC, and audit logs. For regulated industries (finance, healthcare, legal), add the relevant vertical attestations.

refund and sla policy

Output quality belongs in the SLA, as a percentile target rather than an average. Specify an AQO floor for the contract term and the remedy mechanics: service credits, refund, cure window, termination right. Specify whether failed per-unit outputs are charged. A vendor unwilling to put quality into the SLA is selling an experiment.
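Why a percentile target and not an average: a small share of bad outputs can hide behind a healthy mean. The sketch below assumes per-output quality scores on a 0 to 100 scale and a contract floor of 85; both the scale and the floor are illustrative assumptions, and `percentile_nearest_rank` is a hypothetical helper using the nearest-rank method.

```python
import math
from statistics import mean


def percentile_nearest_rank(scores, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative data: 17 good outputs and 3 poor ones.
scores = [95] * 17 + [40] * 3
floor = 85  # assumed contract floor, for illustration only

average_ok = mean(scores) >= floor
p10_ok = percentile_nearest_rank(scores, 10) >= floor

print("mean meets floor:", average_ok)
print("10th percentile meets floor:", p10_ok)
```

With this data the average clears the floor while the 10th percentile does not, which is the gap an average-based SLA would leave unremedied.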

→ rfp template (sla schedule)

data sovereignty

Where (region, country) is customer data stored, processed, and backed up? Can it be pinned to a single region? What is the default training-on-customer-data posture, and is the override contractually enforceable? How long is data retained, in what form, and how is deletion verified? Data sovereignty is the single most common source of late-stage procurement rejection.

this framework is benchmarked against the workforce labor index

Point 1 (price transparency) is meaningless without a published market rate to compare against. The WLI publishes per-category, transaction-anchored rates with confidence intervals. Point 2 (quality verification) requires an independent AQO score, computed against the WorkForce eval bank.

→ workforce labor index → wli methodology v1.0 → aqo definition → ai agent cost calculator → rfp template

— faq —

questions about evaluating ai vendors

What are the most important criteria for evaluating AI agent vendors?

Price transparency, independent quality verification (AQO score), outcome measurement ownership, methodology disclosure, SOC 2 Type II, refund and SLA mechanics, and data sovereignty. The WorkForce framework treats these as seven non-negotiable points — a vendor that fails any one of them is not ready for production.

How do we benchmark vendor pricing?

For categories with a published WorkForce Labor Index rate, compare the vendor’s per-unit price (per resolution, per PR, per document) against the published median and confidence interval at /wli/[category]. Material discounts or premiums must be justified by AQO score — a cheap agent with a poor AQO is more expensive per quality-adjusted unit than a premium agent with a strong AQO.
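One way to make "price per quality-adjusted unit" concrete is to divide the per-unit price by the quality score. The source does not specify a formula, so this is an assumption: the sketch treats AQO as a fraction between 0 and 1, and `quality_adjusted_price` and all numbers are hypothetical.

```python
def quality_adjusted_price(price_per_unit: float, aqo: float) -> float:
    """Price per unit divided by quality, with AQO assumed to be a fraction in (0, 1]."""
    if not 0 < aqo <= 1:
        raise ValueError("aqo must be in the interval (0, 1]")
    return price_per_unit / aqo


# Illustrative vendors: a cheap agent with a poor score, a premium agent with a strong one.
cheap = quality_adjusted_price(0.60, 0.40)
premium = quality_adjusted_price(1.10, 0.92)

print("cheap agent costs more per quality-adjusted unit:", cheap > premium)
```

Under these assumed figures the cheaper agent ends up costing more per quality-adjusted unit, which matches the point made above.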

Are vendor-published quality scores acceptable?

No. Vendor-published quality is a claim, not evidence. Require an AQO score from an independent eval bank (the WorkForce free eval is one option) or reserve the right to run a sealed holdout during the pilot. If a vendor will not submit to independent measurement, treat the quality claim as unverified.

What is AQO?

AQO (Agent Quality of Output) is a quality score for AI labor, defined by WorkForce and computed against a sealed, versioned eval bank per task category. See /methodology/aqo for the full definition.