the adjusted quality-outcome (aqo): a transaction-anchored, quality-normalized valuation metric for the outputs of artificial and human labor
raw price-per-task is a quality-blind number: a cheap output that fails is not cheaper than an expensive output that works. we present the adjusted quality-outcome (aqo), the transaction-level metric that the workforce labor index aggregates into its published figures. for a transaction j, aqo is the effective cost divided by the product of a provider eval score and a realized outcome score — the currency required per quality-adjusted unit of output. the eval score is supplied by a sealed, versioned task bank scored by three reviewers at cohen’s κ ≥ 0.85; the outcome score is the per-transaction realization. low quality raises aqo; failures leave the headline and enter a separate failure sub-index.
transaction-level aqo is aggregated to a headline category figure by a volume-weighted median over a rolling window, stabilized at the provider level by bayesian shrinkage toward the category prior, admitted under a four-tier data-quality scheme, bounded by a bootstrap confidence interval with explicit status thresholds, and normalized across capability cohorts so that successive model generations do not contaminate one another. the metric is defined only over commodifiable outputs; a four-tier measurability scope places human judgment permanently out of index. every figure in this document is illustrative sample or test-fixture data and is labeled as such; the methodology, rubric, sealed banks, and reference calculator are released under cc-by-4.0.
1. introduction
the market for artificial-labor outputs is priced today in raw dollars per task. that number is quality-blind: it cannot distinguish a $2 customer-support reply that resolves the ticket from a $2 reply that escalates it, nor a $40 brief that wins from a $40 brief that is rejected. a benchmark that anchors on raw price inherits this blindness and rewards the cheapest failure.
the adjusted quality-outcome (aqo) is the unit that removes it. aqo asks not “what did this output cost” but “what did a unit of working output cost,” by dividing transaction cost by two multiplicative quality factors — an independent eval score and a realized outcome score. the construction mirrors the kelley blue book move from sticker price to condition-adjusted value[1]: the headline is a quality-normalized price, not a list price.
aqo is the per-transaction substrate of the workforce labor index (wli)[2]; this paper is the companion to the wli methodology[2] and specifies the metric the index aggregates.
1.1 contributions
this document makes four contributions: (i) a transaction-level, quality-normalized valuation metric with explicit handling of failure and undefined cases; (ii) a volume-weighted headline estimator stabilized by bayesian shrinkage toward a category prior; (iii) a capability-cohort normalization that prevents successive model generations from contaminating a single figure; and (iv) a four-tier measurability scope, a data-tier admission scheme, and a public correction protocol that together bound what aqo may and may not claim.
1.2 foundations
the institution that publishes aqo is organized under the interpretable context methodology (icm), in which folder structure is the agentic architecture[9]. the full icm paper — van clief & mcdermott (2026) — is the architectural substrate beneath the eval, estimation, and audit machinery described here, and is available in full below.
2. notation
table 1 fixes the symbols used throughout. subscript j indexes a single admitted transaction; c indexes a task category; i (or π) indexes a provider; w indexes a weekly close.
| symbol | meaning | range / unit |
|---|---|---|
| $c_j$ | effective normalized cost of transaction $j$, deflated to the 2026 baseline | usd |
| $e_{\pi,c}$ | provider eval score in category $c$ (sealed bank) | 0–1 |
| $s_j$ | realized outcome score for transaction $j$ | 0–1 |
| $\mathrm{AQO}_j$ | adjusted quality-outcome of transaction $j$ | usd / quality-adj. unit |
| $v_j$ | output-unit volume of transaction $j$ | count |
| $\tau_j$ | tier credibility weight (table 2) | 0–1 |
| $\mathrm{HAQO}_c$ | headline category aqo | usd / quality-adj. unit |
| $\kappa_c$ | shrinkage prior strength (default 30) | transactions |
| $n_i$ | provider attested transaction count | count |
| $\mathrm{CapIdx}_C$ | capability-cohort index (baseline 1.0 at 2026-01-01) | ratio |
3. the aqo metric
the adjusted quality-outcome of a single admitted transaction is its effective cost divided by the product of the provider’s eval score and the transaction’s realized outcome score:

$$\mathrm{AQO}_j = \frac{c_j}{e_{\pi,c}\, s_j} \tag{1}$$
the denominator is a quality multiplier in [0,1]. perfect quality (e = s = 1) leaves aqo equal to raw cost; any shortfall in either factor inflates aqo above cost, expressing the true price of a working unit. the unit of aqo is therefore currency per quality-adjusted unit of output.
3.1 undefined and failure cases
if the denominator is zero, aqo is undefined; it is not reported as a large finite number. a failed transaction ($s_j = 0$) is routed to the failure-rate sub-index, not to the headline; this prevents a division-by-zero artifact from masquerading as an expensive success and keeps the failure signal separate and visible.
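a minimal sketch of equation (1) together with the routing rule above, in python. the function names `aqo` and `route` and the destination labels are hypothetical, not taken from the reference calculator. the text does not say where a transaction with a zero eval score but a non-zero outcome goes, so the `undefined` label is an assumption.

```python
from typing import Optional, Tuple


def aqo(cost: float, eval_score: float, outcome_score: float) -> Optional[float]:
    """Equation (1): cost / (eval_score * outcome_score), or None when undefined."""
    for name, score in (("eval_score", eval_score), ("outcome_score", outcome_score)):
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {score}")
    denominator = eval_score * outcome_score
    if denominator == 0.0:
        return None  # undefined: never replaced by a large finite number
    return cost / denominator


def route(cost: float, eval_score: float, outcome_score: float) -> Tuple[str, Optional[float]]:
    """Return (destination, value) for one admitted transaction."""
    if outcome_score == 0.0:
        # Failed job: counted in the failure-rate sub-index, not priced.
        return ("failure_sub_index", None)
    value = aqo(cost, eval_score, outcome_score)
    if value is None:
        # eval_score == 0 with a non-zero outcome; this label is an assumption.
        return ("undefined", None)
    return ("headline", value)


print(route(10.0, 0.5, 1.0))  # ('headline', 20.0)
print(route(10.0, 0.8, 0.0))  # ('failure_sub_index', None)
```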
3.2 why two multiplicative factors
the eval score e is a provider-level capability measured independently on a sealed bank; the outcome score s is a transaction-level realization on the actual job. a provider can be broadly capable (high e) yet miss a specific job (low s), or vice versa. multiplying them, rather than averaging, means either failing alone is sufficient to inflate the cost of a working unit — the conservative behavior a price benchmark should have.
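the difference between multiplying and averaging shows up in a single illustrative transaction. the sketch below assumes a capable provider (e = 0.9) that misses a specific job (s = 0.1); the averaging variant is shown only for contrast and is not part of the methodology.

```python
def aqo_product(cost: float, e: float, s: float) -> float:
    """Quality multiplier is the product e * s, as in the metric definition."""
    return cost / (e * s)


def aqo_average(cost: float, e: float, s: float) -> float:
    """Contrast case only: quality multiplier is the mean of e and s."""
    return cost / ((e + s) / 2.0)


cost, e, s = 10.0, 0.9, 0.1  # illustrative values, not index data

print(round(aqo_product(cost, e, s), 2))  # 111.11
print(round(aqo_average(cost, e, s), 2))  # 20.0
```

under averaging, the strong eval score masks most of the missed job; under multiplication, the missed job dominates the price of a working unit.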
4. the eval score
the eval score $e_{\pi,c}$ is what makes aqo quality-normalized rather than self-reported. it is produced by the workforce eval: a sealed, versioned task bank for category $c$, scored by three independent reviewers whose agreement must reach cohen’s κ ≥ 0.85[5] for the result to be admitted. the score is per-provider, per-category, and re-estimated monthly for ai providers and quarterly for human providers.
because the bank is sealed and versioned, an eval score is reproducible: the same provider, the same bank version, and the same rubric reproduce the score within review tolerance. the eval is also the admission gate for transaction data (§7) and the surface from which any builder obtains a verified aqo score and badge.
5. headline aggregation
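the headline category figure for category c at time t is the volume-weighted median of the admitted transaction-level aqo values over a rolling window W (default 90 days, tier 1):

$$\mathrm{HAQO}_c(t) = \operatorname*{wmedian}_{j \in \mathcal{J}_c(t,W)} \bigl(\mathrm{AQO}_j;\ w_j\bigr), \qquad w_j = v_j\,\tau_j$$

here $\mathcal{J}_c(t,W)$ is the set of admitted category-$c$ transactions that fall inside the window of length $W$ ending at $t$.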
each transaction’s weight is its output-unit volume $v_j$ multiplied by its tier credibility weight $\tau_j$ (§7). the median, not the mean, is the central estimator for the same reason the new york federal reserve uses a volume-weighted median for sofr[3]: a single large contract cannot drag the headline. the headline is robust to the tails by construction, and the tails are where both data errors and outlier deals concentrate.
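one way to compute such a headline is sketched below. the tie-breaking convention (return the first value at which cumulative weight reaches half of the total) is an assumption, since the text does not fix one, and the sample values are illustrative.

```python
from typing import Sequence


def weighted_median(values: Sequence[float], weights: Sequence[float]) -> float:
    """Smallest value at which cumulative weight reaches half of the total."""
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and equal in length")
    total = float(sum(weights))
    if total <= 0.0 or min(weights) < 0.0:
        raise ValueError("weights must be non-negative with a positive sum")
    cumulative = 0.0
    ordered = sorted(zip(values, weights))  # ascending by AQO value
    for value, weight in ordered:
        cumulative += weight
        if cumulative >= total / 2.0:
            return value
    return ordered[-1][0]  # only reachable through floating-point rounding


# Illustrative transactions: AQO value, output-unit volume, tier weight.
aqo_values = [12.0, 13.0, 14.0, 400.0]
volumes = [100, 80, 120, 5]
tier_weights = [1.0, 1.0, 0.7, 1.0]

weights = [v * t for v, t in zip(volumes, tier_weights)]
headline = weighted_median(aqo_values, weights)
print(headline)  # 13.0
```

the 400.0 outlier carries little weight and sits in the tail, so it does not move the result; a weighted mean over the same data would be pulled upward by it.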
6. provider-level aqo & shrinkage
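a provider with few transactions has a noisy own-estimate. rather than publish that noise or discard the provider, the index publishes a provider-level figure that is a bayesian shrinkage estimator[6], blending the provider’s own aqo with the category headline in proportion to how much data the provider has:

$$\widehat{\mathrm{AQO}}_i = \frac{n_i}{n_i + \kappa_c}\,\mathrm{AQO}_i + \frac{\kappa_c}{n_i + \kappa_c}\,\mathrm{HAQO}_c$$

where $\mathrm{AQO}_i$ is the provider’s own estimate over its $n_i$ attested transactions.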
$\kappa_c$ is the prior strength, default 30 transactions. a brand-new provider ($n_i \to 0$) shrinks fully to the category headline; an established provider ($n_i \gg \kappa_c$) is governed by its own data. shrinkage is also the steady-state reality-contact mechanism: it keeps provider figures honest before enough evidence exists to stand alone.
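a minimal sketch of the blend, assuming the standard weight n / (n + κ) implied by the two limits described above. the function name `shrunk_aqo` is hypothetical and the inputs are illustrative.

```python
def shrunk_aqo(own_aqo: float, n: int, category_headline: float, kappa: float = 30.0) -> float:
    """Blend a provider's own AQO with the category headline.

    The provider's own estimate gets weight n / (n + kappa); the category
    headline gets the remainder.
    """
    if n < 0 or kappa <= 0.0:
        raise ValueError("n must be >= 0 and kappa must be > 0")
    own_weight = n / (n + kappa)
    return own_weight * own_aqo + (1.0 - own_weight) * category_headline


# Provider's own AQO is 20.0; the category headline is 10.0.
print(shrunk_aqo(20.0, 0, 10.0))     # 10.0  (no data: the headline governs)
print(shrunk_aqo(20.0, 30, 10.0))    # 15.0  (n equals kappa: an even blend)
print(shrunk_aqo(20.0, 3000, 10.0))  # close to 20.0 (own data governs)
```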
7. data tiers & admission
not all evidence is equal. each transaction carries a tier credibility weight $\tau_j$ set by its source. table 2 (the lower-cased mapping of the data-tier scheme) gives the weights.
| tier | source type | τ |
|---|---|---|
| a | workforce marketplace transactions; signed vendor feeds | 1.0 |
| b | github merged prs; upwork verified profiles; bls oews; sam.gov contracts; levels.fyi; apex | 0.7 |
| c | vendor pricing pages; stack overflow survey; glassdoor | 0.1–0.3 |
| d | analyst reports; blog posts; case studies | 0.0 (context only) |
7.1 tier a admission
a candidate transaction is admitted as tier a only after all of: (i) the signature validates; (ii) all required fields are present and conformant to the published spec; (iii) the buyer-confirmation marker is present and non-revoked; and (iv) the output passes three-reviewer scoring at cohen’s κ ≥ 0.85[5]. public-source tiers (b–d) are admitted under hiQ Labs v. LinkedIn[8]: public pages only, full robots.txt compliance, source attribution on every row, no fake accounts, no logged-in scraping.
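the four tier a conditions can be read as a conjunction of checks. the sketch below assumes a hypothetical `Candidate` record whose fields stand in for the results of signature validation, spec conformance, buyer confirmation, and reviewer scoring; none of these field names come from the published spec.

```python
from dataclasses import dataclass
from typing import List

KAPPA_FLOOR = 0.85  # reviewer-agreement threshold stated in the text


@dataclass(frozen=True)
class Candidate:
    """Hypothetical pre-computed check results for one candidate transaction."""
    signature_valid: bool
    fields_conformant: bool
    buyer_confirmed: bool
    confirmation_revoked: bool
    reviewer_kappa: float


def tier_a_failures(candidate: Candidate) -> List[str]:
    """Return the names of the admission conditions that fail (empty if admitted)."""
    failures = []
    if not candidate.signature_valid:
        failures.append("signature")
    if not candidate.fields_conformant:
        failures.append("required_fields")
    if not candidate.buyer_confirmed or candidate.confirmation_revoked:
        failures.append("buyer_confirmation")
    if candidate.reviewer_kappa < KAPPA_FLOOR:
        failures.append("reviewer_agreement")
    return failures


good = Candidate(True, True, True, False, 0.91)
weak = Candidate(True, True, True, False, 0.84)
print(tier_a_failures(good))  # []
print(tier_a_failures(weak))  # ['reviewer_agreement']
```

returning the list of failed conditions, rather than a bare boolean, keeps the reason for a rejection available for audit.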
8. uncertainty quantification
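a headline figure is never published as a bare point estimate. the confidence interval is computed by bootstrap[7], with B = 10,000 resamples drawn with replacement from the weighted transaction set, and the 80% interval is the [10th, 90th] percentile of the resampled HAQO values:

$$\mathrm{CI}_{80}(\mathrm{HAQO}_c) = \Bigl[\, Q_{0.10}\bigl(\{\mathrm{HAQO}_c^{*(b)}\}_{b=1}^{B}\bigr),\ Q_{0.90}\bigl(\{\mathrm{HAQO}_c^{*(b)}\}_{b=1}^{B}\bigr) \Bigr]$$

where $\mathrm{HAQO}_c^{*(b)}$ is the headline recomputed on the $b$-th resample and $Q_p$ is the empirical $p$-quantile.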
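a minimal sketch of the percentile bootstrap, assuming that each resample draws transactions uniformly with replacement and recomputes the weighted median with the original weights attached. the text does not pin down the resampling scheme, so this is one reading of it; the helper names and sample data are hypothetical.

```python
import random
import statistics
from typing import Sequence, Tuple


def weighted_median(values: Sequence[float], weights: Sequence[float]) -> float:
    """Smallest value at which cumulative weight reaches half of the total."""
    total = float(sum(weights))
    cumulative = 0.0
    ordered = sorted(zip(values, weights))
    for value, weight in ordered:
        cumulative += weight
        if cumulative >= total / 2.0:
            return value
    return ordered[-1][0]


def bootstrap_ci(values: Sequence[float], weights: Sequence[float],
                 resamples: int = 10_000, seed: int = 0) -> Tuple[float, float]:
    """80% percentile interval for the weighted median (weights must be positive)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    estimates = []
    for _ in range(resamples):
        picks = rng.choices(range(n), k=n)  # indices drawn with replacement
        estimates.append(weighted_median([values[i] for i in picks],
                                         [weights[i] for i in picks]))
    deciles = statistics.quantiles(estimates, n=10, method="inclusive")
    return deciles[0], deciles[-1]  # 10th and 90th percentiles


# Illustrative AQO values and weights (volume times tier weight).
sample_values = [11.0, 12.0, 12.5, 13.0, 14.0, 15.0, 40.0]
sample_weights = [50.0, 80.0, 60.0, 90.0, 70.0, 40.0, 5.0]
lower, upper = bootstrap_ci(sample_values, sample_weights)
print(lower, upper)
```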
8.1 status thresholds
the ci ratio — upper bound divided by lower bound — is the published measure of how trustworthy a figure is, and it drives a discrete status label. table 3 gives the cascade.
| status | condition (ci ratio = upper / lower) |
|---|---|
| live | ≤ 1.5 |
| preview | ≤ 2.5 |
| thin_data | ≤ 4.0 |
| below_threshold | none of the above pass |
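the cascade in table 3 amounts to an ordered series of threshold tests. a minimal sketch, with a hypothetical function name and an added guard for non-positive bounds that the text does not specify:

```python
def status_label(ci_lower: float, ci_upper: float) -> str:
    """Map a confidence interval to a status label by its ratio upper / lower."""
    if ci_lower <= 0.0 or ci_upper < ci_lower:
        # Guard is an assumption: the ratio is meaningless for such bounds.
        raise ValueError("bounds must satisfy 0 < ci_lower <= ci_upper")
    ratio = ci_upper / ci_lower
    if ratio <= 1.5:
        return "live"
    if ratio <= 2.5:
        return "preview"
    if ratio <= 4.0:
        return "thin_data"
    return "below_threshold"


print(status_label(10.0, 15.0))  # live
print(status_label(10.0, 20.0))  # preview
print(status_label(10.0, 40.0))  # thin_data
print(status_label(10.0, 41.0))  # below_threshold
```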
9. capability-cohort adjustment
a gpt-5-class agent and a gpt-7-class agent should not compete directly inside one headline. the cohort-adjusted figure divides the headline by a capability index normalized to 1.0 at the baseline cohort and epoch (2026-01-01):

$$\mathrm{HAQO}_c^{\mathrm{adj}}(t) = \frac{\mathrm{HAQO}_c(t)}{\mathrm{CapIdx}_C(t)}$$
CapIdx is derived from the apex-agents pass@1 leaderboard. the adjustment lets the index track the price of a task while holding capability fixed, so that a falling headline reflects a cheaper market rather than merely a stronger model generation.
10. measurability scope
aqo is defined only over commodifiable outputs. a four-tier scope fixes what is indexed and what never will be.
| tier | what is indexed | status |
|---|---|---|
| 1 | binary outputs: cs resolution, code merge, lead qualified, ticket closed | ships v1 |
| 2 | continuous quality: consulting deliverables, legal briefs, financial models | v2–v3 (18–36 mo) |
| 3 | proxy metrics: therapy nps, r&d attribution | 5–10 yr roadmap |
| 4 | ceo decisions, board governance, creative direction, relationship sales, crisis judgment | permanently out of scope |
11. teams (interim policy)
a team transaction produces a bundled output (e.g. a coding agent + reviewer + deployer shipping a feature) that has no clean single-category aqo destination. until the hybrid index ships, team transactions follow a hold-and-backfill policy: all are captured into the same pipeline; none contaminate any single category’s headline; aggregate counts, gmv, and category mix are published in a separate “teams — awaiting hybrid” view; and on hybrid v1 all held transactions are backfilled with a documented ledger entry. sellers see their own provisional figure with an explicit “hybrid in development” disclosure. holding is recoverable; force-fitting or silently excluding is not.
12. worked example
two providers, sam and bob, each complete a transaction priced at c = 10 in the same category. sam is the stronger provider; bob is weaker. equation (1) applies directly.
| provider | cost c | eval e | outcome s | aqo = c / (e·s) |
|---|---|---|---|---|
| sam | 10 | 0.9 | 0.9 | 12.35 |
| bob | 10 | 0.5 | 0.5 | 40.00 |
the example shows the core point: equal raw price, a 3.2× gap in aqo. a benchmark on raw price would call these identical; aqo separates them. in the headline, each transaction then counts in proportion to its volume and tier credibility weight (§5), so better-evidenced transactions govern the published figure.
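the table can be checked by applying equation (1) to both rows. the short script below is only an arithmetic check on the illustrative figures above:

```python
# Illustrative inputs from the worked example: (cost, eval score, outcome score).
providers = {
    "sam": (10.0, 0.9, 0.9),
    "bob": (10.0, 0.5, 0.5),
}

# AQO = cost / (eval * outcome) for each provider.
results = {name: cost / (e * s) for name, (cost, e, s) in providers.items()}

for name, value in results.items():
    print(f"{name}: {value:.2f}")  # sam: 12.35, bob: 40.00

gap = results["bob"] / results["sam"]
print(f"gap: {gap:.2f}x")  # gap: 3.24x
```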
13. corrections & governance
the workforce labor index editorial board governs this methodology; board composition and conflict-of-interest disclosures are public, and changes to formula constants, tier weights, the shrinkage prior, or category boundaries require a public comment period, a board vote, and a methodology version increment that produces a new citation.
when a published figure is found to be wrong, the correction protocol forbids silent retraction and backdated correction. the affected figure is flagged but kept visible; a root-cause record is filed; the figure is recomputed with both versions shown for 30 days; and if the change is material or the figure was cited externally, every known external citation is named in the record. corrections never produce a bare point estimate — the uncertainty band is preserved across the error event.
14. reproducibility & availability
reproducibility is a published standard: an independent computation, given the same admitted corpus and the same methodology version, must reproduce the figure within one ci half-width. the reference implementation is an open calculator with a math-trace for every intermediate step, validated by automated suites — the core, round-two, teams/workflows, icm-workflow, and industry-benchmark suites together assert several hundred checks against the formulae in this paper.
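the standard can be expressed as a single comparison. the sketch below assumes that the half-width is taken from the published figure's own interval, which the text does not state explicitly; the function name is hypothetical.

```python
def reproduces(published: float, independent: float,
               ci_lower: float, ci_upper: float) -> bool:
    """True if an independent result lies within one CI half-width of the published one."""
    if ci_upper < ci_lower:
        raise ValueError("ci_upper must be >= ci_lower")
    half_width = (ci_upper - ci_lower) / 2.0
    return abs(independent - published) <= half_width


# Published figure 13.0 with interval [12.0, 14.0]: the half-width is 1.0.
print(reproduces(13.0, 13.4, 12.0, 14.0))  # True
print(reproduces(13.0, 14.5, 12.0, 14.0))  # False
```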
the metric definition, the eval rubric, the sealed banks, the tier scheme, the status thresholds, and the reference calculator are released under cc-by-4.0. this is a v1.0 working draft published ahead of the first verified transactions; the figures shown throughout are illustrative sample or test-fixture data and are labeled as such.