companion paper · aqo · working draft · v1.0 · q-fin.gn · arxiv: forthcoming · doi: forthcoming · cc-by-4.0


the adjusted quality-outcome (aqo): a transaction-anchored, quality-normalized valuation metric for the outputs of artificial and human labor

the per-transaction unit beneath the workforce labor index — methodology, estimation, and governance · version 1.0 (working draft)

workforce labor index editorial board¹

¹workforce · methodology@workforce.griffain.com

companion to the wli methodology paper · doi: forthcoming · pre-publication working draft

— abstract —

raw price-per-task is a quality-blind number: a cheap output that fails is not cheaper than an expensive output that works. we present the adjusted quality-outcome (aqo), the transaction-level metric that the workforce labor index aggregates into its published figures. for a transaction j, aqo is the effective cost divided by the product of a provider eval score and a realized outcome score — the currency required per quality-adjusted unit of output. the eval score is supplied by a sealed, versioned task bank scored by three reviewers at cohen’s κ ≥ 0.85; the outcome score is the per-transaction realization. low quality raises aqo; failures leave the headline and enter a separate failure sub-index.

transaction-level aqo is aggregated to a headline category figure by a volume-weighted median over a rolling window, stabilized at the provider level by bayesian shrinkage toward the category prior, admitted under a four-tier data-quality scheme, bounded by a bootstrap confidence interval with explicit status thresholds, and normalized across capability cohorts so that successive model generations do not contaminate one another. the metric is defined only over commodifiable outputs; a four-tier measurability scope places human judgment permanently out of index. every figure in this document is illustrative sample or test-fixture data and is labeled as such; the methodology, rubric, sealed banks, and reference calculator are released under cc-by-4.0.

keywords: adjusted quality-outcome · quality-normalized pricing · eval score · volume-weighted median · bayesian shrinkage · bootstrap confidence intervals · capability cohort · measurability scope · transaction-anchored data

1. introduction

the market for artificial-labor outputs is priced today in raw dollars per task. that number is quality-blind: it cannot distinguish a $2 customer-support reply that resolves the ticket from a $2 reply that escalates it, nor a $40 brief that wins from a $40 brief that is rejected. a benchmark that anchors on raw price inherits this blindness and rewards the cheapest failure.

the adjusted quality-outcome (aqo) is the unit that removes it. aqo asks not “what did this output cost” but “what did a unit of working output cost,” by dividing transaction cost by two multiplicative quality factors — an independent eval score and a realized outcome score. the construction mirrors the kelley blue book move from sticker price to condition-adjusted value^[1]: the headline is a quality-normalized price, not a list price.

aqo is the per-transaction substrate of the workforce labor index (wli)^[2]; this paper is the companion to the wli methodology^[2] and specifies the metric the index aggregates.

1.1 contributions

this document makes four contributions: (i) a transaction-level, quality-normalized valuation metric with explicit handling of failure and undefined cases; (ii) a volume-weighted headline estimator stabilized by bayesian shrinkage toward a category prior; (iii) a capability-cohort normalization that prevents successive model generations from contaminating a single figure; and (iv) a four-tier measurability scope, a data-tier admission scheme, and a public correction protocol that together bound what aqo may and may not claim.

1.2 foundations

the institution that publishes aqo is organized under the interpretable context methodology (icm), in which folder structure is the agentic architecture^[9]. the full icm paper, van clief & mcdermott (2026), is the architectural substrate beneath the eval, estimation, and audit machinery described here.

2. notation

table 1 fixes the symbols used throughout. subscript j indexes a single admitted transaction; c indexes a task category; i (or π) indexes a provider; w indexes a weekly close.

table 1 · notation

symbol	meaning	range / unit
c_j	effective normalized cost of transaction j, deflated to the 2026 baseline	usd
e_π,c	provider eval score in category c (sealed bank)	0–1
s_j	realized outcome score for transaction j	0–1
AQO_j	adjusted quality-outcome of transaction j	usd / quality-adj. unit
v_j	output-unit volume of transaction j	count
τ_j	tier credibility weight (table 2)	0–1
HAQO_c	headline category aqo	usd / quality-adj. unit
κ_c	shrinkage prior strength (default 30)	transactions
n_i	provider attested transaction count	count
CapIdx_C	capability-cohort index (baseline 1.0 at 2026-01-01)	ratio

Tab. 1 · symbols and units. all costs are deflated to a fixed 2026 baseline so figures are comparable across time.

3. the aqo metric

the adjusted quality-outcome of a single admitted transaction is its effective cost divided by the product of the provider’s eval score and the transaction’s realized outcome score:

$$\mathrm{AQO}_j = \frac{c_j}{e_{\pi,c} \times s_j} \tag{1}$$

the denominator is a quality multiplier in [0,1]. perfect quality (e = s = 1) leaves aqo equal to raw cost; any shortfall in either factor inflates aqo above cost, expressing the true price of a working unit. the unit of aqo is therefore currency per quality-adjusted unit of output.

3.1 undefined and failure cases

if the denominator is zero, aqo is undefined — it is not reported as a large finite number. a failed transaction (s_j = 0) is routed to the failure-rate sub-index, not to the headline; this prevents a division-by-zero artifact from masquerading as an expensive success and keeps the failure signal separate and visible.
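the two rules above can be sketched in a few lines. this is a minimal illustration, not the reference calculator: the function names and the destination labels ("headline", "failure_sub_index", "undefined") are hypothetical, and the handling of e = 0 with s > 0 is an assumption, since the text only specifies the s_j = 0 route.

```python
from typing import Optional, Tuple


def aqo(cost: float, eval_score: float, outcome_score: float) -> Optional[float]:
    """Equation (1): cost per quality-adjusted unit, or None when undefined."""
    for name, score in (("eval_score", eval_score), ("outcome_score", outcome_score)):
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {score}")
    denominator = eval_score * outcome_score
    if denominator == 0.0:
        return None  # undefined: never reported as a large finite number
    return cost / denominator


def route(cost: float, eval_score: float, outcome_score: float) -> Tuple[str, Optional[float]]:
    """Decide where a transaction goes; the destination labels are hypothetical."""
    if outcome_score == 0.0:
        return ("failure_sub_index", None)  # failed job: counted, not priced
    value = aqo(cost, eval_score, outcome_score)
    if value is None:
        return ("undefined", None)  # e = 0 with s > 0: assumed, not specified in the text
    return ("headline", value)
```

returning `None` rather than infinity or a sentinel number keeps an undefined value from leaking into any later median or mean.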

3.2 why two multiplicative factors

the eval score e is a provider-level capability measured independently on a sealed bank; the outcome score s is a transaction-level realization on the actual job. a provider can be broadly capable (high e) yet miss a specific job (low s), or vice versa. multiplying them, rather than averaging, means either failing alone is sufficient to inflate the cost of a working unit — the conservative behavior a price benchmark should have.
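a short worked comparison makes the difference concrete. the numbers are hypothetical and chosen only for illustration: c = 10, a capable provider with e = 0.9, and a poor realization with s = 0.2.

$$
\text{product: } \frac{10}{0.9 \times 0.2} = \frac{10}{0.18} \approx 55.56, \qquad \text{average: } \frac{10}{(0.9 + 0.2)/2} = \frac{10}{0.55} \approx 18.18
$$

an averaged denominator would let the strong eval score mask most of the poor outcome; the product does not.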

4. the eval score

the eval score e_π,c is what makes aqo quality-normalized rather than self-reported. it is produced by the workforce eval: a sealed, versioned task bank for category c, scored by three independent reviewers whose agreement must reach cohen’s κ ≥ 0.85^[5] for the result to be admitted. the score is per-provider, per-category, and re-estimated monthly for ai providers and quarterly for human providers.

because the bank is sealed and versioned, an eval score is reproducible: the same provider, the same bank version, and the same rubric reproduce the score within review tolerance. the eval is also the admission gate for transaction data (§7) and the surface from which any builder obtains a verified aqo score and badge.

relationship to the public “aqo score.” the builder-facing aqo score returned by a free eval is the same quantity, presented with its 95% confidence interval and the bank version it ran against, normalized against the category rate on the wli. it is a verifiable claim, auditable against the underlying evaluation, and not merely a badge.

5. headline aggregation

the headline category figure for category c at time t is the volume-weighted median of the admitted transaction-level aqo values over a rolling window W (default 90 days for tier 1 of the measurability scope, §10):

$$\mathrm{HAQO}_c(t) = \operatorname{median}_{w}\{\, \mathrm{AQO}_j : j \in W \,\}, \qquad w_j = v_j \times \tau_j \tag{2}$$

each transaction’s weight is its output-unit volume v_j multiplied by its tier credibility weight τ_j (§7). the median, not the mean, is the central estimator for the same reason the new york federal reserve uses a volume-weighted median for sofr^[3]: a single extreme-priced contract cannot drag the headline the way it would drag a mean. the headline is robust to the tails by construction, and the tails are where both data errors and outlier deals concentrate.
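a minimal sketch of the volume-weighted median in equation (2) follows. the text does not say how ties at exactly half the total weight are resolved, so the "lower weighted median" convention used here is an assumption, and the function names and the (aqo, volume, tau) tuple layout are hypothetical.

```python
from typing import Sequence, Tuple


def weighted_median(values: Sequence[float], weights: Sequence[float]) -> float:
    """Lower weighted median: first value at which cumulative weight reaches half."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    # zero-weight rows (e.g. tier d, tau = 0) carry no influence and are dropped
    pairs = sorted((v, w) for v, w in zip(values, weights) if w > 0)
    if not pairs:
        raise ValueError("no positively weighted observations")
    half = sum(w for _, w in pairs) / 2
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    return pairs[-1][0]


def headline_aqo(transactions: Sequence[Tuple[float, float, float]]) -> float:
    """transactions: (aqo_j, v_j, tau_j) rows inside the rolling window W."""
    values = [a for a, _, _ in transactions]
    weights = [v * t for _, v, t in transactions]  # w_j = v_j * tau_j
    return weighted_median(values, weights)


# an extreme 500.0 row with little volume does not move the headline
print(headline_aqo([(12.35, 100, 0.7), (40.0, 20, 0.7), (500.0, 1, 0.7)]))  # 12.35
```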

6. provider-level aqo & shrinkage

a provider with few transactions has a noisy own-estimate. rather than publish that noise or discard the provider, the provider-level figure is a bayesian shrinkage estimator^[6] that blends the provider’s own aqo with the category headline in proportion to how much data the provider has:

$$\mathrm{PAQO}_{\pi,c} = \frac{\kappa_c \cdot \mathrm{HAQO}_c + n_i \cdot \mathrm{AQO}^{\mathrm{own}}_{\pi,c}}{\kappa_c + n_i} \tag{3}$$

κ_c is the prior strength, default 30 transactions. a brand-new provider (n_i → 0) shrinks fully to the category headline; an established provider (n_i ≫ κ_c) is governed by its own data; at n_i = κ_c the two receive equal weight. shrinkage is also the steady-state reality-contact mechanism: it keeps provider figures honest before enough evidence exists to stand alone.
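equation (3) is a weighted average of the category headline and the provider's own figure, which a short sketch makes explicit. the function and parameter names are hypothetical; only the default prior strength of 30 comes from the text.

```python
from typing import Optional


def shrunk_provider_aqo(
    own_aqo: Optional[float],
    n_transactions: int,
    category_headline: float,
    prior_strength: float = 30.0,  # kappa_c, default from the text
) -> float:
    """Equation (3): blend the provider's own AQO with the category headline."""
    if n_transactions < 0 or prior_strength <= 0:
        raise ValueError("n_transactions must be >= 0 and prior_strength > 0")
    if n_transactions == 0 or own_aqo is None:
        return category_headline  # no evidence yet: the prior stands alone
    numerator = prior_strength * category_headline + n_transactions * own_aqo
    return numerator / (prior_strength + n_transactions)


# with n equal to the prior strength, the result sits halfway between the two
print(shrunk_provider_aqo(own_aqo=40.0, n_transactions=30, category_headline=12.0))  # 26.0
```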

7. data tiers & admission

not all evidence is equal. each transaction carries a tier credibility weight τ_j set by its source. table 2 gives the weights for data tiers a through d.

table 2 · data-tier credibility weights

tier	source type	τ
a	workforce marketplace transactions; signed vendor feeds	1.0
b	github merged prs; upwork verified profiles; bls oews; sam.gov contracts; levels.fyi; apex	0.7
c	vendor pricing pages; stack overflow survey; glassdoor	0.1–0.3
d	analyst reports; blog posts; case studies	0.0 (context only)

Tab. 2 · tier weights. at v1 launch the corpus is essentially all tier b/c; tier a grows with marketplace volume. confidence intervals widen automatically as b/c weight rises — this is honest, not a weakness.

7.1 tier a admission

a candidate transaction is admitted as tier a only after all of: (i) the signature validates; (ii) all required fields are present and conformant to the published spec; (iii) the buyer-confirmation marker is present and non-revoked; and (iv) the output passes three-reviewer scoring at cohen’s κ ≥ 0.85^[5]. public-source tiers (b–d) are admitted under hiQ Labs v. LinkedIn^[8]: public pages only, full robots.txt compliance, source attribution on every row, no fake accounts, no logged-in scraping.
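the four tier a conditions are conjunctive, so an admission check can simply collect whichever ones fail. the sketch below assumes each condition has already been evaluated upstream; every field name is a hypothetical stand-in for the published spec, and only the 0.85 threshold comes from the text.

```python
from dataclasses import dataclass
from typing import List

KAPPA_THRESHOLD = 0.85  # reviewer-agreement floor stated in the text


@dataclass(frozen=True)
class Candidate:
    # all field names are hypothetical, not taken from the published spec
    signature_valid: bool
    fields_conformant: bool
    buyer_confirmed: bool
    confirmation_revoked: bool
    reviewer_kappa: float


def tier_a_rejections(c: Candidate) -> List[str]:
    """Return the failed conditions (i)-(iv); an empty list means admitted."""
    reasons = []
    if not c.signature_valid:
        reasons.append("signature")
    if not c.fields_conformant:
        reasons.append("fields")
    if not c.buyer_confirmed or c.confirmation_revoked:
        reasons.append("buyer_confirmation")
    if c.reviewer_kappa < KAPPA_THRESHOLD:
        reasons.append("reviewer_agreement")
    return reasons
```

returning the list of failed conditions, rather than a bare boolean, leaves a record of why a candidate was not admitted.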

8. uncertainty quantification

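a headline figure is never published as a bare point estimate. the confidence interval is computed by bootstrap^[7]: B = 10,000 resamples are drawn with replacement from the weighted transaction set, HAQO is recomputed on each resample, and the 80% interval runs from the 10th to the 90th percentile of the resampled HAQO values, written here as the set 𝓗_c:

$$\mathrm{CI}_c = [\, q_{10}(\mathcal{H}_c),\ q_{90}(\mathcal{H}_c) \,], \qquad \mathcal{H}_c = \operatorname{bootstrap}(\mathrm{HAQO}_c, B), \quad B = 10^4 \tag{4}$$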
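a percentile bootstrap of the weighted median can be sketched as follows. several choices here are assumptions rather than the documented method: transactions are resampled uniformly with their weights carried along, percentiles use a nearest-rank rule, all weights are taken to be positive, and the function names are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple


def weighted_median(values: Sequence[float], weights: Sequence[float]) -> float:
    """Lower weighted median: first value at which cumulative weight reaches half."""
    pairs = sorted(zip(values, weights))
    half = sum(w for _, w in pairs) / 2
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    return pairs[-1][0]


def bootstrap_ci(
    values: Sequence[float],
    weights: Sequence[float],
    n_resamples: int = 10_000,  # B in the text
    seed: Optional[int] = None,
) -> Tuple[float, float]:
    """80% percentile interval: [10th, 90th] percentile of resampled medians."""
    n = len(values)
    if n == 0 or n != len(weights):
        raise ValueError("values and weights must be non-empty and equal length")
    rng = random.Random(seed)
    stats: List[float] = []
    for _ in range(n_resamples):
        picks = [rng.randrange(n) for _ in range(n)]  # draw with replacement
        stats.append(
            weighted_median([values[i] for i in picks], [weights[i] for i in picks])
        )
    stats.sort()

    def percentile(p: float) -> float:
        return stats[round(p / 100 * (len(stats) - 1))]  # nearest-rank rule

    return percentile(10), percentile(90)
```

passing a fixed `seed` makes the interval repeatable, which matters if an independent party is expected to reproduce a published figure.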

8.1 status thresholds

the ci ratio — upper bound divided by lower bound — is the published measure of how trustworthy a figure is, and it drives a discrete status label. table 3 gives the cascade.

table 3 · ci-ratio status thresholds

status	condition (ci ratio = upper / lower)
live	≤ 1.5
preview	≤ 2.5
thin_data	≤ 4.0
below_threshold	none of the above pass

Tab. 3 · status cascade. false precision is the credibility killer; the index publishes an interval and a status, never a bare number. kbb publishes a fair purchase price range^[1] — wli publishes confidence intervals.
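the cascade in table 3 maps directly onto an ordered chain of comparisons. a minimal sketch, with a hypothetical function name and an assumed requirement that the lower bound be positive so the ratio is defined:

```python
def ci_status(lower: float, upper: float) -> str:
    """Map a confidence interval to the status label of table 3."""
    if lower <= 0 or upper < lower:
        raise ValueError("expected 0 < lower <= upper")
    ratio = upper / lower
    if ratio <= 1.5:
        return "live"
    if ratio <= 2.5:
        return "preview"
    if ratio <= 4.0:
        return "thin_data"
    return "below_threshold"


print(ci_status(10.0, 15.0))  # live
print(ci_status(10.0, 41.0))  # below_threshold
```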

9. capability-cohort adjustment

a gpt-5-class agent and a gpt-7-class agent should not compete directly inside one headline. the cohort-adjusted figure divides the headline by a capability index normalized to 1.0 at the baseline cohort and epoch (2026-01-01):

$$\mathrm{CAQO}_{c,C}(t) = \frac{\mathrm{HAQO}_c(t)}{\mathrm{CapIdx}_C(t)} \tag{5}$$

CapIdx is derived from the apex-agents pass@1 leaderboard. the adjustment lets the index track the price of a task while holding capability fixed, so that a falling headline reflects a cheaper market rather than merely a stronger model generation.
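equation (5) is a single division, sketched here with hypothetical names and input values. how CapIdx is computed from the leaderboard is not specified in this section, so the sketch takes it as a given input.

```python
def cohort_adjusted_aqo(headline_aqo: float, cap_idx: float) -> float:
    """Equation (5): headline divided by the capability-cohort index."""
    if cap_idx <= 0:
        raise ValueError("cap_idx must be positive (baseline cohort is 1.0)")
    return headline_aqo / cap_idx


print(cohort_adjusted_aqo(12.5, 1.0))   # 12.5, baseline cohort leaves the headline unchanged
print(cohort_adjusted_aqo(12.5, 1.25))  # 10.0
```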

10. measurability scope

aqo is defined only over commodifiable outputs. a four-tier scope fixes what is indexed and what never will be.

table 4 · four-tier measurability scope

tier	what is indexed	status
1	binary outputs: cs resolution, code merge, lead qualified, ticket closed	ships v1
2	continuous quality: consulting deliverables, legal briefs, financial models	v2–v3 (18–36 mo)
3	proxy metrics: therapy nps, r&d attribution	5–10 yr roadmap
4	ceo decisions, board governance, creative direction, relationship sales, crisis judgment	permanently out of scope

Tab. 4 · tier 4 is not a limitation; it is the methodology’s integrity. people are never indexed — only commodifiable outputs.

11. teams (interim policy)

a team transaction produces a bundled output (e.g. a coding agent + reviewer + deployer shipping a feature) that has no clean single-category aqo destination. until the hybrid index ships, team transactions follow a hold-and-backfill policy: all are captured into the same pipeline; none contaminate any single category’s headline; aggregate counts, gmv, and category mix are published in a separate “teams — awaiting hybrid” view; and on hybrid v1 all held transactions are backfilled with a documented ledger entry. sellers see their own provisional figure with an explicit “hybrid in development” disclosure. holding is recoverable; force-fitting or silently excluding is not.

12. worked example

two providers, sam and bob, each complete a transaction priced at c = 10 in the same category. sam is the stronger provider; bob is weaker. equation (1) applies directly.

table 5 · worked aqo (illustrative test fixture)

provider	cost c	eval e	outcome s	aqo = c / (e·s)
sam	10	0.9	0.9	12.35
bob	10	0.5	0.5	40.00

Tab. 5 · same sticker price, very different quality-adjusted cost. with sam at higher (tier b) volume, the volume-weighted median HAQO sits near sam’s 12.35, not bob’s 40.00. fixture from the methodology test suite (T1); not a market figure.

the example shows the core point: equal raw price, a 3.2× gap in aqo. a benchmark on raw price would call these identical; aqo separates them, and the volume-weighted median lets the higher-evidence provider govern the headline.
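the fixture can be reproduced in a few lines. costs and scores are those of table 5; the volumes (100 and 20) and the shared tier weight of 0.7 are hypothetical, chosen only so that sam carries the larger weight as the caption describes.

```python
# table 5 inputs; volumes and the shared tier weight are hypothetical
providers = {
    "sam": {"cost": 10.0, "eval": 0.9, "outcome": 0.9, "volume": 100, "tau": 0.7},
    "bob": {"cost": 10.0, "eval": 0.5, "outcome": 0.5, "volume": 20, "tau": 0.7},
}

# equation (1) for each provider
aqo = {name: p["cost"] / (p["eval"] * p["outcome"]) for name, p in providers.items()}
gap = aqo["bob"] / aqo["sam"]

# volume-weighted median: first value at which cumulative weight reaches half
ranked = sorted((aqo[n], p["volume"] * p["tau"]) for n, p in providers.items())
half = sum(w for _, w in ranked) / 2
running = 0.0
headline = ranked[-1][0]
for value, weight in ranked:
    running += weight
    if running >= half:
        headline = value
        break

print(f"sam {aqo['sam']:.2f}  bob {aqo['bob']:.2f}  gap {gap:.2f}x  headline {headline:.2f}")
# sam 12.35  bob 40.00  gap 3.24x  headline 12.35
```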

13. corrections & governance

the workforce labor index editorial board governs this methodology; board composition and conflict-of-interest disclosures are public, and changes to formula constants, tier weights, the shrinkage prior, or category boundaries require a public comment period, a board vote, and a methodology version increment that produces a new citation.

when a published figure is found to be wrong, the correction protocol forbids silent retraction and backdated correction. the affected figure is flagged but kept visible; a root-cause record is filed; the figure is recomputed with both versions shown for 30 days; and if the change is material or the figure was cited externally, every known external citation is named in the record. corrections never produce a bare point estimate — the uncertainty band is preserved across the error event.

14. reproducibility & availability

reproducibility is a published standard: an independent computation, given the same admitted corpus and the same methodology version, must reproduce the figure within one ci half-width. the reference implementation is an open calculator with a math-trace for every intermediate step, validated by automated suites — the core, round-two, teams/workflows, icm-workflow, and industry-benchmark suites together assert several hundred checks against the formulae in this paper.
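the reproduction standard reduces to a tolerance check. a minimal sketch, assuming the half-width is taken from the published interval; the function name and argument layout are hypothetical.

```python
def reproduces(published: float, independent: float, ci_lower: float, ci_upper: float) -> bool:
    """True when an independent figure lands within one CI half-width of the published one."""
    if ci_upper < ci_lower:
        raise ValueError("ci_upper must be >= ci_lower")
    half_width = (ci_upper - ci_lower) / 2
    return abs(independent - published) <= half_width


print(reproduces(12.0, 12.5, 11.0, 13.0))  # True
print(reproduces(12.0, 13.5, 11.0, 13.0))  # False
```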

the metric definition, the eval rubric, the sealed banks, the tier scheme, the status thresholds, and the reference calculator are released under cc-by-4.0. this is a v1.0 working draft published ahead of the first verified transactions; the figures shown throughout are illustrative sample or test-fixture data and are labeled as such.

references

[1] kelley blue book, fair purchase price and condition-adjusted valuation methodology. 1926, contemporary revision.

[2] workforce labor index editorial board, the workforce labor index: a transaction-anchored, iosco-aligned benchmark. methodology v1.0 (working draft), 2026.

[3] federal reserve bank of new york, secured overnight financing rate (sofr): methodology and statistics. 2018, revised 2024.

[4] international organization of securities commissions (iosco), principles for financial benchmarks: final report. fr-08/13, july 2013.

[5] cohen, j. a coefficient of agreement for nominal scales. educational and psychological measurement, 20(1), 37–46. 1960.

[6] efron, b. & morris, c. stein’s estimation rule and its competitors: an empirical bayes approach. journal of the american statistical association, 68(341), 117–130. 1973.

[7] efron, b. bootstrap methods: another look at the jackknife. annals of statistics, 7(1), 1–26. 1979.

[8] united states court of appeals, ninth circuit, hiQ labs, inc. v. linkedin corp. 2019, reaffirmed 2022.

[9] van clief, j. & mcdermott, d. interpretable context methodology: folder structure as agentic architecture. 2026.

cite the metric. then cite the figure.

paper: aqo · companion to wli

status: working draft

doi: forthcoming

arxiv: forthcoming

license: cc-by-4.0

preferred form: workforce labor index editorial board (2026). the adjusted quality-outcome (aqo), methodology v1.0 (working draft).
