Comparison of CrewAI and Cognition (Devin)
From the WorkForce Vendor Encyclopedia · diff view · category code generation · cite: DOI 10.5281/zenodo.x
★ sample data · vendors not yet independently scored · live at TX1
This page is a head-to-head comparison of CrewAI and Cognition (Devin), both operating in the code generation category. The WorkForce Labor Index (WLI) for the category holds at — per task. CrewAI is an open framework for orchestrating role-based multi-agent workflows.
★ contents
- AQO scorecard
- Sub-score diff
- Verdict
- See also
★ AQO scorecard
Both vendors are benchmarked against the same sealed test bank under the same five-dimensional AQO rubric.[1] The WorkForce Labor Index for code generation settled at —/task for the period.[2] The scores below are illustrative sample data pending independent evaluation (TX1).
| ★ dimension | CrewAI | Cognition (Devin) |
|---|---|---|
| ★ composite AQO | 85 · top 12% | 81 · top 18% |
| ★ ask · WLI — | — · under WLI | — · at WLI |
| ★ reasoning quality | 85 | 87 |
| ★ output correctness | 77 | 77 |
| ★ tool use · latency | 31 min | 33 min |
| ★ safety · red-team | 100% | 100% |
| ★ κ rating · ≥0.74 | 0.81 | 0.83 |
| ★ 30-day volume | 425 | 259 |
★ verdict · summary
On composite AQO, CrewAI leads Cognition (Devin) by 4 points (85 vs. 81) in this sample. For procurement teams that weigh composite AQO and price first, the higher-AQO vendor priced under the WLI is preferred; teams that weigh correctness and speed should compare the output correctness and latency rows instead.[3] Both vendors should be independently scored before a contract is signed.
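
The two selection rules in the verdict can be sketched in a few lines of Python. This is only an illustration built on the sample scorecard above: the `Scorecard` class, its field names, and both helper functions are hypothetical and not part of any WorkForce tooling, and the tie-break in the second rule (equal correctness falls back to lower latency) is an assumption, since the page does not say how ties are resolved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Scorecard:
    """Hypothetical container for one vendor's sample scorecard rows."""
    vendor: str
    composite_aqo: int
    correctness: int
    latency_min: int
    under_wli: bool  # True when the ask is priced under the WLI


# Sample values copied from the scorecard table (illustrative, not verified).
crewai = Scorecard("CrewAI", 85, 77, 31, under_wli=True)
devin = Scorecard("Cognition (Devin)", 81, 77, 33, under_wli=False)


def prefer_by_composite_and_price(a: Scorecard, b: Scorecard) -> Optional[Scorecard]:
    """Return the higher-AQO vendor if it is priced under the WLI, else None."""
    leader = max(a, b, key=lambda s: s.composite_aqo)
    return leader if leader.under_wli else None


def prefer_by_correctness_and_speed(a: Scorecard, b: Scorecard) -> Scorecard:
    """Assumed rule: higher correctness wins; ties go to the lower latency."""
    return min(a, b, key=lambda s: (-s.correctness, s.latency_min))


composite_gap = crewai.composite_aqo - devin.composite_aqo
```

With the sample data both rules select CrewAI, the second only through the assumed latency tie-break, because output correctness is tied at 77. The first rule returns `None` when the composite leader is not priced under the WLI, which leaves that case to the procurement team rather than guessing.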