Breaking: Frontier Models Now Beat Human Accountants. Why Does Your Close Still Take All Month?
One question keeps coming up in accounting right now: how much of the actual work can AI do?
Transaction classification is a useful proving ground. It’s central to bookkeeping, repetitive enough to benchmark at scale, and nuanced enough to expose AI shortcomings.
In our latest Beyond the AI Hype benchmarks, we ran the same transaction-classification task three ways: frontier models, outsourced human accountants, and Digits Agentic General Ledger™. All three processed the same 2,000 transactions across four businesses. Results were scored against a U.S. GAAP ground-truth set by an independent panel of professional accountants.
Digits AGL® has led every benchmark we’ve run, and this year was no different. It finished at 97.8% accuracy, with sub-second turnaround and zero hallucinations, outperforming the best frontier model’s 81% baseline by about 17 percentage points and every human baseline we’ve tested.
This year’s benchmark also produced a result we didn’t have before: frontier models beat the human baseline for the first time.
Five of the models, to be exact.
Year over year, frontier models are becoming faster, more accurate, and less prone to hallucination. Work that firms have historically delegated to junior staff, offshore teams, or layers of expensive software can now be performed at human-level accuracy by AI available to anyone.
That raises an uncomfortable question: if frontier models can deliver human-level transaction categorization accuracy out of the box, what differentiates purpose-built platforms like Digits?
When accuracy becomes a commodity
Once accuracy stops being the differentiator, what matters is what an AI system can actually do for your firm:
Can it be deployed reliably across every client? Does it require human review for every edge case? Does it become more efficient as volume scales? Can it operate within the system of record, where context compounds with every transaction, correction, and close? Does it integrate with the rest of the accounting workflow? Does it provide the controls, auditability, and automation firms need to operate with confidence? Most importantly, does it help close the books — or simply categorize transactions faster?
A powerful model is increasingly accessible to everyone. The real differentiator is the system built around it.
Token economics
General-purpose models are configurable. Feed them more context (historical transactions, business rules, retrieval, search) and accuracy climbs. The best frontier model reached 87% with skills and tool configuration, a meaningful improvement over its 81% baseline.

But every point of improvement costs you. More context means more tokens, more latency, more orchestration. The enhanced workflows we tested ran 5–22x slower and consumed 8–19x more resources than basic inference. You're not just paying for the model — you're paying for everything you had to build around it to make it work. Now run that math across thousands of transactions, hundreds of clients, and every monthly close.
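To make "run that math" concrete, here is a rough sketch of how the reported multipliers (5–22x slower, 8–19x more resources) scale over one monthly close. Every baseline figure in it (per-transaction latency, per-transaction cost, transactions per client, number of clients) is a hypothetical placeholder, not benchmark data. It also assumes cost scales linearly with resource use and that transactions are processed one at a time.

```python
# All baseline figures below are hypothetical placeholders, not benchmark data.
BASE_SECONDS_PER_TXN = 1.0   # assumed latency of basic inference
BASE_COST_PER_TXN = 0.001    # assumed dollar cost of basic inference
TXNS_PER_CLIENT = 1_000      # assumed monthly transactions per client
CLIENTS = 200                # assumed number of clients

# Multipliers reported for the enhanced workflows in the benchmark.
SLOWDOWN_RANGE = (5, 22)
RESOURCE_RANGE = (8, 19)


def monthly_totals(time_multiplier, cost_multiplier):
    """Return (hours, dollars) for one monthly close across all clients.

    Assumes sequential processing and cost proportional to resource use.
    """
    txns = TXNS_PER_CLIENT * CLIENTS
    hours = txns * BASE_SECONDS_PER_TXN * time_multiplier / 3600
    dollars = txns * BASE_COST_PER_TXN * cost_multiplier
    return hours, dollars


baseline = monthly_totals(1, 1)
low = monthly_totals(SLOWDOWN_RANGE[0], RESOURCE_RANGE[0])
high = monthly_totals(SLOWDOWN_RANGE[1], RESOURCE_RANGE[1])

for label, (hours, dollars) in [("basic", baseline), ("low", low), ("high", high)]:
    print(f"{label:>5}: {hours:8.1f} hours, ${dollars:,.2f} per close")
```

Swap in your own volumes and prices. The multipliers are the only inputs taken from the benchmark.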

Digits AGL® hit 97.8% accuracy at sub-second turnaround with no incremental AI usage cost. The best frontier configuration capped out at 87% accuracy with 1,634x more processing time and 311x more compute.
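As a quick illustration of what those accuracy figures mean in practice, the sketch below converts each reported percentage into the number of misclassified transactions it implies on the 2,000-transaction benchmark set. The counts are derived from the rounded percentages quoted in this post, so treat them as approximate. The function and variable names are illustrative only.

```python
TOTAL_TRANSACTIONS = 2_000  # benchmark size stated in this post

# Reported accuracies, stored in tenths of a percent so the math stays in integers.
ACCURACY_TENTHS = {
    "Digits AGL": 978,
    "best frontier model, configured": 870,
    "best frontier model, baseline": 810,
}


def implied_errors(accuracy_tenths, total=TOTAL_TRANSACTIONS):
    """Misclassified transactions implied by an accuracy given in tenths of a percent."""
    return total * (1000 - accuracy_tenths) // 1000


errors = {name: implied_errors(acc) for name, acc in ACCURACY_TENTHS.items()}

for name, count in errors.items():
    print(f"{name}: about {count} of {TOTAL_TRANSACTIONS} misclassified")
```

On these figures, the implied error counts are roughly 44, 260, and 380 transactions respectively. Each of those is an entry someone still has to find and fix before the close.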

With general-purpose models, performance and affordability move in opposite directions. With a purpose-built one, they don't have to.
The sync trap
Cost is real — but it’s not even the main problem.
Accounting software was supposed to be the place where the work got done. Instead, workflows turned into a patchwork of exports, imports, integrations, and syncs: data leaves the system of record, decisions happen somewhere else, and the answer comes back later. Every data handoff creates delay, reconciliation work, and potential truth drift.
Most AI bookkeeping products are running the same play. A transaction leaves the ledger, gets sent to an LLM, and comes back with a suggested category.
Replacing the external tool with an LLM does not eliminate the problem. It just gives the old sync problem a smarter-looking interface.
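The round trip described above can be sketched in a few lines. This is a toy illustration, not how any particular product is built: the `Ledger` class, the version counter, and the `external_classifier` stand-in are all hypothetical. It shows a transaction being exported, a correction landing in the ledger while the export is in flight, and the returning suggestion arriving stale.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    txn_id: str
    memo: str
    category: str
    version: int = 1  # bumped on every change made inside the ledger


class Ledger:
    """Toy system of record; hypothetical, for illustration only."""

    def __init__(self):
        self._txns = {}

    def add(self, txn):
        self._txns[txn.txn_id] = txn

    def export(self, txn_id):
        """Hand a snapshot to an external tool (the 'leaves the ledger' step)."""
        txn = self._txns[txn_id]
        return {"txn_id": txn.txn_id, "memo": txn.memo, "version": txn.version}

    def correct(self, txn_id, category):
        """A correction made in the ledger while the export is still in flight."""
        txn = self._txns[txn_id]
        txn.category = category
        txn.version += 1

    def apply_suggestion(self, snapshot, suggested_category):
        """Accept a returning suggestion only if the ledger has not moved on."""
        txn = self._txns[snapshot["txn_id"]]
        if txn.version != snapshot["version"]:
            return False  # stale snapshot: someone has to reconcile by hand
        txn.category = suggested_category
        txn.version += 1
        return True

    def category_of(self, txn_id):
        return self._txns[txn_id].category


def external_classifier(snapshot):
    """Stand-in for an LLM call; returns a fixed, made-up suggestion."""
    return "Office Supplies"


ledger = Ledger()
ledger.add(Transaction("t1", "ACME CLOUD 4411", "Uncategorized"))

snapshot = ledger.export("t1")                  # data leaves the system of record
ledger.correct("t1", "Software Subscriptions")  # the books change in the meantime
suggestion = external_classifier(snapshot)      # decision happens somewhere else
applied = ledger.apply_suggestion(snapshot, suggestion)

print(applied, ledger.category_of("t1"))
```

However accurate the classifier is, its answer was computed against a snapshot the ledger has already left behind. That gap between the snapshot and the ledger is the reconciliation work.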
Accuracy was never the bottleneck. Architecture is.
We’ve been here before: accounting software gets capable, and firms still spend their evenings closing the books. Capability alone has never delivered ROI.
In fact, every generation of accounting software increased what firms could do while also increasing the complexity of operating it. AI is the most capable version of that pattern yet, which means the risk of inheriting that complexity is higher than ever.
In accounting, intelligence becomes useful when it works within the rules, context, and history of the books it operates on.
The advantage of purpose-built systems comes from grounding intelligence in opinionated workflows, deterministic guardrails, and a complete audit trail, while preserving human judgment as the final authority.
That's what produced the 97.8% accuracy and sub-second turnaround in this benchmark. Not just a better model, but a system designed to manage that capability toward trusted outcomes.
