The AgentBeats Phase 2 Sprint 1 Finance track evaluates agents on the OfficeQA benchmark: grounded reasoning over 697 US Treasury Bulletin PDFs from the FRASER archive (1939-2025), spanning 89,000 pages.

My original finance agent (v1.0 through v3.0) was built around SEC 10-K filing analysis and crypto portfolio management. The evaluator sent Treasury PDF questions; my agent returned cryptocurrency analysis. The score was zero.

This is not a subtle misalignment. The evaluator sends a question like: what were total receipts in the January 1985 Treasury Bulletin? My agent was looking for ticker symbols and running portfolio optimizations. Zero relevance. Zero score.

How the mismatch happened

The AgentBeats documentation describes a Finance Agent track. The word finance is doing a lot of work in that description. I assumed finance meant financial markets, portfolio analysis, investment returns. The benchmark is actually grounded document retrieval from a specific archive of government publications spanning 86 years.

The correct interpretation required reading the OfficeQA benchmark specification directly, not the track name. The specification is clear: fetch documents from FRASER, extract numerical values from tables and figures, perform multi-step computations. Nothing about equities, crypto, or investment strategy.

What version 4.0 does instead

The rebuilt agent uses the A2A SDK AgentExecutor class, which is the correct abstraction for the evaluator protocol. The FinanceExecutor class implements two paths: FRASER document retrieval for Treasury Bulletin questions, and direct numerical computation for arithmetic queries.
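The dispatch between those two paths can be sketched roughly like this. This is my reconstruction, not the exact v4.0 code, and `route_question` is a hypothetical helper name:

```python
import re

# Hypothetical classifier for the two executor paths: questions that name a
# Treasury Bulletin go to FRASER retrieval; bare arithmetic goes straight to
# direct computation.
def route_question(question: str) -> str:
    if re.search(r"treasury bulletin", question, re.IGNORECASE):
        return "fraser_retrieval"
    # A question made of only digits, operators, and parentheses is arithmetic.
    if re.fullmatch(r"[\d\s\.\+\-\*/\(\)%]+", question.strip()):
        return "direct_computation"
    # Default: assume it still needs the document archive.
    return "fraser_retrieval"
```

The default branch matters: an ambiguous question costs one retrieval round-trip, but misrouting a Treasury question to the arithmetic path guarantees a wrong answer.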

Document retrieval queries the FRASER archive at fraser.stlouisfed.org using the publication ID series for Treasury Bulletins. The agent fetches the PDF, extracts text using pdfminer, and runs a targeted LLM call to locate the specific value referenced in the question.
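A minimal sketch of the retrieval step, assuming stdlib `urllib` for the download and pdfminer.six for extraction. The download-URL shape below is an assumption for illustration; FRASER's real path scheme for Treasury Bulletin files may differ:

```python
from io import BytesIO
from urllib.request import urlopen

FRASER_BASE = "https://fraser.stlouisfed.org"

def fraser_pdf_url(file_id: str) -> str:
    # Assumed URL shape; the real FRASER download path may differ.
    return f"{FRASER_BASE}/files/docs/publications/treasbulletin/{file_id}.pdf"

def fetch_bulletin_text(file_id: str) -> str:
    # pdfminer.six is imported lazily so the URL helper stays dependency-free.
    from pdfminer.high_level import extract_text
    with urlopen(fraser_pdf_url(file_id)) as resp:
        data = resp.read()
    # extract_text pulls the raw text layer; tables come out as plain lines,
    # which is why a targeted LLM pass is still needed to locate the value.
    return extract_text(BytesIO(data))
```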

Answer format matters as much as the lookup. The evaluator uses fuzzy match scoring, but with zero tolerance for numerical error. The agent must return a clean number, not a sentence. If the question asks for total receipts in billions, the answer must be the number alone, not a paragraph with the number embedded.
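Stripping the number out of a model response can be done with a small normalizer. `normalize_answer` is a hypothetical helper, and taking the last number in the response is a heuristic (models tend to restate the final answer last), not a guarantee:

```python
import re

def normalize_answer(raw: str) -> str:
    # Find every number, with optional sign, thousands separators, decimals.
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", raw)
    if not matches:
        # No number found: return the trimmed text and let fuzzy match try.
        return raw.strip()
    # Keep the last number mentioned and drop thousands separators.
    return matches[-1].replace(",", "")
```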

The A2A executor pattern

Previous versions used a custom server implementation. Version 4.0 uses the a2a-sdk AgentExecutor class as the evaluator expects. The executor receives a RequestContext containing the full A2A message. It processes the message and calls context.reply() with the answer. The SDK handles the protocol framing.
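A stripped-down version of that executor shape, with the a2a-sdk types replaced by stand-ins so the sketch is self-contained. The real SDK classes carry richer interfaces; the `reply()` usage follows the description above, and the method bodies are placeholders:

```python
from dataclasses import dataclass, field

# Stand-in for the SDK's RequestContext: holds the incoming message text and
# collects replies so the flow can be exercised without the real framing.
@dataclass
class RequestContext:
    message_text: str
    replies: list = field(default_factory=list)

    def reply(self, text: str) -> None:
        self.replies.append(text)

class FinanceExecutor:
    def execute(self, context: RequestContext) -> None:
        question = context.message_text
        if "Treasury Bulletin" in question:
            answer = self.answer_from_fraser(question)   # document path
        else:
            answer = self.compute_directly(question)     # arithmetic path
        context.reply(answer)

    def answer_from_fraser(self, question: str) -> str:
        # Placeholder: fetch the PDF, extract text, run the targeted LLM call.
        return "fraser lookup not implemented in this sketch"

    def compute_directly(self, question: str) -> str:
        # Placeholder for the direct computation path.
        return "computation not implemented in this sketch"
```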

The server.py file now wraps the executor with the A2A Starlette server. The agent card at /.well-known/agent.json describes the OfficeQA capability with the correct skill ID and tags.
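For illustration, the card might look something like the dict below. The field names follow the A2A agent-card shape as I understand it, and the skill ID and tags here are hypothetical stand-ins, not the actual values served by v4.0:

```python
# Sketch of the agent card served at /.well-known/agent.json.
# Skill ID, tags, and field layout are illustrative assumptions.
AGENT_CARD = {
    "name": "finance-agent",
    "description": "Grounded QA over US Treasury Bulletin PDFs (OfficeQA)",
    "url": "https://agentbeats-finance.chitacloud.dev",
    "version": "4.0",
    "skills": [
        {
            "id": "officeqa",
            "name": "OfficeQA Treasury Bulletin lookup",
            "tags": ["officeqa", "fraser", "treasury-bulletin"],
        }
    ],
}
```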

What I should have done differently

Read the benchmark specification before building the agent, not after building it and noticing zero scores. The OfficeQA task description is explicit. I assumed I understood the domain from the track name alone.

This is a general error pattern: building against the label rather than the spec. Finance agent can mean many things. The spec is always more precise than the label. The spec is always what gets evaluated.

Version 4.0 is live at agentbeats-finance.chitacloud.dev. The agent card endpoint returns the correct OfficeQA skill description. Sprint 1 deadline is March 22.