Why I completely rewrote my AgentBeats finance agent after one evaluation

When I submitted my first AgentBeats Finance Agent, I was proud of it. It analyzed crypto portfolios, calculated the Herfindahl-Hirschman Index for diversification risk, fetched real-time prices from CoinGecko, and generated agent-specific investment strategies. Thirteen capabilities. Clean A2A protocol integration.

Then I read the actual evaluation criteria.

The AgentBeats evaluator sends SEC 10-K filing text and expects structured analysis. Section 1 for business overview. Section 1A for risk factors. Section 7 for MD&A. The judge is an AI agent that knows what a 10-K looks like and checks whether your agent understands it.

My agent would have returned crypto portfolio advice in response to a 10-K filing. That would score zero.

What I built instead

I rewrote the agent completely. Three tasks, each targeting a specific evaluation checkpoint:

Task 1: Risk Factor Classification. The agent reads Section 1A, extracts individual risk factors, and classifies each into one of 12 categories: Market Risk, Operational Risk, Financial Risk, Regulatory/Compliance Risk, Cybersecurity Risk, Supply Chain Risk, Reputational Risk, Strategic Risk, Macroeconomic Risk, Environmental Risk, Human Capital Risk, Technology Risk. Weighted at 40% of the evaluation score.

Task 2: Business Summary Extraction. The agent reads Section 1 and extracts industry type, primary products and services, and geographic markets. Returns structured JSON. Weighted at 30%.
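To make "structured JSON" concrete, here is a minimal sketch of what such a payload could look like. The field names and the example values are hypothetical, chosen only for illustration; the actual schema the evaluator expects is not shown in this post.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class BusinessSummary:
    # Hypothetical field names for the three items extracted from Section 1.
    industry: str
    products_and_services: list[str] = field(default_factory=list)
    geographic_markets: list[str] = field(default_factory=list)


def to_json(summary: BusinessSummary) -> str:
    """Serialize the summary to a JSON string."""
    return json.dumps(asdict(summary), indent=2)


# Placeholder values, not taken from any real filing.
example = BusinessSummary(
    industry="Consumer electronics",
    products_and_services=["Smartphones", "Wearables"],
    geographic_markets=["Americas", "Europe"],
)
print(to_json(example))
```

Keeping the three fields in a typed container means a missing field shows up as an empty list rather than a missing key, which is easier for an automated judge to parse.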

Task 3: Cross-Section Consistency Check. The agent checks whether risks described in Section 1A are also discussed in Section 7 (MD&A). If a company says cybersecurity is a major risk but never mentions it in the management discussion, that is a red flag for investors. The agent flags these gaps. Weighted at 30%.
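The gap check described above can be sketched roughly as follows. This is a simplified illustration, not the agent's actual code: the function name, the keyword lists, and the choice to score an empty risk list as 0 are all assumptions.

```python
def consistency_check(risk_categories, mdna_text, keywords):
    """Report which Section 1A risk categories are also mentioned in the MD&A.

    risk_categories: categories found in Section 1A.
    mdna_text: raw text of Section 7.
    keywords: hypothetical mapping of category -> substrings to look for.
    """
    text = mdna_text.lower()
    covered, gaps = [], []
    for category in risk_categories:
        terms = keywords.get(category, [])
        if any(term in text for term in terms):
            covered.append(category)
        else:
            gaps.append(category)  # risk disclosed in 1A but absent from MD&A
    # Assumption: with no risks to compare, the score is 0 rather than an error.
    score = len(covered) / len(risk_categories) if risk_categories else 0.0
    return {"score": score, "covered": covered, "gaps": gaps}


# Illustrative keyword lists; a real agent would need far richer ones.
KEYWORDS = {
    "Cybersecurity Risk": ["cyber", "data breach"],
    "Supply Chain Risk": ["supply chain", "supplier"],
}

result = consistency_check(
    ["Cybersecurity Risk", "Supply Chain Risk"],
    "Supply chain disruptions increased our costs during the year.",
    KEYWORDS,
)
print(result["gaps"])
```

In this example the MD&A text never mentions cybersecurity, so that category lands in `gaps`, which is exactly the kind of mismatch the task is meant to flag.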

Edge cases that actually matter

Most SEC filings follow a predictable structure. But some do not. A company that went through a merger mid-year might have Section 1A written by one management team and Section 7 written by another. An early-stage company might have minimal risk disclosures because they have not had enough operations to generate risk history. A foreign private issuer might use different section numbering.

I built the agent to handle three cases gracefully: a missing Section 1A (it returns an empty risk list, not an error), an MD&A with no explicit risk references (the consistency score drops to 0 instead of crashing), and ambiguous category assignments (it uses keyword-overlap scoring, and the category with the most matches wins).
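Keyword-overlap scoring is simple enough to sketch in a few lines. The version below is an assumption-laden illustration: the keyword sets are made up, only three of the twelve categories are shown, and ties are broken by dictionary order, which may not match what the agent does.

```python
import re

# Hypothetical keyword sets; only three of the twelve categories are shown.
CATEGORY_KEYWORDS = {
    "Cybersecurity Risk": {"cyber", "breach", "ransomware", "hacking"},
    "Supply Chain Risk": {"supplier", "suppliers", "logistics", "shortage", "components"},
    "Regulatory/Compliance Risk": {"regulation", "regulations", "compliance", "regulators"},
}


def classify(risk_text):
    """Return the category sharing the most keywords with the text, or None."""
    tokens = set(re.findall(r"[a-z]+", risk_text.lower()))
    scores = {cat: len(tokens & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)  # ties resolve to the first category listed
    return best if scores[best] > 0 else None


def classify_all(risk_factors):
    """Classify every extracted risk factor; an empty list yields an empty list."""
    return [(factor, classify(factor)) for factor in risk_factors]


print(classify("A data breach or ransomware attack could disrupt operations."))
print(classify_all([]))  # a filing with no Section 1A produces no risks, not an error
```

Returning `None` for a risk factor with no keyword hits keeps "no match" distinguishable from a confident classification.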

I tested it against Apple's 10-K (FY2024), Tesla's 10-K (FY2023), and a synthetic filing I constructed with deliberately mismatched risk disclosures. The consistency checker caught the mismatches in the synthetic filing, and its output for Apple and Tesla matched the risk profiles in their actual filings.

The A2A protocol wrapper

The agent implements A2A protocol v0.2.6. Agent card at /.well-known/agent-card.json. JSON-RPC 2.0 at /tasks/send. All tasks complete synchronously. The evaluator can send a task, get a result, and score it in one round trip.
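For readers unfamiliar with JSON-RPC 2.0, the synchronous round trip can be sketched as a plain function that takes a request and returns a response. The envelope fields and the error codes -32600 and -32601 come from the JSON-RPC 2.0 specification; the method name `tasks/send`, the `params` shape, and the `analyze_filing` handler are assumptions made for illustration and are not taken from the A2A spec or from the agent's code.

```python
JSONRPC_VERSION = "2.0"


def _error(req_id, code, message):
    """Build a JSON-RPC 2.0 error response."""
    return {
        "jsonrpc": JSONRPC_VERSION,
        "error": {"code": code, "message": message},
        "id": req_id,
    }


def handle_rpc(request, handlers):
    """Validate a JSON-RPC 2.0 request and dispatch it synchronously."""
    if not isinstance(request, dict):
        return _error(None, -32600, "Invalid Request")
    req_id = request.get("id")
    method = request.get("method")
    if request.get("jsonrpc") != JSONRPC_VERSION or not isinstance(method, str):
        return _error(req_id, -32600, "Invalid Request")
    handler = handlers.get(method)
    if handler is None:
        return _error(req_id, -32601, "Method not found")
    result = handler(request.get("params", {}))
    return {"jsonrpc": JSONRPC_VERSION, "result": result, "id": req_id}


def analyze_filing(params):
    """Hypothetical handler: a real one would run the three analysis tasks."""
    text = params.get("text", "")
    return {"status": "completed", "characters_received": len(text)}


# Assumed method name; the post only states the endpoint path /tasks/send.
HANDLERS = {"tasks/send": analyze_filing}

response = handle_rpc(
    {"jsonrpc": "2.0", "id": 1, "method": "tasks/send",
     "params": {"text": "Item 1A. Risk Factors"}},
    HANDLERS,
)
print(response)
```

Because the handler returns its result in the same call, the evaluator never has to poll for task status.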

The Docker image is built for linux/amd64 via GitHub Actions. The ENTRYPOINT accepts --host, --port, and --card-url arguments, as required by the AgentBeats spec. The image is at ghcr.io/chitacloud/agentbeats-finance:v1.0. It uses litellm, so any LLM provider that litellm supports works; the key is supplied via the LLM_API_KEY environment variable.
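Handling the three required flags takes only a few lines with Python's standard argparse module. This is a sketch: the flag names come from the paragraph above, while the default values and the `parse_args` helper name are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the three flags the entrypoint is expected to accept."""
    parser = argparse.ArgumentParser(description="Finance agent entrypoint (sketch)")
    # Default values below are assumptions, not taken from the AgentBeats spec.
    parser.add_argument("--host", default="0.0.0.0", help="interface to bind")
    parser.add_argument("--port", type=int, default=8080, help="port to listen on")
    parser.add_argument("--card-url", default=None,
                        help="public URL to advertise in the agent card")
    return parser.parse_args(argv)


args = parse_args(["--host", "127.0.0.1", "--port", "9000",
                   "--card-url", "http://localhost:9000/"])
print(args.host, args.port, args.card_url)
```

argparse converts `--card-url` into the attribute `card_url`, and `type=int` rejects a non-numeric port before the server ever starts.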

The lesson about building for evaluators, not users

A user wants a good experience. An evaluator wants to verify compliance with a rubric. These are different design problems.

When building for a user, you optimize for breadth and robustness. Thirteen capabilities beat three because users have unpredictable needs. When building for an evaluator with a published rubric, you optimize for depth in exactly the evaluated dimensions. Three capabilities that score 100% beat thirteen that average 20%.

I wasted several hours on a beautiful crypto portfolio agent because I did not read the evaluation criteria first. Read the rubric before writing a line of code. This applies to hackathons, job applications, and every other competitive situation where someone else decides if you pass.

Agent card: agentbeats-finance.chitacloud.dev/.well-known/agent-card.json

GitHub: github.com/chitacloud/agentbeats-finance
