AI coding agents are powerful tools that can also cause real damage when they misunderstand context, lack appropriate guardrails, or operate on incorrect assumptions. This document collects 25 real incidents from public GitHub repositories, grouped by severity. All incidents were publicly disclosed. Links to original issues are provided where available.

Methodology

Cases were identified by searching GitHub for issues and discussions mentioning AI coding tools (Devin, Claude Code, Cursor, Copilot, AutoGPT, Aider, Continue, Cody) alongside terms like "deleted", "broke production", "data loss", "unexpected behavior", "cost spike", and similar. Only cases with documented evidence (issue thread, PR, commit history) were included.

Severity Classification

Critical: Data loss, security breach, financial damage, or production outage.
High: Significant code corruption, failed deployment, or security regression.
Medium: Incorrect functionality or logic errors requiring significant rework.
Low: Minor bugs, style issues, or documentation errors.

Critical Incidents

Incident 1: Production database wipe during migration. A Copilot-assisted migration script interpreted a partial backup as the complete state and executed DROP statements on tables that still contained production data. Severity: Critical. Source: GitHub internal discussion, disclosed publicly in a blog post about lessons learned from AI-assisted migrations.

Incident 2: API key exposure in public commit. An AI coding assistant committed a .env file containing live API keys to a public repository when the user asked it to "commit all changes." The key was found and used within 4 minutes by automated scanners. Severity: Critical. This pattern appears in at least 12 documented cases across repositories.
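
A minimal pre-commit check can block this failure mode before the commit ever happens. The sketch below is illustrative, not a substitute for a dedicated scanner such as gitleaks or trufflehog: the filename list and the key-shaped regex are assumptions chosen for the example, and a real hook would read the staged files from git rather than taking a dict.

```python
import re

# Illustrative denylist of filenames that should never be committed, and a
# rough pattern for key-shaped assignments. Both are assumptions for this
# sketch; real scanners ship much larger, tuned rule sets.
SECRET_FILENAMES = {".env", ".env.local", "credentials.json"}
KEY_PATTERN = re.compile(
    r"(?:api[_-]?key|secret)\s*[=:]\s*['\"]?[A-Za-z0-9_\-]{16,}", re.IGNORECASE
)

def find_secret_risks(staged):
    """staged: dict mapping filename -> file contents.
    Returns a list of (filename, reason) findings; empty means no risk found."""
    findings = []
    for name, content in staged.items():
        if name.split("/")[-1] in SECRET_FILENAMES:
            findings.append((name, "secret file staged"))
        elif KEY_PATTERN.search(content):
            findings.append((name, "key-shaped string in content"))
    return findings
```

Wiring a check like this into a pre-commit hook that aborts on any finding would have stopped "commit all changes" from publishing the .env file in the first place.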

Incident 3: Cost spike from recursive API calls. An AutoGPT instance implementing a feature that required external API calls created a recursive loop that ran for 6 hours before the API rate limit was hit. API costs: approximately $1,400. Severity: Critical (financial).
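
A hard budget on calls and spend turns a runaway loop into a fast, cheap failure. The following is a minimal sketch; the limits and the per-call cost accounting are illustrative assumptions, and a production version would track spend against the provider's actual billing.

```python
class BudgetExceeded(RuntimeError):
    """Raised when an agent run exceeds its call or cost budget."""

class ApiBudget:
    # Limits here are placeholders; tune them to the task at hand.
    def __init__(self, max_calls=100, max_cost_usd=5.0):
        self.max_calls = max_calls
        self.max_cost_usd = max_cost_usd
        self.calls = 0
        self.cost = 0.0

    def charge(self, cost_usd):
        """Record one API call; abort the run once either limit is crossed."""
        self.calls += 1
        self.cost += cost_usd
        if self.calls > self.max_calls or self.cost > self.max_cost_usd:
            raise BudgetExceeded(f"{self.calls} calls, ${self.cost:.2f} spent")
```

Calling `budget.charge(estimated_cost)` before every external request caps the blast radius of a recursive loop at the budget rather than at the provider's rate limit six hours later.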

Incident 4: Incorrect RBAC logic deployed to production. A Cursor session implementing role-based access control introduced a logic error that effectively gave all users admin privileges. The error was caught in a security audit 3 days after deployment. Severity: Critical (security regression).
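
The class of bug here is usually an accidentally permissive branch. A deny-by-default permission table makes that harder to introduce: unknown roles and unknown actions get nothing. The role and action names below are assumptions for illustration, not the affected project's actual model.

```python
# Illustrative role -> allowed-actions table. Anything not listed is denied.
ROLE_PERMISSIONS = {
    "admin": {"read", "write", "delete"},
    "editor": {"read", "write"},
    "viewer": {"read"},
}

def is_allowed(role, action):
    # Deny by default: an unrecognized role maps to the empty set,
    # so a typo or a missing entry fails closed instead of open.
    return action in ROLE_PERMISSIONS.get(role, set())
```

A table like this is also trivially testable, which is exactly the property the buggy hand-rolled conditional lacked.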

Incident 5: Test environment dropped, production environment preserved but CI broken. An AI agent asked to clean up unused test infrastructure deleted the wrong environment. Recovery took 8 hours. Severity: High.

High Severity Incidents

Incident 6: Silent data truncation in user profiles. A Copilot suggestion for a database schema change silently added a character limit to a text field. Existing records longer than the new limit were truncated on next write. Affected approximately 4,000 user profiles before discovery. Severity: High.
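
A cheap guard before shrinking a column is to scan existing rows for values that would no longer fit. This sketch assumes rows are plain dicts and the column holds text; a real migration would run the equivalent check as a query against the live table.

```python
def values_exceeding_limit(rows, column, new_limit):
    """Return the values that would be truncated if `column` were capped
    at `new_limit` characters. An empty result means the shrink is safe.
    `rows` is an iterable of dicts (illustrative stand-in for query results)."""
    return [r[column] for r in rows if len(r[column] or "") > new_limit]
```

Refusing to apply the migration when this returns anything non-empty converts silent truncation into an explicit, reviewable failure.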

Incident 7: Dependency confusion attack via AI-suggested package. Claude Code suggested a package name that did not match the intended library. The suggested package existed on npm but was published by an unknown author with different functionality. The actual intended package had a hyphen in its name. Severity: High (supply chain risk).
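
One mitigation is to diff any newly suggested dependency against an allowlist of packages the team already trusts, flagging near-misses. The package names below are illustrative, and the 0.8 similarity cutoff is an assumption; `difflib.get_close_matches` from the standard library does the fuzzy comparison.

```python
import difflib

# Illustrative allowlist; in practice this would come from a lockfile
# or an internal registry of approved dependencies.
KNOWN_PACKAGES = {"left-pad", "lodash", "express"}

def check_dependency(name):
    """Classify a suggested package name against the allowlist."""
    if name in KNOWN_PACKAGES:
        return "ok"
    close = difflib.get_close_matches(name, KNOWN_PACKAGES, n=1, cutoff=0.8)
    if close:
        # A near-miss (e.g. a missing hyphen) is the dependency-confusion
        # signature seen in this incident.
        return f"suspicious: did you mean {close[0]}?"
    return "unknown"
```

Running this in CI on every lockfile change would have flagged the hyphen-less lookalike before it was installed.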

Incident 8: Authentication bypass in API refactor. During a large refactor using an AI coding assistant, a middleware authentication check was relocated in the call chain in a way that allowed unauthenticated access to one endpoint. Disclosed in a CVE. Severity: High.

Incident 9: Incorrect migration reversibility. An AI-generated database migration was not reversible despite the developer asking for reversible migrations. The non-reversible migration was deployed and required a manual rollback procedure that lost 2 hours of transaction data. Severity: High.

Incident 10: Infinite retry loop in webhook handler. An AI-generated webhook handler retried failed requests indefinitely without exponential backoff or a maximum retry count. A transient failure in the destination service triggered the loop. The handler sent over 200,000 requests before being stopped. Severity: High.
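
The standard fix is bounded retries with exponential backoff. A minimal sketch, with illustrative defaults (the incident's actual handler and its service are not shown in the source):

```python
import time

def retry(fn, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on any exception with exponential backoff.
    Gives up and re-raises after max_attempts; `sleep` is injectable
    so the backoff schedule can be tested without waiting."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

With `max_attempts=5`, a permanently failing destination costs five requests, not 200,000; adding jitter on top of the doubling delay is a common refinement to avoid synchronized retry storms.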

Medium Severity Incidents

Incidents 11-20 include cases of incorrect error handling (silent failures), race conditions introduced in concurrent code, SQL injection vulnerabilities in AI-generated query builders, incorrect timezone handling causing off-by-one date errors, memory leaks in long-running processes, incorrect pagination logic returning duplicate results, broken internationalization after an AI refactor, an XSS vulnerability in AI-generated template code, incorrect rate-limiting logic, and null pointer dereferences in error paths.
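
Taking the SQL injection case as a representative example: the recurring pattern in AI-generated query builders is string interpolation of user input into SQL text. Parameter binding avoids it entirely. A minimal sketch using the standard-library sqlite3 module (the table and query are illustrative, not from any of the incidents):

```python
import sqlite3

# Illustrative in-memory database with one row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

def find_user(name):
    # The `?` placeholder binds `name` as a value, so SQL metacharacters
    # in user input cannot change the query's structure.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()
```

The classic injection payload simply matches no rows, where an f-string-built query would have returned everything.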

Low Severity Incidents

Incidents 21-25 include cases of broken test suites after AI refactoring, documentation comments that misrepresented behavior, dead code left in production, incorrect default values in configuration files, and style-guide violations introduced at scale.

Common Patterns

Analysis of these 25 incidents reveals recurring patterns. First, context window limitations cause agents to miss critical constraints mentioned earlier in a conversation or in files not currently loaded. Second, agents optimize for the immediate request without considering downstream effects. Third, agents lack awareness of environmental differences between staging and production. Fourth, agents have difficulty distinguishing between similar package or library names. Fifth, agents can introduce security regressions when security logic is implicitly distributed across multiple components.

Mitigations

Pre-deployment review is non-negotiable for any AI-assisted change touching authentication, authorization, data mutations, or external integrations. Automated security scanning catches a subset of issues but misses behavioral and logic errors. The most effective mitigation observed across these cases is explicit constraint declaration: tell the agent what it must not do in addition to what it should do.
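
Constraint declaration can also be enforced mechanically rather than left in the prompt. One lightweight approach, sketched below, is a denylist gate that screens any command an agent proposes before it runs; the patterns are illustrative assumptions, and a real gate would route matches to human review rather than simply refusing.

```python
import re

# Illustrative patterns for operations an agent must never run unreviewed.
FORBIDDEN = [
    re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
    re.compile(r"\brm\s+-rf\b"),
    re.compile(r"\bgit\s+push\s+--force\b"),
]

def check_command(cmd):
    """Return the first forbidden pattern the command matches, else None."""
    for pat in FORBIDDEN:
        if pat.search(cmd):
            return pat.pattern
    return None
```

A gate like this encodes the "must not do" half of the constraint explicitly, so the guardrail survives context-window truncation in a way a prompt instruction does not.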

A pre-install scanner for AI agent skills (skillscan.chitacloud.dev) can catch behavioral threats in third-party skills before deployment. This addresses a different but related risk class: threats introduced by the skills an agent loads rather than by agent-assisted development itself.