What We’ve Learned Refactoring Legacy Code with AI

Every developer knows the feeling. You open an inherited codebase, and you’re immediately lost. Variable names that made perfect sense to someone in 2017 now read like cryptic puzzles. Functions stretch across hundreds of lines. Comments, where they exist at all, reference tickets from a project management system the company stopped using three years ago.
This is legacy code. And for most development teams, refactoring it has always been somewhere between “painful necessity” and “thing we avoid until something breaks.”
Over the past year, we’ve been exploring how AI tools change this equation. Not as a replacement for developer judgment – that’s a dangerous fantasy – but as something more practical: an assistant that can help shoulder the cognitive burden of understanding code you didn’t write.
Old Code Means Excessive Cognitive Load
Before diving into tools and techniques, it’s worth understanding why legacy code is so difficult to work with. It’s not primarily a technical challenge. It’s a cognitive one.
Research on cognitive load in software development points to a hard constraint: the average person can hold roughly four “chunks” of information in working memory at once. Push past that threshold and comprehension degrades quickly; you start losing track of what you read a moment ago.
When you’re trying to trace execution flow through an unfamiliar codebase – keeping track of variable states, understanding dependencies, remembering which function called what – you hit that limit fast.
This is why that “smart” architecture with elegant design patterns can actually make things worse. We’ve seen it ourselves: impressive architectures using the latest patterns, but when new team members try to make changes, they spend days just trying to understand how everything connects before writing a single line.
Legacy systems amplify this problem. The original developers aren’t available to explain their thinking. Documentation is sparse or outdated. Business rules are embedded in code rather than documented anywhere. You’re not just reading code – you’re doing archaeology.
How AI Actually Helps
Here’s where AI tools enter the picture. And here’s where we need to be honest about what they can and can’t do.
“LLM notices what you might miss and can miss what you notice. An imperfect assistant.”
Kęstutis A. · Developer
That framing is crucial. AI tools aren’t oracles. They’re non-deterministic – ask the same question twice and you might get different answers. They have blind spots, but importantly, those blind spots are often different from ours.
Where AI tools excel:
Quick ad hoc code analysis. Need to understand what a specific function does before modifying it? Wondering if a particular pattern is used elsewhere? Traditional analysis means setting up tools, configuring rules, waiting for scans. With AI, you just ask. It’s not as thorough as formal static analysis, but for quick “what’s going on here?” questions during active development, the speed is unmatched.
Rapid code comprehension. Developers report lower cognitive load when they use AI assistants to understand unfamiliar libraries than when they work from traditional documentation: contextual, natural-language explanations cut the overhead of navigating and interpreting code you have never seen. You can paste a 200-line function into an AI assistant and ask “what does this do?” The response won’t be perfect, but it gives you a starting point that would otherwise take 20-30 minutes of careful reading to develop. (A small sketch of this kind of quick query appears after this list.)
Pattern recognition across files. AI can spot duplicated logic spread across a codebase faster than manual searching. It catches where similar patterns are used inconsistently, or where the same business rule is implemented slightly differently in three places.
Documentation generation. AI can produce decent documentation for code it just analyzed, which means explanations stay current with actual behavior. Not perfect prose, but far better than the nothing that usually exists in legacy systems.
Suggesting refactoring approaches. When you’re staring at a tangled function and can’t see a way forward, AI can propose restructuring options. They’re quite often good enough to get you moving.
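To make the “just ask” workflow above a bit more concrete, here is a minimal sketch of the kind of helper we keep around for quick comprehension questions. Everything in it is illustrative: llm_complete stands in for whichever model API or local model your team actually uses, and legacy_billing is a hypothetical module.

```python
import inspect


def llm_complete(prompt: str) -> str:
    """Hypothetical wrapper around whatever model you use (hosted API or local)."""
    raise NotImplementedError("plug in your model client here")


def explain(obj) -> str:
    """Ask the model for a plain-language summary of a function or class."""
    source = inspect.getsource(obj)  # pull the real code, not your memory of it
    prompt = (
        "You are helping a developer understand unfamiliar legacy code.\n"
        "Explain what the following code does, list its side effects,\n"
        "and flag anything that looks like an implicit business rule.\n\n"
        + source
    )
    return llm_complete(prompt)


# Usage (hypothetical legacy module):
#   from legacy_billing import recalculate_invoice
#   print(explain(recalculate_invoice))
```

The point isn’t the code itself, it’s the turnaround: a question like this takes seconds to ask, versus minutes of careful reading.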
Where AI tools struggle:
Institutional knowledge. AI doesn’t know that the weird date handling in that function exists because of a contract with a client who insisted on a non-standard format. It doesn’t know that the seemingly redundant validation was added after a production incident three years ago. AI tools have learned from millions of lines of both brilliant and terrible code, and they can’t tell the difference between a Stack Overflow hack from 2009 and production-quality architecture. And when code is truly messy – convoluted logic, misleading names, no clear structure – AI struggles just as much as humans do, sometimes more.
Subtle business logic. AI can tell you what code does mechanically. It often can’t tell you why that particular approach was chosen, or whether changing it will break assumptions elsewhere in the system.
Large tasks with complex instructions. AI works best with focused, specific requests. Stack too many requirements – restructure this, rename those, maintain compatibility, format it this way – and quality drops. Instructions get partially followed or ignored entirely. Formatting is particularly unreliable: ask for specific output structure, consistent styling, or particular syntax, and you’ll spend time fixing what AI got wrong. Same with large contexts: feed it too much code at once and it loses track of details that matter.
Confidently wrong suggestions. This is the dangerous one. AI may suggest technically sound refactoring that breaks business logic or domain-specific requirements. It will suggest APIs that don’t exist in the version of the library you’re using. It will confidently restructure code in ways that break edge cases.
The solution isn’t to distrust AI output entirely; it’s to remember who is driving. AI tools provide valuable insights and suggestions, but the developer, who knows the system and the business, makes the final call.
Combining Static Analysis with AI
One approach that’s worked well for us: Static code analysis + LLM.
Traditional static analysis tools (SonarQube, CodeScene, and similar) are excellent at identifying where problems exist. They’ll flag high cyclomatic complexity, code duplication, potential bugs, and security vulnerabilities. What they’re not great at is helping you fix those problems in context, or connecting individual findings to the bigger architectural patterns behind them.
AI fills that gap. Automated tools for static code analysis can help identify code smells, dead code, and potential refactoring candidates, while AI-driven code generation and suggestion engines can propose real-time improvements, detect subtle bugs, and offer refactoring strategies.
This combination plays to each tool’s strengths. The static analyzer provides objective, consistent identification of issues. The AI provides contextual suggestions for addressing them. Neither alone is as effective.
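As a rough illustration of the combination, here is a sketch that runs a complexity scan and hands the worst offenders to a model for suggestions. It assumes the radon analyzer is installed, that its cc command supports JSON output via --json, and that the report maps file names to blocks with name and complexity fields; it also reuses the hypothetical llm_complete wrapper from the earlier sketch. Adapt it to whatever analyzer and model client you actually use.

```python
import json
import subprocess

from llm_client import llm_complete  # hypothetical wrapper, as in the earlier sketch


def complexity_hotspots(path: str, threshold: int = 10) -> list[dict]:
    """Run radon's cyclomatic-complexity check and keep only the worst offenders."""
    raw = subprocess.run(
        ["radon", "cc", "--json", path],
        capture_output=True, text=True, check=True,
    ).stdout
    report = json.loads(raw)  # maps filename -> list of analysed blocks
    return [
        {"file": filename, "function": block["name"], "complexity": block["complexity"]}
        for filename, blocks in report.items()
        for block in blocks
        if block["complexity"] >= threshold
    ]


def suggest_refactorings(path: str) -> str:
    """Hand the analyzer's findings to the model and ask for targeted suggestions."""
    hotspots = complexity_hotspots(path)
    prompt = (
        "A static analyzer flagged these functions as overly complex:\n"
        + json.dumps(hotspots, indent=2)
        + "\nFor each one, suggest a refactoring strategy (extract method, "
        "guard clauses, a strategy object, ...) and which tests to write first."
    )
    return llm_complete(prompt)
```

The analyzer decides where to look; the model only ever sees the flagged spots, which keeps prompts small and suggestions focused.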
Better yet, AI can operate static analysis tools directly when given access – running scans, interpreting results, and suggesting fixes in one workflow. This is where protocols like MCP come in.
MCP: Expanding the Toolbox
We’ve been experimenting with the Model Context Protocol (MCP) – an open standard that lets AI applications connect to external tools and data sources through a consistent interface.
MCP was announced by Anthropic in November 2024 as an open standard for connecting AI assistants to data systems such as content repositories, business management tools, and development environments. It aims to address the challenge of information silos and legacy systems.
What makes MCP interesting for legacy code work is the breadth of tools it can bring together. Integrated development environments, coding platforms such as Replit, and code intelligence tools like Sourcegraph have adopted the protocol to grant AI coding assistants real-time access to project context.
Instead of manually copying code into chat windows, MCP-enabled AI assistants can directly access your codebase, database schemas, documentation, and development tools. The AI gets more context, which means better suggestions.
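As an example of what that looks like in practice, here is a minimal sketch of an MCP server exposing two read-only tools over a legacy repository. The FastMCP usage follows the quickstart pattern of the official MCP Python SDK (check current docs before relying on it); the tool names, the repository path, and the grep-based search are our own simplifications.

```python
import subprocess

from mcp.server.fastmcp import FastMCP  # official MCP Python SDK

mcp = FastMCP("legacy-codebase")

REPO = "/path/to/legacy/repo"  # placeholder path


@mcp.tool()
def find_references(symbol: str) -> str:
    """List lines in the repository that mention the given symbol."""
    result = subprocess.run(
        ["grep", "-rn", symbol, REPO],
        capture_output=True, text=True,
    )
    return result.stdout or f"No references to {symbol} found."


@mcp.tool()
def read_file(relative_path: str) -> str:
    """Return a file's contents so the assistant can reason about the actual code."""
    with open(f"{REPO}/{relative_path}", encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    mcp.run()
```

Once an MCP-aware IDE or desktop assistant connects to a server like this, “who references OrderValidator?” becomes a tool call instead of a copy-paste exercise.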
Practical Workflows We’ve Actually Used
After months of experimentation, here’s what’s actually worked for our team when tackling legacy code:
Start with orientation, not modification. Before touching anything, use AI to generate a high-level explanation of the codebase structure. Modern AI tools can build a living model of architectural patterns, data flows, and hidden coupling. Instead of grepping for class names, developers can ask “Who mutates customer credit limits?” and get a useful answer in seconds – not always precise, but a solid starting point.
Generate tests before refactoring. This seems obvious, but it’s worth emphasizing. Refactoring without sufficient test coverage is how bugs slip in, so establish coverage before any major restructuring. Use AI to generate tests first, then refactor. AI is surprisingly good at generating unit tests for existing code – tests that capture current behavior, including edge cases you might miss. A characterization-test sketch follows this list.
Work in small increments. The temptation with AI-assisted refactoring is to do everything at once. Resist it. Teams that use a phased refactoring strategy reduce project risk while maintaining business continuity. Make one small change, verify it works, commit. Then the next. AI makes each individual change faster, but the discipline of incremental progress remains essential.
Review everything. We’ve adopted a rule: no AI-generated code merges without human review. Always review and understand each suggestion. Use AI as a smart assistant, not a replacement for engineering judgment.
Keep refactoring commits separate. Combining refactoring with new features makes it hard to isolate issues and roll back changes. Keep refactoring commits separate from feature development and use dedicated refactoring branches.
Measure before and after. Track cyclomatic complexity, test coverage, and similar metrics before starting and after finishing. Tracking both technical debt reduction and developer velocity enables continuous improvement. This isn’t just for proving value to management – it’s for understanding whether your efforts are actually improving things.
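To make the test-first point concrete, here is a minimal characterization-test sketch in pytest. The module and function names are hypothetical; the important part is that the expected values are recorded from the current implementation, not from what the code “should” do, so any behavior change during refactoring shows up immediately.

```python
import pytest

# Hypothetical legacy module we're about to refactor.
from legacy_pricing import calculate_discount

CASES = [
    # (customer_type, order_total, expected_discount)
    # Expected values were captured by running the *current* implementation once.
    ("retail", 100.00, 0.00),
    ("wholesale", 100.00, 12.50),
    ("wholesale", 0.00, 0.00),    # edge case: empty order
    ("retail", -5.00, 0.00),      # edge case: negative total (yes, it happens)
]


@pytest.mark.parametrize("customer_type, order_total, expected", CASES)
def test_discount_matches_current_behavior(customer_type, order_total, expected):
    assert calculate_discount(customer_type, order_total) == pytest.approx(expected)
```

AI is good at proposing cases like these, including the ugly edge cases; your job is to confirm the recorded outputs really come from the existing code before trusting the suite as a safety net.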
What We’re Still Figuring Out
We’re not presenting this as a solved problem. Several questions remain open for us:
Security and confidentiality. Enterprise code often contains proprietary business logic, credentials, or references to internal systems. Any enterprise worth its salt isn’t pumping that code into public models: depending on the provider’s terms, you may be handing your intellectual property to a service your competitors also use. Some teams run local models; others rely on enterprise agreements with AI providers. We’re still evaluating the tradeoffs.
Skill development. Leaning on AI as a crutch can hinder long-term skill growth. It’s a tool to augment, not replace, developer expertise. We want AI to accelerate our work, not become something we can’t function without.
When to refactor at all. AI makes refactoring faster, but that doesn’t mean every piece of legacy code should be refactored. Some systems are stable, working, and don’t need constant feature development. The decision of whether to refactor remains a human judgment call.
The Bigger Picture
Surveys regularly rank technical debt among developers’ top frustrations, and companies spend millions every year maintaining legacy systems. These aren’t new problems. What’s new is having tools that can meaningfully accelerate how we address them.
The Salesforce engineering team recently described using AI-driven refactoring to compress what would have been a two-year legacy migration into four months. They defined transformation rules that instructed AI to generate object-oriented service layers with dependency injection and clear separation of state. Engineers reviewed output to ensure alignment with patterns, adjusting rules iteratively as deeper refactoring needs surfaced.
Their approach – generating dependency graphs, migrating leaf-level code first, then working up to more complex components – shows how AI can enable systematic modernization that would be impractical to do manually.
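We haven’t seen Salesforce’s tooling, but the leaf-first ordering itself is straightforward to sketch: build a module-level dependency map and walk it in topological order, so nothing is migrated before the things it depends on. The dependency data below is invented for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each module lists the modules it depends on.
DEPENDENCIES = {
    "billing.invoice": {"billing.tax", "core.money"},
    "billing.tax": {"core.money"},
    "core.money": set(),
    "reports.monthly": {"billing.invoice", "core.money"},
}

# static_order() yields leaves first ("core.money"), then modules whose
# dependencies have already been handled, ending with the most entangled code.
migration_order = list(TopologicalSorter(DEPENDENCIES).static_order())
print(migration_order)
# e.g. ['core.money', 'billing.tax', 'billing.invoice', 'reports.monthly']
```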
AI doesn’t eliminate the difficulty of legacy code work. It changes the economics enough that work previously too expensive to justify becomes feasible.
At Agmis, we use AI tools in our own software development services every day – not as a marketing talking point, but because they genuinely help us deliver better results for clients. If you’re dealing with legacy systems that need modernization, we’d be happy to discuss approaches that might work for your situation.