Why Your AI Security Tools Need to Learn, Not Just Analyze
Friday, October 31, 2025

Josh Larsen
Co-Founder and CTO
Let's address the elephant in the room: context engineering has become table stakes in AI-powered security analysis. Most frontier model providers now openly acknowledge the reality of context erosion. Even models that technically support massive context windows (1M+ tokens) can't truly make effective use of the window beyond a certain practical limit.
This isn't a controversial take anymore; it's industry consensus.
But here's the uncomfortable question: if everyone now understands that throwing more context at an LLM won't solve code security analysis, why are we still seeing security tools that operate as if it will?
The answer reveals a deeper problem. Context engineering was never the bottleneck. It was masking three more fundamental issues that most solutions in this space haven't solved: error compounding in multi-step analysis, shallow validation that misses process failures, and systems that operate in isolation from the institutional knowledge that makes findings actionable.
At Ghost, we've spent the last six months rebuilding our analysis engine from the ground up to address these issues. What we've learned has challenged some long-held assumptions about what "effective" AI security analysis actually means.
The Error Compounding Problem Nobody Wants to Talk About
Here's a thought experiment: imagine an agentic workflow where each analytical step is 95% accurate. Sounds pretty good, right?
Now run the math.
A three-step process is now only about 86% accurate. A ten-step process drops to roughly 60%. Is 60% accuracy tolerable in your business? Would you ship a security tool that gets roughly 40% of its analyses wrong? If we've learned anything from the previous generation of static code analysis tools (which by some accounts had even lower efficacy rates), it's that inaccurate tools have a compounding effect on the time security and engineering teams waste heroically trying to triage all that noise.
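If you want to check that arithmetic yourself, the compounding reduces to a one-liner (assuming independent steps, each correct with the same probability):
```typescript
// Back-of-the-envelope model: end-to-end accuracy of a pipeline whose steps
// are independent and each correct with probability `stepAccuracy`.
function pipelineAccuracy(stepAccuracy: number, steps: number): number {
  return Math.pow(stepAccuracy, steps);
}

for (const steps of [1, 3, 5, 10]) {
  const pct = (pipelineAccuracy(0.95, steps) * 100).toFixed(1);
  console.log(`${steps} step(s): ${pct}% end-to-end accuracy`);
}
// 1 step(s): 95.0%   3 step(s): 85.7%   5 step(s): 77.4%   10 step(s): 59.9%
```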
The reality is that we risk repeating the same fatal mistake, in new and slightly different ways, in the era of AI-powered everything. Error compounding can be the silent killer of LLM-based security analysis. No model is perfect at zero-shot or few-shot analysis, and code can be staggeringly complex. Every analytical step introduces the possibility of error, and those errors compound as the analysis deepens, eroding end-to-end accuracy exponentially.
Traditional approaches try to solve this with bigger models, more context, or better prompts. These help at the margins. But they don't address the fundamental issue: without a mechanism to course-correct during analysis, errors accumulate until the final output is unreliable.
This is why adaptive learning isn't a nice-to-have feature; it's a structural necessity. An adaptive learning process that validates and corrects at each analytical step prevents error compounding. When done correctly, it creates an ensemble effect where results asymptotically approach 100% correctness, rather than drifting further from accuracy with each step.
Ghost's analysis engine now implements continuous self-correction throughout the analytical process. When an agent identifies a potential vulnerability, subsequent validation steps verify not just the finding itself, but the reasoning that led to it. If the reasoning is flawed, the system backtracks and re-analyzes with corrected assumptions. This isn't expensive prompt engineering—it's fundamental architectural design.
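To make the pattern concrete, here is a minimal sketch of a validate-and-backtrack loop, with hypothetical types and function names; it illustrates the idea, not Ghost's actual implementation:
```typescript
// Sketch of a validate-and-backtrack loop. All names and types here are
// hypothetical and simplified; they illustrate the pattern, not Ghost's engine.
interface Finding {
  claim: string;
  reasoning: string[];    // the analytical steps that led to the claim
  assumptions: string[];  // assumptions the agent relied on along the way
}

interface ReasoningVerdict {
  sound: boolean;
  flawedAssumptions: string[];  // assumptions the validator rejected
}

async function analyzeWithCorrection(
  analyze: (corrections: string[]) => Promise<Finding>,
  validateReasoning: (finding: Finding) => Promise<ReasoningVerdict>,
  maxAttempts = 3,
): Promise<Finding | null> {
  const corrections: string[] = [];
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const finding = await analyze(corrections);
    const verdict = await validateReasoning(finding);
    if (verdict.sound) return finding;  // the reasoning, not just the answer, holds up
    // Backtrack: feed the rejected assumptions back as corrections so the
    // next pass re-analyzes with updated assumptions instead of repeating them.
    corrections.push(...verdict.flawedAssumptions);
  }
  return null;  // could not produce a finding whose reasoning validates
}
```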
Validation Can't Be Skin Deep
There's a common misconception that validation in AI systems means "structured output." Make sure the JSON is valid. Check that required fields are present. Maybe run the output through another LLM as a judge.
This is necessary but nowhere near sufficient.
Judging correctness (whether as a human or with LLM judges) requires a layered approach that most systems simply don't implement. True validation operates at three distinct levels:
Process validation: Did the model take the appropriate analytical steps? Security analysis isn't just about the destination; it's about the journey. If an agent claims to have found an SQL injection vulnerability but never analyzed the query construction or data sanitization logic, the process failed, regardless of whether the conclusion happens to be correct.
Ghost's validation layer tracks analytical steps and verifies that required checks occurred. If an agent skips a critical analysis phase, the finding is flagged for re-analysis.
The validation layer also incorporates reflective analysis to measure the effectiveness of the process. The output of this reflection is an assessment of which steps worked and which didn't. This learning provides guidance to future agents and acts as a shortcut mechanism to minimize ineffective steps in the process.
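As a rough illustration, with made-up step names rather than Ghost's actual checklist, process validation boils down to comparing the steps an agent took against the steps the finding type requires:
```typescript
// Illustrative process validation: compare the steps an agent actually took
// against the steps a finding type requires. Step names here are invented.
const REQUIRED_STEPS: Record<string, string[]> = {
  sql_injection: ["trace_user_input", "inspect_query_construction", "check_sanitization"],
  auth_bypass: ["map_auth_middleware", "trace_protected_routes", "check_session_handling"],
};

interface AnalysisTrace {
  findingType: string;
  stepsTaken: string[];  // steps the agent actually performed, in order
}

function validateProcess(trace: AnalysisTrace): { ok: boolean; missing: string[] } {
  const required = REQUIRED_STEPS[trace.findingType] ?? [];
  const missing = required.filter((step) => !trace.stepsTaken.includes(step));
  // Any missing step means the process failed, even if the conclusion happens to be right.
  return { ok: missing.length === 0, missing };
}
```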
Format validation: This is the structured output piece everyone implements, but with a crucial caveat. Your output format requirements can actually induce hallucinations if you're not careful.
Consider this scenario: your security analysis system requires every vulnerability to have an "exposure" field indicating whether it's exploitable from an internal or external network context. Sounds reasonable. But what happens when the analyzing agent lacks the contextual information to make that determination?
There's a very real chance you'll get a hallucinated answer. The model knows it must provide a value, so it will infer one from incomplete information. The output will be perfectly formatted. The JSON will validate. And the exposure classification will be wrong.
Subtle errors like this are incredibly difficult to detect because they don't trigger validation failures. The system appears to be working correctly. Only a human reviewer who happens to have the right context will catch the mistake—and only if they think to double-check that particular field.
Ghost's approach uses a two-tiered defense against hallucinated field values. First, we require agents to provide explicit evidence or reasoning alongside every critical field. Rather than just outputting `"exposure": "external"`, the agent must include `"exposure_evidence": "endpoint is exposed via public API route without authentication middleware at api/routes.ts:47"`. This structural requirement makes hallucinations far more detectable. A fabricated conclusion paired with vague or contradictory reasoning is much easier to catch than a bare field value.
Second, our validation layer employs an LLM judge specifically scoring the soundness of that evidence. The judge asks: "Given the provided code context, does this evidence actually support the claimed exposure classification?" If the evidence is insufficient, contradictory, or appears to be inferred rather than observed, the finding is flagged for re-analysis. This two-tiered process dramatically reduces hallucinations while maintaining automation—the agent doesn't guess, and the validator doesn't rubber-stamp.
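A minimal sketch of that two-tiered idea, with a hypothetical schema and the judge passed in as a plain function:
```typescript
// Hypothetical schema: every critical classification carries the observation
// that supports it, and a judge scores whether the evidence holds up.
interface ExposureClaim {
  exposure: "internal" | "external";
  exposure_evidence: string;  // e.g. "public API route without auth middleware at api/routes.ts:47"
}

// The judge is passed in as a plain function so nothing here depends on a
// specific model API; it returns whether the evidence supports the claim.
async function validateExposure(
  claim: ExposureClaim,
  codeContext: string,
  llmJudge: (prompt: string) => Promise<{ supported: boolean; rationale: string }>,
): Promise<boolean> {
  const { supported } = await llmJudge(
    `Given this code context:\n${codeContext}\n\n` +
      `Does the evidence "${claim.exposure_evidence}" actually support classifying ` +
      `exposure as "${claim.exposure}"? Judge strictly from the provided context.`,
  );
  return supported;  // false means the finding gets flagged for re-analysis
}
```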
Correctness validation: The final answer must be objectively correct. This is where many systems deploy an LLM judge -- a different model that renders a decision on the output.
This works, but only if the judge approaches the problem differently than the original analyzer. Ghost's correctness validation employs a separate model with a fundamentally different mandate: be skeptical. Rather than analyzing code to find vulnerabilities, the judge is tasked with interrogating findings. It has access to the same context and tooling as the analysis agents, but has more latitude (and token budget) to run lines of inquiry to ground. Can we actually reach this vulnerable code path given the application's control flow? Does the claimed attack vector hold up when we trace data through sanitization layers? Is there a mitigating control the original analysis missed? This shift from "find vulnerabilities" to "stress-test this finding" creates meaningful separation between analysis and validation. The judge isn't just reproducing the first model's reasoning; it's actively trying to dispute it.
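To illustrate the difference in mandate, here is a hypothetical sketch of the kind of prompt a skeptical judge could be given; the wording and names are ours, not Ghost's:
```typescript
// Illustrative sketch of the skeptical judge's mandate: same context as the
// analyzer, opposite objective.
interface JudgeVerdict {
  disposition: "confirmed" | "disputed";
  counterEvidence: string[];  // mitigating controls, unreachable paths, sanitization, etc.
}

function buildSkepticalJudgePrompt(findingSummary: string, codeContext: string): string {
  return [
    "You are reviewing a reported vulnerability. Your job is to dispute it.",
    "1. Is the vulnerable code path actually reachable given the application's control flow?",
    "2. Does the claimed attack vector survive the data sanitization layers?",
    "3. Is there a mitigating control the original analysis missed?",
    `Finding under review: ${findingSummary}`,
    `Code context: ${codeContext}`,
    "Confirm the finding only if every challenge fails; otherwise dispute it and explain why.",
  ].join("\n");
}

// The judge's reply would be parsed into a JudgeVerdict; a "disputed"
// disposition sends the finding back for re-analysis rather than to a human.
```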
These three validation layers work together. Process validation catches analytical shortcuts. Format validation prevents hallucinated details. Correctness validation ensures the final finding is sound. Most importantly, failures at any layer trigger re-analysis rather than just flagging for human review. The system learns and improves before it asks for your time.
Human-Adjacent Learning: Not HITL, But HATL
Human-in-the-loop (HITL) has become shorthand for "we'll let a human fix it when the AI messes up." It sounds collaborative, but in practice it often means the AI makes its best guess, then waits for a human gatekeeper to approve or reject.
This is the wrong model. It creates more work for the humans, not less.
The reality of systems design (AI-powered, or otherwise) is that it's unrealistic to think we can anticipate all possible environmental conditions upfront. There are limitless variations to encounter in the real world. Codebases have unique architectures, custom security controls, internal libraries, and institutional knowledge that no amount of training data can capture.
Rather than treating human input as a gate at the end of the process, Ghost incorporates it as a nudge mechanism -- what we call Human-Adjacent-to-the-Loop (HATL). The system operates autonomously but allows human input to guide it toward the desired state at any point in the analysis.
Here's a concrete example: an agent analyzes authentication logic and flags a potential bypass vulnerability. The logic appears to allow unauthenticated access to a sensitive endpoint. Technically, the agent's analysis is correct given the code it examined.
But developers on your team know that `AcmeAuthModule` (a third-party identity library) handles authentication and authorization for this app through middleware that isn't visible in the application code itself. This is institutional knowledge the LLM can't know.
Traditional HITL would have the human reject the finding as a false positive. The AI's output gets overridden, but the AI doesn't learn anything. Next time it encounters similar code, it will make the same mistake.
HATL allows the developer to inject that knowledge as context: "AcmeAuthModule is our trusted identity service. It handles authentication and authorization for this app through RBAC."
The system then re-analyzes with that information, updating its understanding of the codebase. The finding is correctly downgraded or dismissed, and crucially, the next time the agent encounters `AcmeAuthModule` in this codebase, it already knows what it does.
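Conceptually, the nudge is just persistent, codebase-scoped knowledge that gets injected into every subsequent analysis. A simplified sketch, with all names hypothetical:
```typescript
// Sketch of a HATL-style nudge: the human's knowledge becomes persistent,
// codebase-scoped context rather than a one-off rejection. Names are hypothetical.
interface CodebaseKnowledge {
  subject: string;  // e.g. "AcmeAuthModule"
  note: string;     // e.g. "Trusted identity service; enforces authn/authz via RBAC middleware"
  addedBy: string;
  addedAt: Date;
}

class KnowledgeStore {
  private entries: CodebaseKnowledge[] = [];

  add(entry: CodebaseKnowledge): void {
    this.entries.push(entry);
  }

  // Everything relevant to the code under analysis is injected into the
  // agent's context on every subsequent run, not just the current re-analysis.
  relevantTo(codeUnderAnalysis: string): CodebaseKnowledge[] {
    return this.entries.filter((e) => codeUnderAnalysis.includes(e.subject));
  }
}

// The developer's nudge is stored once; re-analysis and every future analysis
// that touches AcmeAuthModule sees it automatically.
const store = new KnowledgeStore();
store.add({
  subject: "AcmeAuthModule",
  note: "Trusted identity service; handles authentication and authorization via RBAC middleware.",
  addedBy: "dev@example.com",
  addedAt: new Date(),
});
```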
This creates a flywheel effect. Each interaction makes the system smarter about your specific environment. Over time, the system requires less human input because it has accumulated the institutional knowledge needed to analyze your code correctly.
Rejection History as Proof of Efficacy
Now for the part that flips conventional wisdom on its head.
In traditional security tools, false positives are treated as pure noise -- mistakes that should have been caught before bothering a human. The goal is to minimize them. The dream is to eliminate them entirely.
This framing is shallow and outdated.
It assumes false positives are a binary state: a finding is either right or wrong. But this framing lacks a layered understanding of the analysis process that produces findings in the first place.
Consider the authentication bypass example from earlier. Without context about `AcmeAuthModule`, the finding is technically correct. Given the information available to the agent, identifying that vulnerability is exactly what it should do. From the agent's perspective, this is a true positive.
A human reviewer rejects it as a false positive because she knows about the mitigating control that the agent couldn't see. Same finding, different conclusion based on different information.
Here's the insight: that rejection isn't proof that the system failed. It's proof that the system worked correctly within its knowledge constraints, then successfully incorporated new information to improve.
Ghost surfaces rejection history not as a list of mistakes, but as a verifiable record of learning. When a customer asks, "How do I know your system won't flood me with false positives?" we show them exactly how the system processed previous rejections, what it learned, and how it applied that learning to subsequent analyses.
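In data terms, a surfaced rejection might carry something like the following; the fields and IDs are illustrative, not Ghost's actual model:
```typescript
// Illustrative shape of a rejection surfaced as a learning record
// rather than a mistake; these fields are ours, not Ghost's data model.
interface RejectionRecord {
  findingId: string;
  rejectedBecause: string;           // the reviewer's stated reason
  knowledgeCaptured: string;         // the institutional fact extracted from the rejection
  appliedToLaterAnalyses: string[];  // later analyses where that knowledge changed the outcome
}

const example: RejectionRecord = {
  findingId: "AUTH-BYPASS-042",  // hypothetical IDs
  rejectedBecause: "AcmeAuthModule enforces authentication via middleware",
  knowledgeCaptured: "AcmeAuthModule is a trusted identity service providing RBAC",
  appliedToLaterAnalyses: ["AUTH-BYPASS-057", "IDOR-019"],
};
```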
This is radically more transparent than vendors that claim low false positive rates without showing their work. Anyone can cherry-pick metrics. Showing the actual learning process (including the mistakes and corrections) builds trust in a way that polished marketing claims never can.
More importantly, it fundamentally reframes the conversation from "did the AI mess up?" to "how effectively does the system learn from incomplete information?" That's the question that actually matters.
300+ Agents, Countless Threat Vectors, Real Vulnerabilities
The Ghost platform is all about creating leverage. But everything described here -- adaptive learning, layered validation, human-adjacent feedback, transparent rejection history -- would be theoretical if the system didn't actually find vulnerabilities.
Ghost runs over 300 specialized agents, each focused on nuanced threat vectors across many different types of applications: web apps, APIs, mobile apps, CLI tools, microservices, and more. These aren't generic security checks. They're targeted analyses that understand context-specific and language-specific risks.
We're consistently finding previously undiscovered flaws in our customers' applications. Not hypothetical vulnerabilities. Not noisy false positives. Real security issues that matter.
The difference isn't that our models are bigger or our prompts are better. It's that the entire analytical architecture is designed to learn, validate deeply, and incorporate institutional knowledge rather than operating in isolation.
Context engineering is necessary but insufficient. The actual challenge is building systems that improve through use, validate comprehensively, incorporate human feedback effectively, and demonstrate their reliability through verifiable operation rather than vague and opaque claims.
That's what effective AI security analysis looks like.
Anything less is just context window theater.






