Why we built Ghostbank: A Real-World, Polyglot Banking App for AppSec Benchmarking
Thursday, September 11, 2025

Landon Lewis
Modern AppSec tools demo well against intentionally contrived targets, but those targets rarely look like production systems. To evaluate security products (including our own) under conditions that resemble the messy reality of enterprise software, we built Ghostbank: a real, fully working online banking application with user management, checking and savings accounts, transfers, statements, payees, and support workflows, deliberately seeded with 13 vulnerabilities spanning authorization, authentication, input handling, state, and configuration.
Ghostbank started in Go and now exists in 15+ language ports (including Ruby, Python, PHP, Java, and .NET), enabling cross-language coverage and forcing detectors to generalize beyond framework-specific fingerprints. Ghostbank is core to Ghost Security’s product testing and agentic evaluation framework, and we also use it to benchmark competing tools—legacy and new—on a level playing field.
Why we built Ghostbank (and why not just use Juice Shop/DVWA/DVAPI)
Popular open-source targets are fantastic for training and education, but they have drawbacks for tool benchmarking:
Signature overfitting: Once vulnerabilities are public, it’s easy for engines to add direct signatures or rules that “memorize” test cases. That inflates results without reflecting real-world performance.
Unrealistic design: Many targets compress multiple flaws into toy endpoints that don’t mirror production data models, access patterns, or concurrency.
Framework bias: A single implementation encourages tool heuristics that overfit to one stack.
Ghostbank addresses these issues by being (1) realistic in architecture, (2) closed-source to prevent signature tuning, and (3) multi-language to stress cross-stack generalization.
Design goals
Production-like behavior: Real logins, sessions, RBAC, transactional money movement, statements, webhooks, support tickets.
Controllable vulnerability surface: 13 seeded issues across categories; togglable and parameterized for different difficulty levels.
Cross-language parity: Equivalent functionality implemented idiomatically in 15+ stacks.
Reproducible evaluation: Deterministic datasets, replayable user journeys, and consistent scoring across languages.
Fairness by default: Closed test corpus; no public signatures or “cheat sheets.”
What’s inside Ghostbank
Core features
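User registration, login, and sessions with role-based access control (user/admin/report roles)
Checking and savings accounts with transactional money movement
Transfers, statements and statement downloads, and payee management
Webhooks, support tickets, and support workflows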
Representative data model
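The schema details differ per port, but the entities are the same everywhere. As an illustrative sketch only (type and field names here are assumptions, not Ghostbank's actual schema), a Go-flavored approximation of the entities described in this post:

```go
// Illustrative only: hypothetical Go types approximating the entities described
// in this post (users, accounts, transfers, payees, statements, tickets).
// Names and fields are assumptions, not Ghostbank's actual schema.
package model

import "time"

type Role string // "user", "admin", or "report"

type User struct {
	ID           int64
	Email        string
	PasswordHash []byte
	Role         Role
}

type AccountType string // "checking" or "savings"

type Account struct {
	ID      int64
	UserID  int64 // owner; BOLA checks hinge on enforcing this on every read/write
	Type    AccountType
	Balance int64 // cents
}

type Transfer struct {
	ID            int64
	FromAccountID int64
	ToAccountID   int64
	Amount        int64 // cents
	CreatedAt     time.Time
}

type Payee struct {
	ID       int64
	UserID   int64
	Nickname string // one of the stored-XSS surfaces in seeded builds
}

type Statement struct {
	ID        int64
	AccountID int64
	Path      string // path-traversal surface on download
}

type SupportTicket struct {
	ID      int64
	UserID  int64
	Subject string // another stored-XSS surface
	Body    string
}
```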
Endpoints (subset)
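Routes vary slightly between ports. An indicative subset (POST /api/transfers and GET /api/accounts/:id are referenced later in this post; the other paths are illustrative rather than the exact Ghostbank routes):
POST /api/login and POST /api/password-reset (authentication and reset flows)
GET /api/accounts/:id and GET /api/accounts/:id/statements (account and statement reads)
POST /api/transfers (money movement between accounts)
POST /api/payees and POST /api/tickets (payees and support workflows)
POST /api/webhooks (webhook registration and validation)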
The vulnerability suite (13 categories)
We vary exact implementations by language and build, but every release includes a balanced mix across:
BOLA / IDOR (cross-account data access via predictable IDs; see the sketch after this list)
BFLA (role boundary lapses between user/admin/report)
Weak password reset (predictable or overly long-lived tokens)
Auth brute force (missing or porous rate-limiting)
CSRF on state-changing endpoints (e.g., transfers)
Race condition / double-spend (non-atomic balance updates)
Mass assignment (elevating isAdmin or altering account_id)
SQL injection (reporting/search surfaces)
Stored XSS (payee nicknames, ticket subjects)
SSRF (webhook validation gaps)
Path traversal (statement downloads)
Sensitive data exposure (PII and secrets in logs/telemetry)
Misconfigured CORS/session (credential leakage via overly broad origins)
The precise defects and their parameters can be toggled to prevent overfitting and to support progressive difficulty.
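To make the first class on that list concrete: a BOLA/IDOR defect usually reduces to a handler that looks up an object by a client-supplied ID without checking that the caller owns it. A minimal Go sketch under that assumption (not Ghostbank's actual handler):

```go
// Illustrative BOLA/IDOR sketch: the handler trusts the client-supplied ID and
// never checks that the authenticated user owns the requested account.
package main

import (
	"encoding/json"
	"net/http"
	"strconv"
	"strings"
)

type Account struct {
	ID      int64 `json:"id"`
	UserID  int64 `json:"user_id"`
	Balance int64 `json:"balance"` // cents
}

var accounts = map[int64]Account{
	1: {ID: 1, UserID: 100, Balance: 50_00},
	2: {ID: 2, UserID: 200, Balance: 75_00},
}

// getAccount handles GET /api/accounts/:id.
func getAccount(w http.ResponseWriter, r *http.Request) {
	idStr := strings.TrimPrefix(r.URL.Path, "/api/accounts/")
	id, err := strconv.ParseInt(idStr, 10, 64)
	if err != nil {
		http.Error(w, "bad id", http.StatusBadRequest)
		return
	}

	// Vulnerable: any authenticated caller can walk sequential IDs and read other
	// users' accounts. The fix is an ownership check against the session's user.
	acct, ok := accounts[id]
	if !ok {
		http.NotFound(w, r)
		return
	}
	json.NewEncoder(w).Encode(acct)
}

func main() {
	http.HandleFunc("/api/accounts/", getAccount)
	http.ListenAndServe(":8080", nil)
}
```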
Example: the race-condition “double-spend”
Here’s the essence of the bug we reproduce across stacks (language-specific idioms vary):
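A minimal Go sketch of that essence (illustrative only; the shipped ports express it through their own data layers):

```go
// Illustrative double-spend sketch, not Ghostbank's actual code: a transfer
// routine that reads a balance, checks it, then writes the new balance back
// without a transaction or row lock, so concurrent requests can both pass.
package main

import (
	"fmt"
	"sync"
)

var (
	balances = map[string]int64{"acct-1": 100_00} // balances in cents
	mu       sync.Mutex                           // guards the map, NOT the whole check-then-write sequence
)

// transfer is deliberately non-atomic: the read, the sufficiency check, and the
// write are separate steps, leaving a race window between them.
func transfer(from string, amount int64) error {
	mu.Lock()
	bal := balances[from]
	mu.Unlock()

	if bal < amount {
		return fmt.Errorf("insufficient funds")
	}

	// ... fee calculation, ledger writes, webhook calls would happen here ...

	mu.Lock()
	balances[from] = bal - amount // writes a stale value, clobbering concurrent debits
	mu.Unlock()
	return nil
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ { // 50 parallel submissions, as in Appendix B
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = transfer("acct-1", 80_00) // each call tries to move $80 of a $100 balance
		}()
	}
	wg.Wait()
	// Typically prints 2000: many $80 transfers "succeeded" while the balance
	// was only debited once, the classic double-spend.
	fmt.Println("final balance (cents):", balances["acct-1"])
}
```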
Hammering POST /api/transfers in parallel triggers inconsistent balances. In some languages we reproduce it via an ORM-level race; in others via eventually consistent caches or queues. The variants test detectors that claim to catch business-logic and concurrency flaws, not just injection.
Multi-language ports (15+ stacks)
Shipped: Go, Ruby, Python, PHP, Java, .NET
Additional ports: We maintain additional idiomatic implementations to test language- and framework-agnostic detection. Each port preserves core functional parity and mirrors the same vulnerability classes with stack-specific idioms.
This forces AppSec engines and agents to reason semantically, not just match stack-specific patterns.
How we evaluate tools and models with Ghostbank
Harness & workflow
Build & seed: Spin up a fresh Ghostbank (selected language) with a known vulnerability profile.
Exercise: Deterministic user journeys (auth, transfers, payees, statements). Synthetic traffic includes normal and adversarial flows.
Collect signals: HTTP traces, DB queries, logs, taint flows, model/tool outputs.
Score: We calculate TP/FP/FN, precision/recall/F1, and false-positive rate per KLOC. For agentic systems, we also track time-to-valid-finding and exploitability grading.
Repeat across languages: Run the same scenarios over all Ghostbank ports to measure cross-stack generalization.
Fairness controls
Closed corpus: No public repo; results can’t be inflated by signature tuning.
Variant rotation: Endpoint names, field orders, and token formats rotate between runs.
Difficulty tiers: Easy→hard toggles alter entropy, validation rules, and timing windows.
Noise isolation: Clean baselines to attribute FPs precisely (e.g., benign SSRF-like calls).
Why closed-source for benchmarking?
Open source is great for learning. For benchmarking, it invites “teaching to the test.” Keeping Ghostbank private preserves the integrity of comparative results and better reflects how tools behave against unknown, real-world systems. (We still collaborate with partners under NDA and share detailed runbooks so results are reproducible.)
What we’ve learned so far (qualitative)
Authz > injection: Many tools still underperform on BOLA/BFLA—especially when object graphs and role transitions get realistic.
Concurrency is hard: Double-spend findings lag behind injection/XSS, even for advanced agents.
Language drift exposes overfitting: Precision drops when moving from the “favorite” stack to others, highlighting the value of multi-language parity tests.
(We avoid publishing raw numbers in a public blog so vendors aren’t singled out; customers and partners can see full reports under NDA.)
Integration with Ghost Security’s agentic evals
Ghostbank is wired into our internal agentic evaluation framework:
We run new models and agent strategies across all Ghostbank ports to maximize detection efficacy and minimize false positives before production releases.
We continuously regress against baselines to catch performance drift.
Competing tools are benchmarked with identical workloads for apples-to-apples comparisons.
Using Ghostbank (for partners and customers)
Run context: Single command brings up app + seed data + observability.
Profiles: Choose language, vulnerability tier, and traffic profile (an illustrative profile sketch follows this list).
Outputs: Standardized SARIF/JSON findings, replay bundles, and dashboards.
Engagement model: We host evaluations or provide controlled access under NDA.
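For illustration only, since the real profile format isn't public, a run profile along these lines might carry:

```go
// Hypothetical run-profile shape, for illustration only; the real Ghostbank
// profile format is not public and may differ entirely.
package config

// RunProfile selects what a single evaluation run looks like.
type RunProfile struct {
	Language        string   // e.g. "go", "ruby", "python", "php", "java", "dotnet"
	DifficultyTier  string   // "easy" through "hard": alters entropy, validation rules, timing windows
	Vulnerabilities []string // seeded classes to enable, e.g. "bola", "csrf", "race"
	TrafficProfile  string   // which deterministic journeys and synthetic traffic mix to replay
	VariantSeed     int64    // rotates endpoint names, field orders, and token formats between runs
}
```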
Interested in participating or validating your AppSec engine/agent against Ghostbank? Reach out—we’re happy to run a side-by-side.
What’s next
More ports (modern frameworks & async stacks)
Mobile/SPA front-end variants to test client-server authz drift
Cloud-native misconfig tracks (storage, IAM, policy drift)
Expanded business-logic scenarios (limits, chargebacks, disputes)
Closing thought
Benchmarks should reward reasoning, not memorization. By making Ghostbank realistic, multi-language, and closed to signature tuning, we’ve created a fairer yardstick for AppSec tooling—and a proving ground we use daily to push Ghost Security’s detection quality higher while driving false positives lower.
Appendix A: Example metrics we compute
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × (Precision × Recall) / (Precision + Recall)
FP rate / KLOC and time-to-valid-finding for agents
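As a worked example of those formulas, a small Go helper (illustrative, not our harness code) that turns raw counts into the scores above:

```go
// Illustrative scoring helper for the metrics above; not the actual harness.
package main

import "fmt"

// Scores holds the derived metrics for one tool run.
type Scores struct {
	Precision, Recall, F1, FPPerKLOC float64
}

func score(tp, fp, fn int, kloc float64) Scores {
	var s Scores
	if tp+fp > 0 {
		s.Precision = float64(tp) / float64(tp+fp)
	}
	if tp+fn > 0 {
		s.Recall = float64(tp) / float64(tp+fn)
	}
	if s.Precision+s.Recall > 0 {
		s.F1 = 2 * s.Precision * s.Recall / (s.Precision + s.Recall)
	}
	if kloc > 0 {
		s.FPPerKLOC = float64(fp) / kloc
	}
	return s
}

func main() {
	// Example: 9 true positives, 3 false positives, 4 missed findings, 20 KLOC.
	fmt.Printf("%+v\n", score(9, 3, 4, 20))
	// Precision = 9/12 = 0.75, Recall = 9/13 ≈ 0.69, F1 ≈ 0.72, FP/KLOC = 0.15
}
```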
Appendix B: Sample evaluation scenario
Create user A and user B (distinct accounts).
User A initiates a transfer to user B.
Parallelize transfer submissions (N=50) to trigger race window.
Attempt to fetch GET /api/accounts/:id across users to probe BOLA.
Report all findings; verify exploitability with replay bundle.
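A rough Go sketch of that scenario as a black-box client (illustrative; endpoints other than POST /api/transfers and GET /api/accounts/:id are assumptions, and the real harness's replay and reporting machinery is omitted):

```go
// Illustrative black-box driver for the scenario above. Endpoints other than
// POST /api/transfers and GET /api/accounts/:id are assumed for the sketch.
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"sync"
)

const base = "http://localhost:8080" // wherever the Ghostbank instance is running

func post(path, body string, cookie *http.Cookie) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodPost, base+path, bytes.NewBufferString(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if cookie != nil {
		req.AddCookie(cookie)
	}
	return http.DefaultClient.Do(req)
}

func main() {
	// Steps 1-2: assume users A and B (and a session cookie for A) already exist
	// via seed data; a real run would create them through the signup flow.
	sessionA := &http.Cookie{Name: "session", Value: "token-for-user-a"}

	// Step 3: parallelize N=50 transfer submissions to hit the race window.
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := post("/api/transfers", `{"from":1,"to":2,"amount":8000}`, sessionA)
			if err == nil {
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()

	// Step 4: probe BOLA by reading B's account with A's session.
	req, _ := http.NewRequest(http.MethodGet, base+"/api/accounts/2", nil)
	req.AddCookie(sessionA)
	if resp, err := http.DefaultClient.Do(req); err == nil {
		fmt.Println("cross-account read status:", resp.StatusCode) // 200 suggests BOLA
		resp.Body.Close()
	}

	// Step 5: a real harness would now diff balances, record findings, and emit
	// a replay bundle; that reporting layer is out of scope for this sketch.
}
```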