Why we built Ghostbank: A Real-World, Polyglot Banking App for AppSec Benchmarking
Thursday, September 11, 2025

Landon Lewis
Modern AppSec tools demo well against intentionally contrived targets, but those targets rarely look like production systems. To evaluate security products (including our own) under conditions that resemble the messy reality of enterprise software, we built Ghostbank: a real, fully working online banking application with user management, checking and savings accounts, transfers, statements, payees, and support workflows, deliberately seeded with 13 vulnerabilities spanning authorization, authentication, input handling, state, and configuration.
Ghostbank started in Go and now exists in 15+ language ports (including Ruby, Python, PHP, Java, and .NET), enabling cross-language coverage and forcing detectors to generalize beyond framework-specific fingerprints. Ghostbank is core to Ghost Security’s product testing and agentic evaluation framework, and we also use it to benchmark competing tools—legacy and new—on a level playing field.
Why we built Ghostbank (and why not just use Juice Shop/DVWA/DVAPI)
Popular open-source targets are fantastic for training and education, but they have drawbacks for tool benchmarking:
Signature overfitting: Once vulnerabilities are public, it’s easy for engines to add direct signatures or rules that “memorize” test cases. That inflates results without reflecting real-world performance.
Unrealistic design: Many targets compress multiple flaws into toy endpoints that don’t mirror production data models, access patterns, or concurrency.
Framework bias: A single implementation encourages tool heuristics that overfit to one stack.
Ghostbank addresses these issues by being (1) realistic in architecture, (2) closed-source to prevent signature tuning, and (3) multi-language to stress cross-stack generalization.
Design goals
Production-like behavior: Real logins, sessions, RBAC, transactional money movement, statements, webhooks, support tickets.
Controllable vulnerability surface: 13 seeded issues across categories; togglable and parameterized for different difficulty levels.
Cross-language parity: Equivalent functionality implemented idiomatically in 15+ stacks.
Reproducible evaluation: Deterministic datasets, replayable user journeys, and consistent scoring across languages.
Fairness by default: Closed test corpus; no public signatures or “cheat sheets.”
What’s inside Ghostbank
Core features
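User registration, login, and sessions with role-based access control (user/admin/report roles)
Checking and savings accounts with transactional money movement
Transfers, statements and statement downloads, and payee management
Webhooks, support tickets, and support workflows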
Representative data model
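The schema details differ per port, but the entities are the same everywhere. As an illustrative sketch only (type and field names here are assumptions, not Ghostbank's actual schema), a Go-flavored approximation of the entities described in this post:

```go
// Illustrative only: hypothetical Go types approximating the entities described
// in this post (users, accounts, transfers, payees, statements, tickets).
// Names and fields are assumptions, not Ghostbank's actual schema.
package model

import "time"

type Role string // "user", "admin", or "report"

type User struct {
	ID           int64
	Email        string
	PasswordHash []byte
	Role         Role
}

type AccountType string // "checking" or "savings"

type Account struct {
	ID      int64
	UserID  int64 // owner; BOLA checks hinge on enforcing this on every read/write
	Type    AccountType
	Balance int64 // cents
}

type Transfer struct {
	ID            int64
	FromAccountID int64
	ToAccountID   int64
	Amount        int64 // cents
	CreatedAt     time.Time
}

type Payee struct {
	ID       int64
	UserID   int64
	Nickname string // one of the stored-XSS surfaces in seeded builds
}

type Statement struct {
	ID        int64
	AccountID int64
	Path      string // path-traversal surface on download
}

type SupportTicket struct {
	ID      int64
	UserID  int64
	Subject string // another stored-XSS surface
	Body    string
}
```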
Endpoints (subset)
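Routes vary slightly between ports. An indicative subset (POST /api/transfers and GET /api/accounts/:id are referenced later in this post; the other paths are illustrative rather than the exact Ghostbank routes):
POST /api/login and POST /api/password-reset (authentication and reset flows)
GET /api/accounts/:id and GET /api/accounts/:id/statements (account and statement reads)
POST /api/transfers (money movement between accounts)
POST /api/payees and POST /api/tickets (payees and support workflows)
POST /api/webhooks (webhook registration and validation)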
The vulnerability suite (13 categories)
We vary exact implementations by language and build, but every release includes a balanced mix across:
BOLA / IDOR (cross-account data access via predictable IDs; see the sketch after this list)
BFLA (role boundary lapses between user/admin/report)
Weak password reset (predictable or overly long-lived tokens)
Auth brute force (missing or porous rate-limiting)
CSRF on state-changing endpoints (e.g., transfers)
Race condition / double-spend (non-atomic balance updates)
Mass assignment (elevating isAdmin or altering account_id)
SQL injection (reporting/search surfaces)
Stored XSS (payee nicknames, ticket subjects)
SSRF (webhook validation gaps)
Path traversal (statement downloads)
Sensitive data exposure (PII and secrets in logs/telemetry)
Misconfigured CORS/session (credential leakage via overly broad origins)
The precise defects and their parameters can be toggled to prevent overfitting and to support progressive difficulty.
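To make the first class on that list concrete: a BOLA/IDOR defect usually reduces to a handler that looks up an object by a client-supplied ID without checking that the caller owns it. A minimal Go sketch under that assumption (not Ghostbank's actual handler):

```go
// Illustrative BOLA/IDOR sketch: the handler trusts the client-supplied ID and
// never checks that the authenticated user owns the requested account.
package main

import (
	"encoding/json"
	"net/http"
	"strconv"
	"strings"
)

type Account struct {
	ID      int64 `json:"id"`
	UserID  int64 `json:"user_id"`
	Balance int64 `json:"balance"` // cents
}

var accounts = map[int64]Account{
	1: {ID: 1, UserID: 100, Balance: 50_00},
	2: {ID: 2, UserID: 200, Balance: 75_00},
}

// getAccount handles GET /api/accounts/:id.
func getAccount(w http.ResponseWriter, r *http.Request) {
	idStr := strings.TrimPrefix(r.URL.Path, "/api/accounts/")
	id, err := strconv.ParseInt(idStr, 10, 64)
	if err != nil {
		http.Error(w, "bad id", http.StatusBadRequest)
		return
	}

	// Vulnerable: any authenticated caller can walk sequential IDs and read other
	// users' accounts. The fix is an ownership check against the session's user.
	acct, ok := accounts[id]
	if !ok {
		http.NotFound(w, r)
		return
	}
	json.NewEncoder(w).Encode(acct)
}

func main() {
	http.HandleFunc("/api/accounts/", getAccount)
	http.ListenAndServe(":8080", nil)
}
```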
Example: the race-condition “double-spend”
Here’s the essence of the bug we reproduce across stacks (language-specific idioms vary):
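A minimal Go sketch of that essence (illustrative only; the shipped ports express it through their own data layers):

```go
// Illustrative double-spend sketch, not Ghostbank's actual code: a transfer
// routine that reads a balance, checks it, then writes the new balance back
// without a transaction or row lock, so concurrent requests can both pass.
package main

import (
	"fmt"
	"sync"
)

var (
	balances = map[string]int64{"acct-1": 100_00} // balances in cents
	mu       sync.Mutex                           // guards the map, NOT the whole check-then-write sequence
)

// transfer is deliberately non-atomic: the read, the sufficiency check, and the
// write are separate steps, leaving a race window between them.
func transfer(from string, amount int64) error {
	mu.Lock()
	bal := balances[from]
	mu.Unlock()

	if bal < amount {
		return fmt.Errorf("insufficient funds")
	}

	// ... fee calculation, ledger writes, webhook calls would happen here ...

	mu.Lock()
	balances[from] = bal - amount // writes a stale value, clobbering concurrent debits
	mu.Unlock()
	return nil
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ { // 50 parallel submissions, as in Appendix B
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = transfer("acct-1", 80_00) // each call tries to move $80 of a $100 balance
		}()
	}
	wg.Wait()
	// Typically prints 2000: many $80 transfers "succeeded" while the balance
	// was only debited once, the classic double-spend.
	fmt.Println("final balance (cents):", balances["acct-1"])
}
```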
Hammering POST /api/transfers in parallel triggers inconsistent balances. In some languages we reproduce it via an ORM-level race; in others via eventually consistent caches or queues. The variants test detectors that claim to catch business-logic and concurrency flaws, not just injection.
Multi-language ports (15+ stacks)
Shipped: Go, Ruby, Python, PHP, Java, .NET
Additional ports: We maintain additional idiomatic implementations to test language- and framework-agnostic detection. Each port preserves core functional parity and mirrors the same vulnerability classes with stack-specific idioms.
This forces AppSec engines and agents to reason semantically, not just match stack-specific patterns.
How we evaluate tools and models with Ghostbank
Harness & workflow
Build & seed: Spin up a fresh Ghostbank (selected language) with a known vulnerability profile.
Exercise: Deterministic user journeys (auth, transfers, payees, statements). Synthetic traffic includes normal and adversarial flows.
Collect signals: HTTP traces, DB queries, logs, taint flows, model/tool outputs.
Score: We calculate TP/FP/FN, precision/recall/F1, and false-positive rate per KLOC. For agentic systems, we also track time-to-valid-finding and exploitability grading.
Repeat across languages: Run the same scenarios over all Ghostbank ports to measure cross-stack generalization.
Fairness controls
Closed corpus: No public repo; results can’t be inflated by signature tuning.
Variant rotation: Endpoint names, field orders, and token formats rotate between runs.
Difficulty tiers: Easy→hard toggles alter entropy, validation rules, and timing windows.
Noise isolation: Clean baselines to attribute FPs precisely (e.g., benign SSRF-like calls).
Why closed-source for benchmarking?
Open source is great for learning. For benchmarking, it invites “teaching to the test.” Keeping Ghostbank private preserves the integrity of comparative results and better reflects how tools behave against unknown, real-world systems. (We still collaborate with partners under NDA and share detailed runbooks so results are reproducible.)
What we’ve learned so far (qualitative)
Authz > injection: Many tools still underperform on BOLA/BFLA—especially when object graphs and role transitions get realistic.
Concurrency is hard: Double-spend findings lag behind injection/XSS, even for advanced agents.
Language drift exposes overfitting: Precision drops when moving from the “favorite” stack to others, highlighting the value of multi-language parity tests.
(We avoid publishing raw numbers in a public blog so vendors aren’t singled out; customers and partners can see full reports under NDA.)
Integration with Ghost Security’s agentic evals
Ghostbank is wired into our internal agentic evaluation framework:
We run new models and agent strategies across all Ghostbank ports to maximize detection efficacy and minimize false positives before production releases.
We continuously regress against baselines to catch performance drift.
Competing tools are benchmarked with identical workloads for apples-to-apples comparisons.
Using Ghostbank (for partners and customers)
Run context: Single command brings up app + seed data + observability.
Profiles: Choose language, vulnerability tier, and traffic profile (an illustrative profile sketch follows this list).
Outputs: Standardized SARIF/JSON findings, replay bundles, and dashboards.
Engagement model: We host evaluations or provide controlled access under NDA.
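For illustration only, since the real profile format isn't public, a run profile along these lines might carry:

```go
// Hypothetical run-profile shape, for illustration only; the real Ghostbank
// profile format is not public and may differ entirely.
package config

// RunProfile selects what a single evaluation run looks like.
type RunProfile struct {
	Language        string   // e.g. "go", "ruby", "python", "php", "java", "dotnet"
	DifficultyTier  string   // "easy" through "hard": alters entropy, validation rules, timing windows
	Vulnerabilities []string // seeded classes to enable, e.g. "bola", "csrf", "race"
	TrafficProfile  string   // which deterministic journeys and synthetic traffic mix to replay
	VariantSeed     int64    // rotates endpoint names, field orders, and token formats between runs
}
```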
Interested in participating or validating your AppSec engine/agent against Ghostbank? Reach out—we’re happy to run a side-by-side.
What’s next
More ports (modern frameworks & async stacks)
Mobile/SPA front-end variants to test client-server authz drift
Cloud-native misconfig tracks (storage, IAM, policy drift)
Expanded business-logic scenarios (limits, chargebacks, disputes)
Closing thought
Benchmarks should reward reasoning, not memorization. By making Ghostbank realistic, multi-language, and closed to signature tuning, we’ve created a fairer yardstick for AppSec tooling—and a proving ground we use daily to push Ghost Security’s detection quality higher while driving false positives lower.
Appendix A: Example metrics we compute
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × (Precision × Recall) / (Precision + Recall)
FP rate / KLOC and time-to-valid-finding for agents
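As a worked example of those formulas, a small Go helper (illustrative, not our harness code) that turns raw counts into the scores above:

```go
// Illustrative scoring helper for the metrics above; not the actual harness.
package main

import "fmt"

// Scores holds the derived metrics for one tool run.
type Scores struct {
	Precision, Recall, F1, FPPerKLOC float64
}

func score(tp, fp, fn int, kloc float64) Scores {
	var s Scores
	if tp+fp > 0 {
		s.Precision = float64(tp) / float64(tp+fp)
	}
	if tp+fn > 0 {
		s.Recall = float64(tp) / float64(tp+fn)
	}
	if s.Precision+s.Recall > 0 {
		s.F1 = 2 * s.Precision * s.Recall / (s.Precision + s.Recall)
	}
	if kloc > 0 {
		s.FPPerKLOC = float64(fp) / kloc
	}
	return s
}

func main() {
	// Example: 9 true positives, 3 false positives, 4 missed findings, 20 KLOC.
	fmt.Printf("%+v\n", score(9, 3, 4, 20))
	// Precision = 9/12 = 0.75, Recall = 9/13 ≈ 0.69, F1 ≈ 0.72, FP/KLOC = 0.15
}
```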
Appendix B: Sample evaluation scenario
Create user A and user B (distinct accounts).
User A initiates a transfer to user B.
Parallelize transfer submissions (N=50) to trigger race window.
Attempt to fetch GET /api/accounts/:id across users to probe BOLA.
Report all findings; verify exploitability with replay bundle.
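A rough Go sketch of that scenario as a black-box client (illustrative; endpoints other than POST /api/transfers and GET /api/accounts/:id are assumptions, and the real harness's replay and reporting machinery is omitted):

```go
// Illustrative black-box driver for the scenario above. Endpoints other than
// POST /api/transfers and GET /api/accounts/:id are assumed for the sketch.
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"sync"
)

const base = "http://localhost:8080" // wherever the Ghostbank instance is running

func post(path, body string, cookie *http.Cookie) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodPost, base+path, bytes.NewBufferString(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if cookie != nil {
		req.AddCookie(cookie)
	}
	return http.DefaultClient.Do(req)
}

func main() {
	// Steps 1-2: assume users A and B (and a session cookie for A) already exist
	// via seed data; a real run would create them through the signup flow.
	sessionA := &http.Cookie{Name: "session", Value: "token-for-user-a"}

	// Step 3: parallelize N=50 transfer submissions to hit the race window.
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := post("/api/transfers", `{"from":1,"to":2,"amount":8000}`, sessionA)
			if err == nil {
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()

	// Step 4: probe BOLA by reading B's account with A's session.
	req, _ := http.NewRequest(http.MethodGet, base+"/api/accounts/2", nil)
	req.AddCookie(sessionA)
	if resp, err := http.DefaultClient.Do(req); err == nil {
		fmt.Println("cross-account read status:", resp.StatusCode) // 200 suggests BOLA
		resp.Body.Close()
	}

	// Step 5: a real harness would now diff balances, record findings, and emit
	// a replay bundle; that reporting layer is out of scope for this sketch.
}
```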