
Super-powered Application Discovery and Security Testing with Agentic AI - Part 3

Wednesday, February 26, 2025

Brad Geesaman

Principal Security Engineer

In Part 1 of this blog series, we introduced Ghostbank as our designated target web application with a BOLA flaw in its transfer endpoint, and we walked through the solution using Reaper. In Part 2, that same flaw was discovered, tested, and validated, complete with a report writeup, by an Agentic AI system named ReaperBot, which interacted near-autonomously with Reaper via a suite of tool calls and APIs.

In this final part of the series, we share some of the best practices incorporated into ReaperBot's development and testing so that the community can benefit from our lessons learned. It's also important to be upfront about the challenges that still remain for Agentic AI as applied to the AppSec space.

Best Practices

As we developed and refined the use case for ReaperBot interacting with Reaper, five key considerations drove the biggest improvements in outcomes. They are sorted by impact, but all are worth incorporating and validating in your own projects.

Agent Structure

Early in development, we tried several variations of agent personas and purposes before landing on a structure in which each agent has a discrete area of focus with minimal overlap, guided or "orchestrated" by a team lead responsible for interfacing with the user and setting a game plan for everyone to follow. A reliable warning sign that a single agent had too many responsibilities was having more than five tools to use, or an "and" in the prompt describing its goals and objectives.
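As a rough illustration, here is a minimal sketch of that structure in Python. The `Agent` class, agent names, and prompts are hypothetical placeholders, not ReaperBot's actual implementation; adapt them to whatever framework you use.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    name: str
    system_prompt: str  # one discrete persona and goal, no "and" in objectives
    tools: list[Callable] = field(default_factory=list)  # keep this under ~5

# Task agents each own one narrow area of focus.
discovery_agent = Agent(
    name="discovery",
    system_prompt="You discover live hosts and endpoints for in-scope domains.",
    tools=[],  # e.g., host/endpoint discovery tool calls
)
testing_agent = Agent(
    name="testing",
    system_prompt="You test discovered endpoints for BOLA flaws.",
    tools=[],  # e.g., request replay and validation tool calls
)

# The orchestrator interfaces with the user and sets the game plan; it
# delegates each step to exactly one task agent rather than calling tools.
orchestrator = Agent(
    name="lead",
    system_prompt=(
        "You are the team lead. Break the user's request into a plan and "
        "delegate each step to exactly one task agent."
    ),
)
```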

Model Choice

Once the roles and responsibilities were locked in, the next decision was which LLM and settings to use for each agent. Our recommendation is to use the strongest model for the orchestrator, since it is the one giving instructions to the task-handling agents. Using a faster, cheaper model for performing tool calls and collecting outputs to send back to the orchestrator tends to strike a good balance of cost, quality, and latency.
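In configuration terms, that split might look like the following sketch. The model names are placeholders, not specific recommendations, and the agent names match the hypothetical example above.

```python
# Orchestrator gets the strongest model; task agents get a faster, cheaper one.
MODEL_BY_AGENT = {
    "lead": {"model": "strongest-available-model", "temperature": 0.2},
    "discovery": {"model": "fast-inexpensive-model", "temperature": 0.0},
    "testing": {"model": "fast-inexpensive-model", "temperature": 0.0},
}
```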

Structured Outputs

If you want to save money on your favorite headache medication, do yourself a favor and implement structured outputs for as many agent interactions as possible. They allow for more reliable hand-offs between agents using fewer tokens, and they make testing and validation efforts much easier. Most Python agent frameworks leverage Pydantic to make the interaction between code and LLMs seamless.
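For example, a hand-off schema might look like the following minimal sketch. The fields are illustrative rather than ReaperBot's actual models; the only real API used is Pydantic's own validation.

```python
from pydantic import BaseModel, Field

class Finding(BaseModel):
    endpoint: str = Field(description="The affected endpoint, e.g. /transfer")
    vulnerability: str = Field(description="Vulnerability class, e.g. BOLA")
    validated: bool = Field(description="Whether the flaw was confirmed")
    evidence: str = Field(description="Request/response pair proving the flaw")

# Most frameworks accept the model (or its JSON schema) as the required
# response format; validating the LLM's reply is then a single call that
# raises on malformed output instead of passing garbage downstream.
raw_reply = '{"endpoint": "/transfer", "vulnerability": "BOLA", "validated": true, "evidence": "..."}'
finding = Finding.model_validate_json(raw_reply)
print(finding.vulnerability)  # "BOLA"
```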

Prompt Engineering

After giving each agent its own persona and goals in the system prompt, the majority of tweaking happens in the user prompts. A few key points, illustrated in the sketch after this list:

  • Define what success looks like and what undesired responses look like.

  • Provide several good and bad examples to help "ground" the model.  LLMs are very good mimics, so giving them examples to follow (whether a handful in the prompt or retrieved dynamically) is very helpful for getting reliable responses.

  • Provide a list of "do" and "do not" rules for it to follow. Help explain what you want, but even more importantly, provide instructions not to repeat undesired responses from prior runs.  For example, "Do not focus on other vulnerability types. Only focus on BOLA flaws." prevents the agent from including completely unrelated issues in the output.

  • Reinforce the response structure. For stronger models this wasn't required, but when testing with smaller local models, ending the prompt with an instruction to respond according to the structured format provided made a worthwhile difference in ensuring the model produced valid JSON.
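Putting those four points together, an illustrative user prompt (hypothetical, not ReaperBot's actual prompt) might look like this:

```python
USER_PROMPT = """\
Test the discovered endpoints for BOLA flaws.

Success looks like: a confirmed case where user A's session reads or
modifies user B's resource. An undesired response reports unvalidated
guesses or speculation.

Example (good): "GET /accounts/1002 returned user B's balance while
authenticated as user A."
Example (bad): "The endpoint might also be vulnerable to SQL injection."

Rules:
- Do not focus on other vulnerability types. Only focus on BOLA flaws.
- Do include the exact request and response as evidence.

Respond only with JSON matching the provided Finding schema.
"""
```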

Tool Calls and Integrations

As you develop the "tools" or "function calls" that you provide to the LLM to help it gain context or perform tasks, some key learnings were (see the sketch after this list):

  • Using long, descriptive tool names like reaper_get_live_hosts_for_domains with well-defined docstrings, including arguments and descriptions, was critical to helping the agent properly order tool calls and chain outputs to inputs.

  • Start by stubbing out the tools with mock/hard-coded outputs to get a solid understanding of how well the orchestrator handles the game plan in accordance with your prompts, and how well the task agents choose tool calls. Once things are working well, make the logic in each tool real.

  • Be prepared for tools to be called multiple times and in no particular order.

  • Handle "no results" cases by including a nudge in the response to perform a prerequisite action first; this helps guide the agent along in workflow chains.

  • Respond to the model with a status of success or failure, plus a reason, alongside the actual output of the tool call, so that the agent can feed that back to the user instead of getting stuck retrying a failing action over and over.
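Here is a minimal sketch of a tool written to those guidelines. The `reaper_client` below is a stand-in stub (per the stubbing advice above), not Reaper's actual client API.

```python
class _StubReaperClient:
    """Stand-in for a real Reaper API client; start with stubs like this."""
    def live_hosts(self, domains: list[str]) -> list[str]:
        return [f"app.{d}" for d in domains]  # mock/hard-coded output

reaper_client = _StubReaperClient()

def reaper_get_live_hosts_for_domains(domains: list[str]) -> dict:
    """Return live hosts discovered by Reaper for the given domains.

    Args:
        domains: Fully qualified domain names already added to scan scope.

    Returns:
        A dict with a 'status' of success or failure, a 'reason', and the
        'hosts' list, so the agent can recover instead of blindly retrying.
    """
    try:
        hosts = reaper_client.live_hosts(domains)
    except Exception as exc:
        return {"status": "failure", "reason": str(exc), "hosts": []}
    if not hosts:
        # "No results" nudge: point the agent at the prerequisite step.
        return {
            "status": "success",
            "reason": "No live hosts yet; scan the domains first.",
            "hosts": [],
        }
    return {"status": "success", "reason": "ok", "hosts": hosts}
```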

Additional Considerations

Despite their potential as a super-power, LLM- and Agentic-AI-powered systems still present challenges that every implementation needs to consider:

Security - Is the data used as input, examples, or reference material accurate and free of malicious content and intent?  Is the data completely isolated between tenants sharing an LLM/Agent's abilities?

Privacy - Is the data flowing through the system sensitive, or does it contain intellectual property, and who has the ability to view it? Can it be used in feedback and training mechanisms by the model provider?

Safety - Does the system provide accurate responses for the use case? Are all interactions free from potential harm? Are tool calls acting on critical systems validated by a human first?

Transparency - Does the system provide a detailed log of what actions were taken and what data was shared with the model provider, if any?

Autonomy vs Control - Does the use case properly align with the right amount of autonomy commensurate with the risk? Does it provide the right measures of human validation before certain actions are taken?

Learning and Feedback - How does the system triage and incorporate both good and bad examples of interactions into the right places such that future interactions are better quality?

Change Management and Quality Control - Does the system maintain a consistent standard of outcomes as improvements and changes are made?

With the ReaperBot use case clearly demonstrated, these points should hopefully be easier to understand at a practical level and provide you with a clearer set of questions to ask during your next threat modeling exercise involving AI.

Parting Thoughts

We hope you've enjoyed all three parts of this series. We also hope that by now you have a few Ghost Bucks from Ghostbank in your virtual wallet thanks to Reaper and ReaperBot, along with some concrete, useful takeaways on how to make Agentic AI a super-power of your own. We'd love to hear your thoughts and feedback, so head on over to LinkedIn and let's continue the conversation.
