LLM Agent Honeypot: Monitoring AI Hacking Agents in the Wild
Submitted by Reworr R on our AI alignment (2024 Jun) course
This project was runner-up for the “Technical Research” prize on our AI Alignment (June 2024) course. Participants worked on these projects for 4 weeks. The text below is an excerpt from the final project.
Project Overview
The LLM-Hack Agent Honeypot is a project designed to monitor, capture, and analyze autonomous AI Hacking Agents in the real world.
How It Works:
Simulation: We deploy a simulated “vulnerable” service to attract potential threats.
Catching Mechanisms: This service incorporates specific counter-techniques designed to detect and capture AI-Hacking Agents.
Monitoring: We monitor and log all interactions, waiting for potential attacks from LLM-powered agents.
Capture and Analysis: When an AI agent engages with our system, we capture the attempt and their system prompt details.
---
Why?
Our objectives aim to improve awareness of AI Hacking Agents and their current state of risks by understanding their real-world usage and studying their algorithms and behavior in the wild.
Full project
You can view the full project here.
