Written by Hajime Tsui (@hajimetwi3) - January 2026
The abstract below is quoted from the preprint version 1.1. For the most up-to-date version, please refer to the Zenodo record.
Currently, guardrails are implemented in generative AI systems, such as chat-based AI,
to detect malicious users and to prevent harmful outputs through automated filtering
and content moderation in human-in-the-loop AI systems that involve both automated and
human review.
However, explicit attacks that target the guardrails themselves remain underexplored.
In this work, I examine this problem and define the following concepts:
Meta Attacks / Infiltration for AI
- Guardrail saturation attacks (Saturation Attacks), including:
  - AI-review-based guardrail saturation attacks
  - Human-review-based guardrail saturation attacks
- Exposure enhancement via human and AI-based review systems (including AI/AGI/ASI)
- Attacks on human reviewers (i.e., content moderators) via guardrails
- Attacks on AI-based review systems via guardrails
Unlike jailbreak or prompt injection attacks that aim to extract prohibited outputs,
these Meta Attacks target the guardrail mechanisms themselves, including human and AI reviewers.
Keywords: Meta Attacks for AI, guardrails, human-in-the-loop, content moderators, Saturation Attacks, large language models, prompt injection, jailbreak, context engineering, AI red teaming
For questions regarding this project,
you may contact me on X (formerly Twitter) at @hajimetwi3.
Please note that I may not always be able to respond, and your understanding is appreciated.
This project is a personal research record.
Suggestions from external users may be reviewed as reference information,
but they do not constitute co-authorship or co-invention.
No intellectual property claims regarding submitted suggestions will be recognized. Even if a suggestion is adopted, it will be incorporated only after independent evaluation and restructuring, and its adoption will not imply authorship by the original submitter.
If you have an interest in formal collaboration or joint research, you may express that interest.