Meta-Attacks-for-AI

Meta Attacks for AI: Human and AI Reviewers Under Attack

Written by Hajime Tsui (@hajimetwi3) - January 2026


Preprint


Abstract (from the Preprint)

The abstract below is quoted from the preprint version 1.0. For the most up-to-date version, please refer to the Zenodo record.

Currently, guardrails are implemented in generative AI systems, such as chat-based AI, 
to detect malicious users and to prevent harmful outputs through automated filtering 
and content moderation in human-in-the-loop AI systems that involve both automated and 
human review.

However, explicit attacks that target the guardrails themselves remain underexplored. 
In this work, I examine this problem and define the following concepts:

Meta Attacks / Infiltration for AI
- Guardrail saturation attacks (Saturation Attacks), including:
  - AI-review-based guardrail saturation attacks
  - Human-review-based guardrail saturation attacks
- Exposure enhancement via AI human review (Exposure Enhancement)
- Attacks on AI human reviewers (i.e., content moderators) via guardrails
- Attacks on AI reviewers via guardrails

Unlike jailbreak or prompt injection attacks that aim to extract prohibited outputs, 
these Meta Attacks target the guardrail mechanisms themselves, including human and AI reviewers. 


Author

Written by Hajime Tsui
GitHub Pages: https://hajimetwi3.github.io/hajimetwi3/
GitHub: https://github.com/hajimetwi3
X (Twitter): https://x.com/hajimetwi3
Reddit: https://www.reddit.com/user/hajimetwi3/


Contact

For questions regarding this project, you may contact me on X (formerly Twitter) at @hajimetwi3.
Please note that I may not always be able to respond, and your understanding is appreciated.

This project is a personal research record.
Suggestions from external users may be reviewed as reference information, but they do not constitute co-authorship or co-invention.

No intellectual property claims regarding submitted suggestions will be recognized. Even if a suggestion is adopted, it will be incorporated only after independent evaluation and restructuring, and its adoption will not imply authorship by the original submitter.

If you are interested in formal collaboration or joint research, you are welcome to say so.