Meta-Attacks-for-AI

Meta Attacks for AI: Human and AI Reviewers Under Attack

Written by Hajime Tsui (@hajimetwi3) - January 2026

Preprint

Zenodo (Latest version):
https://doi.org/10.5281/zenodo.18248787
X Announcement:
https://x.com/hajimetwi3/status/2011559002648559773?s=20
Note (Japanese version):
https://note.com/hajimetwi3/n/n910e8d585c9e

Abstract (from the Preprint)

The abstract below is quoted from the preprint version 1.1. For the most up-to-date version, please refer to the Zenodo record.

Currently, guardrails are implemented in generative AI systems, such as chat-based AI, 
to detect malicious users and to prevent harmful outputs through automated filtering 
and content moderation in human-in-the-loop AI systems that involve both automated and 
human review.

However, explicit attacks that target the guardrails themselves remain underexplored. 
In this work, I examine this problem and define the following concepts:

Meta Attacks / Infiltration for AI
- Guardrail saturation attacks (Saturation Attacks), including:  
  - AI-review-based guardrail saturation attacks  
  - Human-review-based guardrail saturation attacks  
- Exposure enhancement via human and AI-based review systems (including AI/AGI/ASI)  
  - Attacks on human reviewers (i.e., content moderators) via guardrails  
  - Attacks on AI-based review systems via guardrails  

Unlike jailbreak or prompt injection attacks that aim to extract prohibited outputs, 
these Meta Attacks target the guardrail mechanisms themselves, including human and AI reviewers. 

Keywords

Meta Attacks for AI, guardrails, human-in-the-loop, content moderators, Saturation Attacks, large language models, prompt injection, jailbreak, context engineering, AI red teaming

Author

Written by Hajime Tsui

ORCID: https://orcid.org/0009-0001-3050-1725
GitHub Pages: https://hajimetwi3.github.io/hajimetwi3/
GitHub: https://github.com/hajimetwi3
X (Twitter): https://x.com/hajimetwi3
Reddit: https://www.reddit.com/user/hajimetwi3/
BlueSky: https://bsky.app/profile/hajimetwi3.bsky.social
Zenodo: Records
Note: https://note.com/hajimetwi3

Contact

For questions regarding this project, you may contact me on X (formerly Twitter) at @hajimetwi3.
Please note that I may not always be able to respond, and your understanding is appreciated.

This project is a personal research record.
Suggestions from external users may be reviewed as reference information, but they do not constitute co-authorship or co-invention.

No intellectual property claims regarding submitted suggestions will be recognized. Even if a suggestion is adopted, it will be incorporated only after independent evaluation and restructuring, and its adoption will not imply authorship by the original submitter.

If you are interested in formal collaboration or joint research, you are welcome to say so when you contact me.