{"id":4793,"date":"2026-03-11T14:38:14","date_gmt":"2026-03-11T14:38:14","guid":{"rendered":"https:\/\/ft365.org\/index.php\/2026\/03\/11\/researchers-discover-major-security-gaps-in-llm-guardrails\/"},"modified":"2026-03-11T14:38:14","modified_gmt":"2026-03-11T14:38:14","slug":"researchers-discover-major-security-gaps-in-llm-guardrails","status":"publish","type":"post","link":"https:\/\/ft365.org\/index.php\/2026\/03\/11\/researchers-discover-major-security-gaps-in-llm-guardrails\/","title":{"rendered":"Researchers Discover Major Security Gaps in LLM Guardrails"},"content":{"rendered":"<div id=\"cphContent_pnlArticleBody\" data-layout-id=\"2\" data-edit-folder-name=\"text\" data-index=\"0\">\n<p>Security and safety guardrails in generative AI tools, deployed to prevent malicious uses like prompt injection attacks, can themselves be hacked through a type of prompt injection.<\/p>\n<p>Researchers at Unit 42, Palo Alto Networks\u2019 research lab, have found that large language models (LLMs) used by GenAI companies to enforce safety policies and evaluate output quality can be manipulated into authorizing policy violations through stealthy input sequences.<\/p>\n<p>Unit 42 refers to these LLMs as \u2018AI Judges\u2019 and said they are increasingly being deployed as AI operations scale.<\/p>\n<p>In a new report published on March 10, Unit 42 demonstrated an attack method that targets these \u2018AI Judges\u2019 and tricks them into authorizing policy violations.<\/p>\n<h2><strong>AdvJudge-Zero, Custom-Made Fuzzer for AI Judges<\/strong><\/h2>\n<p>The attack chain involves the use of AdvJudge-Zero, an automated fuzzer developed internally at Unit 42 to perform red-team-style assessments.<\/p>\n<p>Fuzzers are tools that identify software vulnerabilities by providing unexpected input. 
AdvJudge-Zero applies a similar approach to identify specific trigger sequences that exploit an LLM\u2019s decision-making logic and bypass security controls.<\/p>\n<p>The researchers noted that their technique differs from typical adversarial attacks on AI judges, which generally require clear-box access to the model, meaning the attacker has full visibility into the internal structure of the system.<\/p>\n<p>\u201cIn contrast, AdvJudge-Zero employs an automated fuzzing approach. The tool interacts with an LLM strictly as a user would, using search algorithms to exploit the model&#8217;s own predictive nature,\u201d they wrote.<\/p>\n<h2><strong>Attack on AI Judges Explained<\/strong><\/h2>\n<p>The attack starts by probing the AI Judge and analyzing its next\u2011token probability distribution to identify tokens the model expects to see in natural text.<\/p>\n<p>Instead of random noise, the system prioritizes low\u2011perplexity tokens: innocent\u2011looking characters such as markdown symbols, list markers, or structural phrases that appear normal to both humans and the model but can strongly influence the model\u2019s attention and reasoning.<\/p>\n<p>After gathering candidate tokens, AdvJudge-Zero repeatedly inserts them into evaluation prompts and measures how the model\u2019s decision changes.<\/p>\n<p>Specifically, it monitors the logit gap \u2013 \u201cthe mathematical margin of confidence\u201d \u2013 between the tokens representing \u201callow\u201d and \u201cblock.\u201d By observing which tokens shrink the probability of a blocking decision, the fuzzer identifies formatting patterns that push the model closer to approving content.<\/p>\n<p>In the final stage, AdvJudge-Zero isolates combinations of these tokens that consistently steer the model toward an approval decision. 
These sequences act as subtle control elements that shift the model\u2019s internal reasoning, causing it to \u201callow\u201d the output even when the underlying content violates the GenAI company\u2019s policy, thereby allowing the tool to generate harmful content or carry out cyber-attacks.<\/p>\n<h2><strong>99% Attack Success Rate<\/strong><\/h2>\n<p>Using this attack technique, Unit 42 achieved a 99% success rate in bypassing controls across several widely used architectures that customers rely on today, including open-weight enterprise LLMs, specialized reward models (i.e., LLMs specifically built and trained to act as security guards for other AI systems) and commercial LLMs.<\/p>\n<p>\u201cEven the largest, most \u2018intelligent\u2019 models (with more than 70 billion parameters) were susceptible. Their complexity actually provides more surface area for these logic-based attacks to succeed,\u201d the researchers wrote.<\/p>\n<p>While this experiment showed that AI guardrails, including \u2018AI judges,\u2019 are susceptible to logic flaws, the researchers added that it also provides a solution.<\/p>\n<p>\u201cBy adopting adversarial training \u2013 running this type of fuzzer internally to identify weaknesses and then retraining the model on these examples \u2013 organizations can harden their systems. This approach can reduce the attack success rate from approximately 99% to near zero,\u201d the Unit 42 blog concluded.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Security and safety guardrails in generative AI tools, deployed to prevent malicious uses like prompt injection attacks, can themselves be hacked through a type of prompt injection. 
Researchers at Unit 42, Palo Alto Networks\u2019 research lab, have found that large language models (LLMs) used by GenAI companies to enforce safety policies and evaluate output quality<\/p>\n","protected":false},"author":2,"featured_media":4794,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-4793","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"featured_image_urls":{"full":["https:\/\/ft365.org\/wp-content\/uploads\/2026\/03\/4793-75587101-7475-4d75-9a72-40701e81f9f3.jpg",300,300,false],"thumbnail":["https:\/\/ft365.org\/wp-content\/uploads\/2026\/03\/4793-75587101-7475-4d75-9a72-40701e81f9f3-150x150.jpg",150,150,true],"medium":["https:\/\/ft365.org\/wp-content\/uploads\/2026\/03\/4793-75587101-7475-4d75-9a72-40701e81f9f3.jpg",300,300,false],"medium_large":["https:\/\/ft365.org\/wp-content\/uploads\/2026\/03\/4793-75587101-7475-4d75-9a72-40701e81f9f3.jpg",300,300,false],"large":["https:\/\/ft365.org\/wp-content\/uploads\/2026\/03\/4793-75587101-7475-4d75-9a72-40701e81f9f3.jpg",300,300,false],"1536x1536":["https:\/\/ft365.org\/wp-content\/uploads\/2026\/03\/4793-75587101-7475-4d75-9a72-40701e81f9f3.jpg",300,300,false],"2048x2048":["https:\/\/ft365.org\/wp-content\/uploads\/2026\/03\/4793-75587101-7475-4d75-9a72-40701e81f9f3.jpg",300,300,false],"morenews-featured":["https:\/\/ft365.org\/wp-content\/uploads\/2026\/03\/4793-75587101-7475-4d75-9a72-40701e81f9f3.jpg",300,300,false],"morenews-large":["https:\/\/ft365.org\/wp-content\/uploads\/2026\/03\/4793-75587101-7475-4d75-9a72-40701e81f9f3.jpg",300,300,false],"morenews-medium":["https:\/\/ft365.org\/wp-content\/uploads\/2026\/03\/4793-75587101-7475-4d75-9a72-40701e81f9f3.jpg",300,300,false],"crawlomatic_preview_image":["https:\/\/ft365.org\/wp-content\/uploads\/2026\/03\/4793-75587101-7475-4d75-9a72-40701e81f9f3-146x146.jpg",146,146,true]},
"author_info":{"display_name":"henry","author_link":"https:\/\/ft365.org\/index.php\/author\/henry\/"},"category_info":"<a href=\"https:\/\/ft365.org\/index.php\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","tag_info":"Uncategorized","comment_count":"0","_links":{"self":[{"href":"https:\/\/ft365.org\/index.php\/wp-json\/wp\/v2\/posts\/4793","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ft365.org\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ft365.org\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ft365.org\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/ft365.org\/index.php\/wp-json\/wp\/v2\/comments?post=4793"}],"version-history":[{"count":0,"href":"https:\/\/ft365.org\/index.php\/wp-json\/wp\/v2\/posts\/4793\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ft365.org\/index.php\/wp-json\/wp\/v2\/media\/4794"}],"wp:attachment":[{"href":"https:\/\/ft365.org\/index.php\/wp-json\/wp\/v2\/media?parent=4793"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ft365.org\/index.php\/wp-json\/wp\/v2\/categories?post=4793"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ft365.org\/index.php\/wp-json\/wp\/v2\/tags?post=4793"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}