{"ID":2886517,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.03793","arxiv_id":"2508.03793","title":"AttnTrace: Contextual Attribution of Prompt Injection and Knowledge Corruption","abstract":"Long-context large language models (LLMs), such as Gemini-2.5-Pro and Claude-Sonnet-4, are increasingly used to empower advanced AI systems, including retrieval-augmented generation (RAG) pipelines and autonomous agents. In these systems, an LLM receives an instruction along with a context--often consisting of texts retrieved from a knowledge database or memory--and generates a response that is contextually grounded by following the instruction. Recent studies have designed solutions to trace back to a subset of texts in the context that contributes most to the response generated by the LLM. These solutions have numerous real-world applications, including performing post-attack forensic analysis and improving the interpretability and trustworthiness of LLM outputs. While significant efforts have been made, state-of-the-art solutions such as TracLLM often lead to a high computation cost, e.g., it takes TracLLM hundreds of seconds to perform traceback for a single response-context pair. In this work, we propose AttnTrace, a new context traceback method based on the attention weights produced by an LLM for a prompt. To effectively utilize attention weights, we introduce two techniques designed to enhance the effectiveness of AttnTrace, and we provide theoretical insights for our design choice. We also perform a systematic evaluation for AttnTrace. The results demonstrate that AttnTrace is more accurate and efficient than existing state-of-the-art context traceback methods. We also show that AttnTrace can improve state-of-the-art methods in detecting prompt injection under long contexts through the attribution-before-detection paradigm. As a real-world application, we demonstrate that AttnTrace can effectively pinpoint injected instructions in a paper designed to manipulate LLM-generated reviews. The code is at https://github.com/Wang-Yanting/AttnTrace.","short_abstract":"Long-context large language models (LLMs), such as Gemini-2.5-Pro and Claude-Sonnet-4, are increasingly used to empower advanced AI systems, including retrieval-augmented generation (RAG) pipelines and autonomous agents. In these systems, an LLM receives an instruction along with a context--often consisting of texts re...","url_abs":"https://arxiv.org/abs/2508.03793","url_pdf":"https://arxiv.org/pdf/2508.03793v3","authors":"[\"Yanting Wang\",\"Runpeng Geng\",\"Ying Chen\",\"Jinyuan Jia\"]","published":"2025-08-05T17:56:51Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.CR\"]","methods":"[\"RAG\",\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":611319,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2886517,"paper_url":"https://arxiv.org/abs/2508.03793","paper_title":"AttnTrace: Contextual Attribution of Prompt Injection and Knowledge Corruption","repo_url":"https://github.com/Wang-Yanting/AttnTrace","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
