{"ID":2843904,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.07002","arxiv_id":"2511.07002","title":"Automated Circuit Interpretation via Probe Prompting","abstract":"Mechanistic interpretability aims to understand neural networks by identifying which learned features mediate specific behaviors. Attribution graphs reveal these feature pathways, but interpreting them requires extensive manual analysis -- a single prompt can take approximately 2 hours for an experienced circuit tracer. We present probe prompting, an automated pipeline that transforms attribution graphs into compact, interpretable subgraphs built from concept-aligned supernodes. Starting from a seed prompt and target logit, we select high-influence features, generate concept-targeted yet context-varying probes, and group features by cross-prompt activation signatures into Semantic, Relationship, and Say-X categories using transparent decision rules. Across five prompts including classic \"capitals\" circuits, probe-prompted subgraphs preserve high explanatory coverage while compressing complexity (Completeness 0.83, mean across circuits; Replacement 0.54). Compared to geometric clustering baselines, concept-aligned groups exhibit higher behavioral coherence: 2.3x higher peak-token consistency (0.425 vs 0.183) and 5.8x higher activation-pattern similarity (0.762 vs 0.130), despite lower geometric compactness. Entity-swap tests reveal a layerwise hierarchy: early-layer features transfer robustly (64% transfer rate, mean layer 6.3), while late-layer Say-X features specialize for output promotion (mean layer 16.4), supporting a backbone-and-specialization view of transformer computation. We release code (https://github.com/peppinob-ol/attribution-graph-probing), an interactive demo (https://huggingface.co/spaces/Peppinob/attribution-graph-probing), and minimal artifacts enabling immediate reproduction and community adoption.","short_abstract":"Mechanistic interpretability aims to understand neural networks by identifying which learned features mediate specific behaviors. Attribution graphs reveal these feature pathways, but interpreting them requires extensive manual analysis -- a single prompt can take approximately 2 hours for an experienced circuit tracer...","url_abs":"https://arxiv.org/abs/2511.07002","url_pdf":"https://arxiv.org/pdf/2511.07002v1","authors":"[\"Giuseppe Birardi\"]","published":"2025-11-10T11:53:36Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Transformer\"]","has_code":false,"code_links":[{"ID":607259,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2843904,"paper_url":"https://arxiv.org/abs/2511.07002","paper_title":"Automated Circuit Interpretation via Probe Prompting","repo_url":"https://github.com/peppinob-ol/attribution-graph-probing","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
