{"ID":2842562,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.08903","arxiv_id":"2511.08903","title":"LLM-Guided Probabilistic Fusion for Label-Efficient Document Layout Analysis","abstract":"Document layout understanding remains data-intensive despite advances in semi-supervised learning. We present a framework that enhances semi-supervised detection by fusing visual predictions with structural priors from text-pretrained LLMs via principled probabilistic weighting. Given unlabeled documents, an OCR-LLM pipeline infers hierarchical regions which are combined with teacher detector outputs through inverse-variance fusion to generate refined pseudo-labels.Our method demonstrates consistent gains across model scales. With a lightweight SwiftFormer backbone (26M params), we achieve 88.2$\\pm$0.3 AP using only 5\\% labels on PubLayNet. When applied to document-pretrained LayoutLMv3 (133M params), our fusion framework reaches 89.7$\\pm$0.4 AP, surpassing both LayoutLMv3 with standard semi-supervised learning (89.1$\\pm$0.4 AP, p=0.02) and matching UDOP~\\cite{udop} (89.8 AP) which requires 100M+ pages of multimodal pretraining. This demonstrates that LLM structural priors are complementary to both lightweight and pretrained architectures. Key findings include: (1) learned instance-adaptive gating improves over fixed weights by +0.9 AP with data-dependent PAC bounds correctly predicting convergence; (2) open-source LLMs enable privacy-preserving deployment with minimal loss (Llama-3-70B: 87.1 AP lightweight, 89.4 AP with LayoutLMv3); (3) LLMs provide targeted semantic disambiguation (18.7\\% of cases, +3.8 AP gain) beyond simple text heuristics.Total system cost includes \\$12 for GPT-4o-mini API or 17 GPU-hours for local Llama-3-70B per 50K pages, amortized across training runs.","short_abstract":"Document layout understanding remains data-intensive despite advances in semi-supervised learning. We present a framework that enhances semi-supervised detection by fusing visual predictions with structural priors from text-pretrained LLMs via principled probabilistic weighting. Given unlabeled documents, an OCR-LLM pi...","url_abs":"https://arxiv.org/abs/2511.08903","url_pdf":"https://arxiv.org/pdf/2511.08903v2","authors":"[\"Ibne Farabi Shihab\",\"Sanjeda Akter\",\"Anuj Sharma\"]","published":"2025-11-12T02:25:58Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Large Language Model\"]","has_code":false}
