{"ID":2877184,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.20778","arxiv_id":"2508.20778","title":"SEAL: Structure and Element Aware Learning to Improve Long Structured Document Retrieval","abstract":"In long structured document retrieval, existing methods typically fine-tune pre-trained language models (PLMs) using contrastive learning on datasets lacking explicit structural information. This practice suffers from two critical issues: 1) current methods fail to leverage structural features and element-level semantics effectively, and 2) the lack of datasets containing structural metadata. To bridge these gaps, we propose \\our, a novel contrastive learning framework. It leverages structure-aware learning to preserve semantic hierarchies and masked element alignment for fine-grained semantic discrimination. Furthermore, we release \\dataset, a long structured document retrieval dataset with rich structural annotations. Extensive experiments on both released and industrial datasets across various modern PLMs, along with online A/B testing, demonstrate consistent performance improvements, boosting NDCG@10 from 73.96\\% to 77.84\\% on BGE-M3. The resources are available at https://github.com/xinhaoH/SEAL.","short_abstract":"In long structured document retrieval, existing methods typically fine-tune pre-trained language models (PLMs) using contrastive learning on datasets lacking explicit structural information. This practice suffers from two critical issues: 1) current methods fail to leverage structural features and element-level semanti...","url_abs":"https://arxiv.org/abs/2508.20778","url_pdf":"https://arxiv.org/pdf/2508.20778v2","authors":"[\"Xinhao Huang\",\"Zhibo Ren\",\"Yipeng Yu\",\"Ying Zhou\",\"Zulong Chen\",\"Zeyi Wen\"]","published":"2025-08-28T13:34:42Z","proceeding":"cs.IR","tasks":"[\"cs.IR\",\"cs.LG\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":610352,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2877184,"paper_url":"https://arxiv.org/abs/2508.20778","paper_title":"SEAL: Structure and Element Aware Learning to Improve Long Structured Document Retrieval","repo_url":"https://github.com/xinhaoH/SEAL","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
