{"ID":2892830,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.14619","arxiv_id":"2507.14619","title":"Optimizing Legal Document Retrieval in Vietnamese with Semi-Hard Negative Mining","abstract":"Large Language Models (LLMs) face significant challenges in specialized domains like law, where precision and domain-specific knowledge are critical. This paper presents a streamlined two-stage framework consisting of Retrieval and Re-ranking to enhance legal document retrieval efficiency and accuracy. Our approach employs a fine-tuned Bi-Encoder for rapid candidate retrieval, followed by a Cross-Encoder for precise re-ranking, both optimized through strategic negative example mining. Key innovations include the introduction of the Exist@m metric to evaluate retrieval effectiveness and the use of semi-hard negatives to mitigate training bias, which significantly improved re-ranking performance. Evaluated on the SoICT Hackathon 2024 for Legal Document Retrieval, our team, 4Huiter, achieved a top-three position. While top-performing teams employed ensemble models and iterative self-training on large bge-m3 architectures, our lightweight, single-pass approach offered a competitive alternative with far fewer parameters. The framework demonstrates that optimized data processing, tailored loss functions, and balanced negative sampling are pivotal for building robust retrieval-augmented systems in legal contexts.","short_abstract":"Large Language Models (LLMs) face significant challenges in specialized domains like law, where precision and domain-specific knowledge are critical. This paper presents a streamlined two-stage framework consisting of Retrieval and Re-ranking to enhance legal document retrieval efficiency and accuracy. Our approach emp...","url_abs":"https://arxiv.org/abs/2507.14619","url_pdf":"https://arxiv.org/pdf/2507.14619v1","authors":"[\"Van-Hoang Le\",\"Duc-Vu Nguyen\",\"Kiet Van Nguyen\",\"Ngan Luu-Thuy Nguyen\"]","published":"2025-07-19T13:30:14Z","proceeding":"cs.IR","tasks":"[\"cs.IR\",\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
