{"ID":2921895,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-03T23:52:15.800058688Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.01504","arxiv_id":"2606.01504","title":"Semantic Retrieval for Product Search in E-Commerce","abstract":"Semantic retrieval in e-commerce must handle short, noisy, and colloquial queries over large product catalogs with fine-grained attribute distinctions. We present a Siamese LLM dual-encoder trained through a two-stage pipeline: contrastive learning with a false-negative margin mask to prevent penalization of near-duplicate products, followed by Relative Odds Alignment for Retrieval (ROAR), a preference optimization objective that extends Bradley-Terry to variable-sized graded relevance groups via consecutive odds-ratio margins. The training corpus mirrors this progression: substitute query-product pairs provide coarse semantic supervision in Stage 1 and graded relevance annotations drive fine-grained ranking in Stage 2. The resulting system accurately retrieves exact matches while correctly ordering substitutes and complementary products, with gains confirmed across query-frequency strata and business verticals, and statistical significance validated through live A/B deployment at scale.","short_abstract":"Semantic retrieval in e-commerce must handle short, noisy, and colloquial queries over large product catalogs with fine-grained attribute distinctions. We present a Siamese LLM dual-encoder trained through a two-stage pipeline: contrastive learning with a false-negative margin mask to prevent penalization of near-dupli...","url_abs":"https://arxiv.org/abs/2606.01504","url_pdf":"https://arxiv.org/pdf/2606.01504v1","authors":"[\"Nikhil Kothari\",\"Saksham Samdani\",\"Ritam Mallick\",\"Praveen Gupta\",\"Ankit Vijay\",\"Surender Kumar\"]","published":"2026-05-31T23:59:32Z","proceeding":"cs.IR","tasks":"[\"cs.IR\",\"cs.LG\"]","methods":"[\"Large Language Model\"]","has_code":false}