{"ID":2898491,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.03761","arxiv_id":"2507.03761","title":"Ranking-based Fusion Algorithms for Extreme Multi-label Text Classification (XMTC)","abstract":"In the context of Extreme Multi-label Text Classification (XMTC), where labels are assigned to text instances from a large label space, the long-tail distribution of labels presents a significant challenge. Labels can be broadly categorized into frequent, high-coverage \\textbf{head labels} and infrequent, low-coverage \\textbf{tail labels}, complicating the task of balancing effectiveness across all labels. To address this, combining predictions from multiple retrieval methods, such as sparse retrievers (e.g., BM25) and dense retrievers (e.g., fine-tuned BERT), offers a promising solution. The fusion of \\textit{sparse} and \\textit{dense} retrievers is motivated by the complementary ranking characteristics of these methods. Sparse retrievers compute relevance scores based on high-dimensional, bag-of-words representations, while dense retrievers utilize approximate nearest neighbor (ANN) algorithms on dense text and label embeddings within a shared embedding space. Rank-based fusion algorithms leverage these differences by combining the precise matching capabilities of sparse retrievers with the semantic richness of dense retrievers, thereby producing a final ranking that improves the effectiveness across both head and tail labels.","short_abstract":"In the context of Extreme Multi-label Text Classification (XMTC), where labels are assigned to text instances from a large label space, the long-tail distribution of labels presents a significant challenge. Labels can be broadly categorized into frequent, high-coverage \\textbf{head labels} and infrequent, low-coverage...","url_abs":"https://arxiv.org/abs/2507.03761","url_pdf":"https://arxiv.org/pdf/2507.03761v1","authors":"[\"Celso França\",\"Gestefane Rabbi\",\"Thiago Salles\",\"Washington Cunha\",\"Leonardo Rocha\",\"Marcos André Gonçalves\"]","published":"2025-07-04T18:17:52Z","proceeding":"cs.IR","tasks":"[\"cs.IR\"]","methods":"[]","has_code":false}
