{"ID":3005049,"CreatedAt":"2026-06-03T03:09:48.883664427Z","UpdatedAt":"2026-06-05T07:32:30.50480903Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.03283","arxiv_id":"2606.03283","title":"SpeakerCard-1M: An Evidence-Grounded Speaker Card Corpus for In-the-Wild Speaker Verification","abstract":"Modern speaker verification (SV) systems rely on speaker embeddings that are effective but difficult to interpret or query in natural language. Most existing speech-text corpora target controllable synthesis or utterance-level captioning, and provide limited speaker-level supervision for in-the-wild speaker recognition. This paper introduces SpeakerCard-1M, a bilingual speaker-centric resource for evidence-grounded SV, derived from VoxCeleb1/2 and CN-Celeb1/2, where the \"-1M\" suffix refers to the 1.78M utterance-level captions contained in the release. We adopt a tool-first, LLM-last approach: ten acoustic probes produce field-level evidence, the evidence is aggregated into speaker profiles under a schema that separates relatively stable traits from utterance-level states, and bilingual Speaker Cards are rendered by a constrained LLM that sees only the structured fields. The release includes 56.7K Speaker Card records over 10.2K speakers, 1.78M utterance-level captions, and speaker-ID-disjoint hard-negative triplets. We further define two SV-oriented cross-modal protocols, bidirectional Speaker-Text Retrieval (T2S-R / S2T-R) and Attribute-Conditioned Verification (AC-Verify), and compare a dual-encoder baseline against recent audio language models under a zero-shot forced-choice setting. Joint audio-text training increases VoxCeleb1-O EER by 0.31% absolute over the audio-only baseline. Under a style-symmetric LLM-generated counterfactual protocol, eight recent audio language models (7B-30B+ parameters, both open- and closed-source) score 49-77% on pitch-level AC-Verify under two-way forced choice, compared with 88.66% reached by our dual encoder.","short_abstract":"Modern speaker verification (SV) systems rely on speaker embeddings that are effective but difficult to interpret or query in natural language. Most existing speech-text corpora target controllable synthesis or utterance-level captioning, and provide limited speaker-level supervision for in-the-wild speaker recognition...","url_abs":"https://arxiv.org/abs/2606.03283","url_pdf":"https://arxiv.org/pdf/2606.03283v1","authors":"[\"Junyi Peng\",\"Oldřich Plchot\",\"Xiao Song\",\"Dading Chong\",\"Lichun Fan\",\"Hang Su\",\"Themos Stafylakis\",\"Junjie Li\",\"Kong Aik Lee\",\"Shuai Wang\",\"Jan Černocký\"]","published":"2026-06-02T07:49:30Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.SD\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
