{"ID":2921552,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-03T00:47:32.987482086Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.00971","arxiv_id":"2606.00971","title":"HypothesisMed: Inference-Time Answer Fusion and Structured Hypothesis-Space Reporting for Biomedical Question Answering","abstract":"Biomedical question answering with large language models is commonly evaluated using answer accuracy, but answer accuracy alone does not indicate whether a model can produce parseable outputs, follow structured reliability instructions, recognize weak answer spaces, or avoid confident incorrect commitments. This paper presents HypothesisMed, an inference-time reliability pipeline for biomedical multiple-choice question answering. It combines direct, chain-of-thought, HypothesisMed-v3 prompting, and answer fusion. The final answer is selected by fusion, while HypothesisMed-v3 supplies SPACE labels and confidence information. SPACE labels mark the answer space as VALID, INCOMPLETE, or CONTRADICTED. We evaluate Qwen2.5-7B, Phi-4-mini, DeepSeek-R1-32B, and BioMistral-7B on MedQA, MedMCQA, and PubMedQA using 1,000 examples per dataset. The pipeline improves weighted accuracy over each model's best direct or chain-of-thought baseline while increasing parse and SPACE coverage. We also scale evaluation to Qwen2.5-7B and Phi-4-mini using 10,183 examples per model. Fusion improves Phi-4-mini accuracy from 0.4296 to 0.5192, while Qwen2.5-7B chain-of-thought remains slightly higher in answer accuracy. However, Qwen2.5-7B fusion achieves complete parse and SPACE coverage with much lower false commitment. A 12,000-example SPACE stress test shows answer-space diagnosis remains difficult, with SPACE accuracy of 0.3074 for Qwen2.5-7B and 0.4168 for Phi-4-mini. These results show that answer accuracy, parseability, structured reliability reporting, calibration behavior, and false-commitment behavior are separable capabilities. The main contribution is not a universal state-of-the-art claim, but a reproducible inference-time framework for evaluating biomedical question answering models as auditable workflow components under structured reliability constraints.","short_abstract":"Biomedical question answering with large language models is commonly evaluated using answer accuracy, but answer accuracy alone does not indicate whether a model can produce parseable outputs, follow structured reliability instructions, recognize weak answer spaces, or avoid confident incorrect commitments. This paper...","url_abs":"https://arxiv.org/abs/2606.00971","url_pdf":"https://arxiv.org/pdf/2606.00971v1","authors":"[\"Md Motaleb Hossen Manik\",\"Ge Wang\"]","published":"2026-05-31T03:02:05Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Language Model\"]","has_code":false}