{"ID":2858582,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.06670","arxiv_id":"2510.06670","title":"PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch","abstract":"High-quality instruction data is critical for LLM alignment, yet existing open-source datasets often lack efficiency, requiring hundreds of thousands of examples to approach proprietary performance. In this work, we find that beyond the widely recognized importance of prompt-response quality, prompt difficulty itself plays a critical role in driving alignment gains. Motivated by this observation, we introduce PiKa, a data-efficient family of expert-level alignment datasets that concentrates supervision on high-difficulty instructions. The PiKa-SFT dataset contains only 30k examples, an order of magnitude fewer than state-of-the-art open datasets like Magpie-Pro. Despite its small size, fine-tuning Llama-3-8B-Base on PiKa-SFT even outperforms the official Llama-3-8B-Instruct model trained on over 10M proprietary examples on widely used benchmarks such as AlpacaEval 2.0 and Arena-Hard. We also validate the generalizability of PiKa across the Qwen2.5 series (0.5B-7B), consistently surpassing their official instruction-tuned counterparts. Additionally, we provide 30k high-quality preference optimization examples to further enhance alignment. Our results demonstrate that promising alignment is achievable with significantly reduced data, democratizing access for resource-constrained research. Our code and data will be available at https://github.com/SJY8460/PiKa.","short_abstract":"High-quality instruction data is critical for LLM alignment, yet existing open-source datasets often lack efficiency, requiring hundreds of thousands of examples to approach proprietary performance. \nIn this work, we find that beyond the widely recognized importance of prompt-response quality, prompt difficulty itself p...","url_abs":"https://arxiv.org/abs/2510.06670","url_pdf":"https://arxiv.org/pdf/2510.06670v2","authors":"[\"Shangjian Yin\",\"Shining Liang\",\"Wenbiao Ding\",\"Yuli Qian\",\"Zhouxing Shi\",\"Hongzhi Li\",\"Yutao Xie\"]","published":"2025-10-08T05:47:37Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\"]","has_code":false,"code_links":[{"ID":608561,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2858582,"paper_url":"https://arxiv.org/abs/2510.06670","paper_title":"PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch","repo_url":"https://github.com/SJY8460/PiKa","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
