{"ID":2897229,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.06366","arxiv_id":"2507.06366","title":"DecoyDB: A Dataset for Graph Contrastive Learning in Protein-Ligand Binding Affinity Prediction","abstract":"Predicting the binding affinity of protein-ligand complexes plays a vital role in drug discovery. Unfortunately, progress has been hindered by the lack of large-scale and high-quality binding affinity labels. The widely used PDBbind dataset has fewer than 20K labeled complexes. Self-supervised learning, especially graph contrastive learning (GCL), provides a unique opportunity to break the barrier by pre-training graph neural network models based on vast unlabeled complexes and fine-tuning the models on much fewer labeled complexes. However, the problem faces unique challenges, including a lack of a comprehensive unlabeled dataset with well-defined positive/negative complex pairs and the need to design GCL algorithms that incorporate the unique characteristics of such data. To fill the gap, we propose DecoyDB, a large-scale, structure-aware dataset specifically designed for self-supervised GCL on protein-ligand complexes. DecoyDB consists of high-resolution ground truth complexes (less than 2.5 Angstrom) and diverse decoy structures with computationally generated binding poses that range from realistic to suboptimal (negative pairs). Each decoy is annotated with a Root Mean Squared Deviation (RMSD) from the native pose. We further design a customized GCL framework to pre-train graph neural networks based on DecoyDB and fine-tune the models with labels from PDBbind. Extensive experiments confirm that models pre-trained with DecoyDB achieve superior accuracy, label efficiency, and generalizability.","short_abstract":"Predicting the binding affinity of protein-ligand complexes plays a vital role in drug discovery. Unfortunately, progress has been hindered by the lack of large-scale and high-quality binding affinity labels. The widely used PDBbind dataset has fewer than 20K labeled complexes. Self-supervised learning, especially grap...","url_abs":"https://arxiv.org/abs/2507.06366","url_pdf":"https://arxiv.org/pdf/2507.06366v1","authors":"[\"Yupu Zhang\",\"Zelin Xu\",\"Tingsong Xiao\",\"Gustavo Seabra\",\"Yanjun Li\",\"Chenglong Li\",\"Zhe Jiang\"]","published":"2025-07-08T20:02:53Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"q-bio.BM\"]","methods":"[\"Graph Neural Network\",\"Generative Adversarial Network\"]","has_code":false}