{"ID":2848238,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.07433","arxiv_id":"2511.07433","title":"Benchmarking Simulacra AI's Quantum Accurate Synthetic Data Generation for Chemical Sciences","abstract":"In this work, we benchmark Simulacra's synthetic data generation pipeline against a state-of-the-art Microsoft pipeline on a dataset of small to large systems. By analyzing the energy quality, autocorrelation times, and effective sample size, our findings show that Simulacra's Large Wavefunction Models (LWM) pipeline, paired with state-of-the-art Variational Monte Carlo (VMC) sampling algorithms, reduces data generation costs by 15-50x while maintaining parity in energy accuracy, and by 2-3x compared to traditional CCSD methods on the scale of amino acids. This enables the creation of affordable, large-scale ab-initio datasets, accelerating AI-driven optimization and discovery in the pharmaceutical industry and beyond. Our improvements are based on a novel and proprietary sampling scheme called Replica Exchange with Langevin Adaptive eXploration (RELAX).","short_abstract":"In this work, we benchmark Simulacra's synthetic data generation pipeline against a state-of-the-art Microsoft pipeline on a dataset of small to large systems. By analyzing the energy quality, autocorrelation times, and effective sample size, our findings show that Simulacra's Large Wavefunction Models (LWM) pipeline,...","url_abs":"https://arxiv.org/abs/2511.07433","url_pdf":"https://arxiv.org/pdf/2511.07433v1","authors":"[\"Fabio Falcioni\",\"Elena Orlova\",\"Timothy Heightman\",\"Philip Mantrov\",\"Aleksei Ustimenko\"]","published":"2025-10-30T19:19:56Z","proceeding":"physics.chem-ph","tasks":"[\"physics.chem-ph\",\"cs.AI\",\"physics.comp-ph\"]","methods":"[\"LoRA\"]","has_code":false}