Multifidelity data hierarchy study for excitation energies shows promising results for application of machine learning methods

V. Vinod and P. Zaspel. Investigating Data Hierarchies in Multifidelity Machine Learning for Excitation Energies. J. Chem. Theory Comput., 21, 6, 3077–3091, 2025. DOI: 10.1021/acs.jctc.4c01491; also available as arXiv:2410.11392.

Multifidelity machine learning (MFML) has been shown to reduce the time cost of generating training data for machine learning (ML) models that predict quantum chemistry (QC) properties. MFML achieves this by using training data computed at different accuracies, or fidelities. In this work, Vivin Vinod and Peter Zaspel investigate the effect of the multifidelity data hierarchy on model cost and accuracy. Using a new error metric, the error contours of MFML, the work systematically studies the impact of the different fidelities on the overall model error. Based on this outcome, a new multifidelity approach, the Γ-curve, is implemented and shown to be a highly efficient method that yields low model error with as few as two training samples at the costliest fidelity.
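To make the idea of a data hierarchy concrete, the following minimal Python sketch computes how many training samples each fidelity would receive. It assumes that the number of samples grows by a constant factor from each fidelity to the next cheaper one. That scaling rule, the function name training_set_sizes, and the parameter names n_top and gamma are illustrative assumptions and are not taken from the summary above.

```python
def training_set_sizes(n_top: int, gamma: int, n_fidelities: int) -> list[int]:
    """Return the number of training samples per fidelity.

    The list is ordered from the cheapest to the costliest fidelity.
    Assumption (illustrative only): each cheaper fidelity uses `gamma`
    times as many samples as the next costlier one.
    """
    if n_top < 1 or gamma < 1 or n_fidelities < 1:
        raise ValueError("n_top, gamma and n_fidelities must all be >= 1")
    # The costliest fidelity (last entry) gets n_top samples; every step
    # towards the cheapest fidelity multiplies the count by gamma.
    return [n_top * gamma ** (n_fidelities - 1 - level)
            for level in range(n_fidelities)]


# Two samples at the costliest of five fidelities, scaling factor 2.
sizes = training_set_sizes(n_top=2, gamma=2, n_fidelities=5)
print(sizes)  # [32, 16, 8, 4, 2]
```

Under this assumption, most of the training data sits at the cheap fidelities, which is where the savings in data generation cost would come from.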

New development in multi-fidelity machine learning methods opens up possibilities for the use of heterogeneous data for the prediction of quantum chemical properties

V. Vinod and P. Zaspel. Assessing non-nested configurations of multifidelity machine learning for quantum-chemical properties. Machine Learning: Science and Technology, 5, 045005, 2024. DOI: 10.1088/2632-2153/ad7f25; also available as arXiv:2407.17087.

Multi-fidelity methods in machine learning (ML) of quantum chemistry (QC) properties have made high-accuracy, low-cost models more accessible to the community. These methods have been applied to a range of properties, including excitation energies. Most multi-fidelity methods require a nested configuration of the training data, that is, every geometry used at a higher fidelity must also be calculated at the lower fidelities.
In a recent work, also available as a preprint, the authors, Vivin Vinod and Peter Zaspel, assess a non-nested configuration of multi-fidelity machine learning (MFML) and optimized MFML (o-MFML) methods. Preliminary results suggest that while MFML would still require a nested data structure, o-MFML can generalize reasonably well over a non-nested training data structure. That is, o-MFML opens up avenues for the use of heterogeneous datasets, reducing the need for costly high-fidelity calculations.
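The difference between nested and non-nested training data can be sketched in a few lines of Python. This is only an illustration of the two data layouts, not the authors' sampling procedure. The function names are hypothetical, geometries are represented by integer indices into a pool, and the non-nested case is built from disjoint index sets as one simple possibility.

```python
import random


def nested_indices(pool_size: int, sizes: list[int], seed: int = 0) -> list[set[int]]:
    """Pick geometry indices per fidelity so that each set contains the next.

    `sizes` is ordered from the cheapest to the costliest fidelity.
    """
    if max(sizes) > pool_size:
        raise ValueError("pool is too small for the requested sizes")
    order = list(range(pool_size))
    random.Random(seed).shuffle(order)
    # Prefixes of one shuffled order: a shorter prefix is always contained
    # in a longer one, so the resulting sets are nested.
    return [set(order[:n]) for n in sizes]


def non_nested_indices(pool_size: int, sizes: list[int], seed: int = 0) -> list[set[int]]:
    """Pick disjoint geometry indices per fidelity (one non-nested layout)."""
    if sum(sizes) > pool_size:
        raise ValueError("pool is too small for disjoint sets")
    order = list(range(pool_size))
    random.Random(seed).shuffle(order)
    sets, start = [], 0
    for n in sizes:
        # Consecutive, non-overlapping chunks of the shuffled order.
        sets.append(set(order[start:start + n]))
        start += n
    return sets


def is_nested(index_sets: list[set[int]]) -> bool:
    """True if every set contains the set of the next (costlier) fidelity."""
    return all(hi <= lo for lo, hi in zip(index_sets, index_sets[1:]))


sizes = [8, 4, 2]  # cheapest to costliest fidelity
print(is_nested(nested_indices(100, sizes)))      # True
print(is_nested(non_nested_indices(100, sizes)))  # False
```

In the nested layout, every geometry used at the costliest fidelity also needs calculations at all cheaper fidelities. In the non-nested layout, each fidelity can draw on its own set of geometries, which is what would allow independently produced datasets to be combined.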

Dataset of diverse quantum chemical properties to enable research and benchmarking of multifidelity machine learning models released!

V. Vinod and P. Zaspel. QeMFi: A Multifidelity Dataset of Quantum Chemical Properties of Diverse Molecules. Sci Data 12, 202, 2025. DOI: https://doi.org/10.1038/s41597-024-04247-3; also available as arXiv:2406.14149.

V. Vinod and P. Zaspel. QeMFi: A Multifidelity Dataset of Quantum Chemical Properties of Diverse Molecules (1.1.0) [Data set]. Zenodo. 2024. https://zenodo.org/records/13925688.

With research booming in the field of multifidelity methods for quantum chemistry (QC), it becomes important to benchmark the various methods so that models can be compared meaningfully. Benchmarks expedite research by setting standards against which subsequent work can assess its own methodological developments. In the interest of such a uniform comparison, the Quantum chemistry MultiFidelity (QeMFi) dataset was released to the community under the open CC-BY-4.0 license. Containing 135k geometries of diverse and chemically complex molecules taken from the WS22 database, the QeMFi dataset provides QC properties ranging from excitation energies to molecular dipole moments. For each property, five fidelities are provided, all computed at DFT accuracy. The fidelities differ in the choice of basis set. This dataset is a major step towards the research and development of multifidelity machine learning methods for QC.

The authors of this work are Vivin Vinod and Peter Zaspel.