Stacking Gaussian processes to improve pKa predictions in the SAMPL7 challenge

J Comput Aided Mol Des. 2021 Sep;35(9):953-961. doi: 10.1007/s10822-021-00411-8. Epub 2021 Aug 7.

ABSTRACT

Accurate predictions of acid dissociation constants are essential to rational molecular design in the pharmaceutical industry and elsewhere. There has been much interest in developing new machine learning methods that can produce fast and accurate pKa predictions for arbitrary species, as well as estimates of prediction uncertainty. Previously, as part of the SAMPL6 community-wide blind challenge, Bannan et al. approached the problem of predicting pKa values by using Gaussian process regression to predict microscopic pKa values, from which macroscopic pKa values can be computed analytically (Bannan et al. in J Comput-Aided Mol Des 32:1165-1177). While this method can make reasonably quick and accurate predictions using a small training set, its accuracy was limited because the training set did not cover a sufficiently broad range of chemical space (e.g., polyprotic acids). Here, to address this issue, we construct a deep Gaussian process (GP) model that can include more features without invoking the curse of dimensionality. We trained both a standard GP and a deep GP model using a database of approximately 3500 small molecules curated from public sources, filtered by similarity to targets. We tested the models on both the SAMPL6 challenge and the more recent SAMPL7 challenge. The latter presented a similar mismatch: ionizable sites and/or environments in the test set were not represented in the previous training set. The results show that while the deep GP model made only minor improvements over the standard GP model for SAMPL6 predictions, it made significant improvements over the standard GP model in SAMPL7 macroscopic predictions, achieving a mean absolute error (MAE) of 1.5 pKa units.
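
The statement that macroscopic pKa values can be computed analytically from microscopic ones can be illustrated with the textbook relations for a molecule with several ionizable sites. The sketch below is not the implementation of Bannan et al. or of this work; the function names are hypothetical, and it assumes the simplest case, in which the first macroscopic dissociation constant is the sum of the microscopic constants for losing a proton from the fully protonated state (K = k_1 + k_2 + ...), and the last is the reciprocal of the summed reciprocals for the singly protonated microstates (1/K = 1/k_1 + 1/k_2 + ...).

```python
import math


def macro_pka_first(micro_pkas):
    """First macroscopic pKa of a fully protonated species.

    micro_pkas: microscopic pKa values for removing one proton from each
    ionizable site of the same fully protonated microstate.
    Uses K_macro = sum_i k_i, with k_i = 10**(-pk_i).
    """
    k_macro = sum(10.0 ** (-pk) for pk in micro_pkas)
    return -math.log10(k_macro)


def macro_pka_last(micro_pkas):
    """Last macroscopic pKa, leading to the fully deprotonated species.

    micro_pkas: microscopic pKa values for removing the final proton from
    each singly protonated microstate (tautomer).
    Uses 1/K_macro = sum_i 1/k_i, i.e. pK_macro = log10(sum_i 10**pk_i).
    """
    return math.log10(sum(10.0 ** pk for pk in micro_pkas))


if __name__ == "__main__":
    # Hypothetical diprotic acid with two equivalent sites (illustrative values).
    print(round(macro_pka_first([4.0, 4.0]), 3))
    print(round(macro_pka_last([9.0, 9.0]), 3))
```

With two equivalent sites, the first macroscopic pKa falls below the microscopic value by log10(2) and the last rises above it by the same amount, which is the usual statistical factor. With a single site, both functions return the microscopic pKa unchanged.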

PMID:34363562 | DOI:10.1007/s10822-021-00411-8