Guest post from Dr. Gregory Bowman, UC Berkeley

Two general objectives of the Folding@home project are (1) to explain the molecular origins of existing experimental data and (2) to provide new insights that will inspire the next generation of cutting-edge experiments. We have made tremendous progress in both areas, but particularly in the first. Obtaining new insight is even more of an art and, therefore, less automatable.

To help facilitate new insights, I recently developed a Bayesian algorithm for coarse-graining our models. To explain: when we are studying some process, like the folding of a particular protein, we typically start by drawing on the computing resources you share with us to run extensive simulations of the process. Next, we build a Markov model from this data. As I’ve explained previously, these models are something like maps of the conformational space a protein explores. Specifically, they enumerate the conformations the protein can adopt, how likely the protein is to form each of these structures, and how long it takes to morph from one structure to another. Typically, our initial models have tens of thousands of parameters and are capable of capturing fine details of the process at hand.
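
To make the idea of such a map concrete, here is a rough sketch (my own illustration, not the Folding@home codebase) of the core data structure: a matrix of observed transition counts between conformational states, row-normalized into transition probabilities, from which the equilibrium probability of each structure follows. The three states and all counts below are invented.

```python
import numpy as np

# Hypothetical transition counts between three conformational states,
# observed at a fixed lag time (rows: from-state, columns: to-state).
counts = np.array([
    [90.0,  8.0,  2.0],   # e.g. "unfolded"
    [10.0, 80.0, 10.0],   # e.g. "intermediate"
    [ 1.0,  9.0, 90.0],   # e.g. "folded"
])

# Row-normalize counts to get transition probabilities:
# T[i, j] = probability of moving from state i to state j per lag time.
T = counts / counts.sum(axis=1, keepdims=True)

# The stationary distribution (how likely the protein is to be found in
# each structure at equilibrium) is the fixed point of pi @ T = pi;
# simple power iteration converges to it.
pi = np.ones(3) / 3
for _ in range(1000):
    pi = pi @ T

print(T)   # transition probabilities
print(pi)  # equilibrium probability of each conformation
```

A real model of this kind has thousands of states rather than three, which is exactly why the matrix becomes hard to interpret by eye.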

Such models are superb for making a connection with experiments because we can capture all the little details that contribute to particular experimental observations. However, they are extremely hard to understand. Therefore, it is to our advantage to coarse-grain them. That is, we attempt to build a model with very few parameters that is as close as possible to the original, complicated model. If done properly, the new model can capture the essence of the phenomenon in a way that is easier for us to wrap our minds around. Based on the understanding this new model provides, we can start to generate new hypotheses and then test them with our more complicated models and, ultimately, via experiment.
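
In its simplest form, coarse-graining amounts to lumping many fine-grained states into a few macrostates and re-estimating the transitions between the lumps. The toy sketch below (my own illustration, not the algorithm described in this post) merges four hypothetical microstates into two macrostates by summing their transition counts:

```python
import numpy as np

# Hypothetical microstate transition counts (4 microstates): states 0-1
# interconvert quickly, as do states 2-3, with rare hops between the pairs.
counts = np.array([
    [50.0, 40.0,  5.0,  5.0],
    [40.0, 50.0,  5.0,  5.0],
    [ 5.0,  5.0, 50.0, 40.0],
    [ 5.0,  5.0, 40.0, 50.0],
])

# A lumping: microstates 0,1 -> macrostate 0; microstates 2,3 -> macrostate 1.
labels = np.array([0, 0, 1, 1])
n_macro = labels.max() + 1

# Aggregate counts over the lumps to get a 2x2 coarse-grained model.
coarse = np.zeros((n_macro, n_macro))
for i in range(len(labels)):
    for j in range(len(labels)):
        coarse[labels[i], labels[j]] += counts[i, j]

T_coarse = coarse / coarse.sum(axis=1, keepdims=True)
print(T_coarse)
```

The hard part, of course, is choosing which states to lump together when the best grouping is not obvious in advance, and that is where statistics enters.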

Statistical uncertainty is a major hurdle in performing this sort of coarse-graining. For example, if we observe 100 transitions between a pair of conformations and each of these transitions is slow, then we can be pretty sure this is really a slow transition. However, if we observe another transition only once and it happens to occur slowly, who knows? It could be that it is really a slow transition. On the other hand, it could be that we just got unlucky.
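
The intuition about 100 observations versus one can be made quantitative with a simple Bayesian posterior. In this sketch (my own illustration, with invented numbers), a probability estimated from observed events gets a Beta posterior whose width shrinks as the counts grow, so one lone observation leaves us almost as uncertain as we started:

```python
from math import sqrt

def beta_posterior(successes, trials, a=1.0, b=1.0):
    """Mean and std. dev. of a Beta posterior on a probability,
    starting from a uniform Beta(a, b) prior."""
    a_post = a + successes
    b_post = b + trials - successes
    total = a_post + b_post
    mean = a_post / total
    var = (a_post * b_post) / (total**2 * (total + 1))
    return mean, sqrt(var)

# 100 observed events, all consistent: a tight posterior.
print(beta_posterior(100, 100))

# A single observation: a broad posterior -- "who knows?"
print(beta_posterior(1, 1))
```

The second posterior is more than an order of magnitude wider than the first, which is exactly the "we just got unlucky" possibility expressed as a number.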

Existing methods for coarse-graining our Markov models assume we have enough data to accurately describe each transition. Therefore, they often pick up these poorly characterized transitions as being important (for protein folding, we typically care most about the slow steps, so slow and important are synonymous). The new method I’ve developed (described here) explicitly takes into account how many times a transition was observed. Therefore, it can appropriately place emphasis on the transitions we observed enough times to trust while disregarding the transitions we don’t trust. To accomplish this, I draw on Bayesian statistics. I can’t do this subject justice here, but if you’re ever trying to make sense of data that you have varying degrees of faith in, I highly recommend you look into Bayesian statistics.
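
As a final hedged sketch (my own toy example, not the published method), here is one common way count information enters such an analysis: placing a Dirichlet posterior over each row of the count matrix attaches an uncertainty to every transition probability, so poorly sampled transitions announce themselves rather than masquerading as important:

```python
import numpy as np

# Hypothetical counts: states 0 and 1 are well sampled,
# while state 2 was visited only once.
counts = np.array([
    [100.0,  50.0,  2.0],
    [ 40.0, 120.0,  6.0],
    [  0.0,   1.0,  0.0],
])

# With a uniform Dirichlet prior, the posterior parameters for each
# row are simply counts + 1.
alpha = counts + 1.0
row_sums = alpha.sum(axis=1, keepdims=True)

mean = alpha / row_sums                   # posterior mean probabilities
var = mean * (1 - mean) / (row_sums + 1)  # per-entry Dirichlet variance
rel_uncertainty = np.sqrt(var) / mean     # large wherever data is scarce

print(mean)
print(rel_uncertainty)  # row 2 is far less trustworthy than rows 0 and 1
```

An analysis that sees these uncertainties can emphasize the well-observed transitions and discount the rest, which is the spirit of the approach described above.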