Guest post from Dr. Gregory Bowman, UC Berkeley

Two general objectives of the Folding@home project are (1) to explain the molecular origins of existing experimental data and (2) to provide new insights that will inspire the next generation of cutting-edge experiments. We have made tremendous progress in both areas, but particularly in the first. Obtaining new insight is even more of an art and, therefore, less automatable.

To help facilitate new insights, I recently developed a Bayesian algorithm for coarse-graining our models. To explain: when we are studying some process, like the folding of a particular protein, we typically start by drawing on the computing resources you share with us to run extensive simulations of the process. Next, we build a Markov model from this data. As I’ve explained previously, these models are something like maps of the conformational space a protein explores. Specifically, they enumerate the conformations the protein can adopt, how likely the protein is to form each of these structures, and how long it takes to morph from one structure to another. Typically, our initial models have tens of thousands of parameters and are capable of capturing fine details of the process at hand.
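
To make the idea of such a map concrete, here is a rough sketch (my own illustration, not the Folding@home codebase) of the core data structure: a matrix of observed transition counts between conformational states, row-normalized into transition probabilities, from which the equilibrium probability of each structure follows. The three states and all counts below are invented.

```python
import numpy as np

# Hypothetical transition counts between three conformational states,
# observed at a fixed lag time (rows: from-state, columns: to-state).
counts = np.array([
    [90.0,  8.0,  2.0],   # e.g. "unfolded"
    [10.0, 80.0, 10.0],   # e.g. "intermediate"
    [ 1.0,  9.0, 90.0],   # e.g. "folded"
])

# Row-normalize counts to get transition probabilities:
# T[i, j] = probability of moving from state i to state j per lag time.
T = counts / counts.sum(axis=1, keepdims=True)

# The stationary distribution (how likely the protein is to be found in
# each structure at equilibrium) is the fixed point of pi @ T = pi;
# simple power iteration converges to it.
pi = np.ones(3) / 3
for _ in range(1000):
    pi = pi @ T

print(T)   # transition probabilities
print(pi)  # equilibrium probability of each conformation
```

A real model of this kind has thousands of states rather than three, which is exactly why the matrix becomes hard to interpret by eye.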

Such models are superb for making a connection with experiments because we can capture all the little details that contribute to particular experimental observations. However, they are extremely hard to understand. Therefore, it is to our advantage to coarse-grain them. That is, we attempt to build a model with very few parameters that is as close as possible to the original, complicated model. If done properly, the new model can capture the essence of the phenomenon in a way that is easier for us to wrap our minds around. Based on the understanding this new model provides, we can start to generate new hypotheses and then test them with our more complicated models and, ultimately, via experiment.
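
In its simplest form, coarse-graining amounts to lumping many fine-grained states into a few macrostates and re-estimating the transitions between the lumps. The toy sketch below (my own illustration, not the algorithm described in this post) merges four hypothetical microstates into two macrostates by summing their transition counts:

```python
import numpy as np

# Hypothetical microstate transition counts (4 microstates): states 0-1
# interconvert quickly, as do states 2-3, with rare hops between the pairs.
counts = np.array([
    [50.0, 40.0,  5.0,  5.0],
    [40.0, 50.0,  5.0,  5.0],
    [ 5.0,  5.0, 50.0, 40.0],
    [ 5.0,  5.0, 40.0, 50.0],
])

# A lumping: microstates 0,1 -> macrostate 0; microstates 2,3 -> macrostate 1.
labels = np.array([0, 0, 1, 1])
n_macro = labels.max() + 1

# Aggregate counts over the lumps to get a 2x2 coarse-grained model.
coarse = np.zeros((n_macro, n_macro))
for i in range(len(labels)):
    for j in range(len(labels)):
        coarse[labels[i], labels[j]] += counts[i, j]

T_coarse = coarse / coarse.sum(axis=1, keepdims=True)
print(T_coarse)
```

The hard part, of course, is choosing which states to lump together when the best grouping is not obvious in advance, and that is where statistics enters.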

Statistical uncertainty is a major hurdle in performing this sort of coarse-graining. For example, if we observe 100 transitions between a pair of conformations and each of these transitions is slow, then we can be pretty sure this is really a slow transition. However, if we observe another transition only once and it happens to occur slowly, who knows? It could be that it is really a slow transition. On the other hand, it could be that we just got unlucky.
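
The intuition about 100 observations versus one can be made quantitative with a simple Bayesian posterior. In this sketch (my own illustration, with invented numbers), a probability estimated from observed events gets a Beta posterior whose width shrinks as the counts grow, so one lone observation leaves us almost as uncertain as we started:

```python
from math import sqrt

def beta_posterior(successes, trials, a=1.0, b=1.0):
    """Mean and std. dev. of a Beta posterior on a probability,
    starting from a uniform Beta(a, b) prior."""
    a_post = a + successes
    b_post = b + trials - successes
    total = a_post + b_post
    mean = a_post / total
    var = (a_post * b_post) / (total**2 * (total + 1))
    return mean, sqrt(var)

# 100 observed events, all consistent: a tight posterior.
print(beta_posterior(100, 100))

# A single observation: a broad posterior -- "who knows?"
print(beta_posterior(1, 1))
```

The second posterior is more than an order of magnitude wider than the first, which is exactly the "we just got unlucky" possibility expressed as a number.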

Existing methods for coarse-graining our Markov models assume we have enough data to accurately describe each transition. Therefore, they often pick up these poorly characterized transitions as being important (for protein folding, we typically care most about the slow steps, so slow and important are synonymous). The new method I’ve developed (described here) explicitly takes into account how many times a transition was observed. Therefore, it can appropriately place emphasis on the transitions we observed enough times to trust while disregarding the transitions we don’t trust. To accomplish this, I draw on Bayesian statistics. I can’t do this subject justice here, but if you’re ever trying to make sense of data that you have varying degrees of faith in, I highly recommend you look into Bayesian statistics.
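
As a final hedged sketch (my own toy example, not the published method), here is one common way count information enters such an analysis: placing a Dirichlet posterior over each row of the count matrix attaches an uncertainty to every transition probability, so poorly sampled transitions announce themselves rather than masquerading as important:

```python
import numpy as np

# Hypothetical counts: states 0 and 1 are well sampled,
# while state 2 was visited only once.
counts = np.array([
    [100.0,  50.0,  2.0],
    [ 40.0, 120.0,  6.0],
    [  0.0,   1.0,  0.0],
])

# With a uniform Dirichlet prior, the posterior parameters for each
# row are simply counts + 1.
alpha = counts + 1.0
row_sums = alpha.sum(axis=1, keepdims=True)

mean = alpha / row_sums                   # posterior mean probabilities
var = mean * (1 - mean) / (row_sums + 1)  # per-entry Dirichlet variance
rel_uncertainty = np.sqrt(var) / mean     # large wherever data is scarce

print(mean)
print(rel_uncertainty)  # row 2 is far less trustworthy than rows 0 and 1
```

An analysis that sees these uncertainties can emphasize the well-observed transitions and discount the rest, which is the spirit of the approach described above.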