How do you build an MSM using adaptive sampling?

To start a simulation project, we must first choose some initial conformations (protein shapes) to begin with. The heuristic methods we have used so far include running high-temperature simulations, employing Rosetta’s Monte Carlo algorithm, and drawing starting points from related MSMs of similar proteins. Once we have a set of conformations, each of them becomes the starting point for a group of simulations that we call a Run. Within each Run, we launch many trajectories, each called a Clone. All of the Clones in a Run start from the same initial protein shape, but each has different initial velocities, i.e. the atoms are given a different initial push in one direction or another. The Clones from a Run may find additional conformations, in which case that Run ends and several more Runs are started from the new conformations. This process continues, with many Runs branching out to other conformations and sometimes merging back into a shape that other Runs have also reached. In the end, we have a model with tens of thousands of different conformations (terabytes of data!). It shows all the shapes and energy states that the protein can take on while it’s folding towards its “native state”, the probability of each transition, and how long it takes the protein to complete a transition from one conformation to another. More importantly, we can identify the places where the protein misfolds and gets stuck, which leads to further research and models on how to prevent this from happening. The more computers we have participating, the faster we can complete the Markov State Model.
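The branching process described above can be illustrated with a toy sketch. This is not Folding@home's actual code: the conformations here are just integers, each Clone is a seeded random walk standing in for a molecular dynamics trajectory, and every name (`simulate_clone`, `adaptive_sampling`, `transition_probabilities`) is hypothetical. As a simplification, each Clone runs for a fixed number of steps rather than the Run ending as soon as something new is found.

```python
import random
from collections import Counter, defaultdict, deque


def simulate_clone(start, seed, n_steps=20, n_states=30):
    """Toy stand-in for one Clone: a random walk over integer 'conformations'."""
    rng = random.Random(seed)  # a different seed plays the role of different initial velocities
    state = start
    visited = [state]
    for _ in range(n_steps):
        # Step to a neighbouring conformation, clamped to the valid range.
        state = min(max(state + rng.choice((-1, 1)), 0), n_states - 1)
        visited.append(state)
    return visited


def adaptive_sampling(initial_conformations, clones_per_run=5, max_runs=50):
    """Start one Run per known conformation; newly found conformations seed new Runs."""
    known = set(initial_conformations)
    queue = deque(initial_conformations)
    runs = []  # list of (starting conformation, list of Clone trajectories)
    while queue and len(runs) < max_runs:
        start = queue.popleft()
        run_index = len(runs)
        trajectories = [
            simulate_clone(start, seed=run_index * 1000 + clone)
            for clone in range(clones_per_run)
        ]
        runs.append((start, trajectories))
        for trajectory in trajectories:
            for conformation in trajectory:
                if conformation not in known:
                    known.add(conformation)
                    queue.append(conformation)  # becomes the start of a later Run
    return runs, known


def transition_probabilities(runs):
    """Estimate transition probabilities by counting observed steps between conformations."""
    counts = defaultdict(Counter)
    for _, trajectories in runs:
        for trajectory in trajectories:
            for a, b in zip(trajectory, trajectory[1:]):
                counts[a][b] += 1
    return {
        a: {b: n / sum(row.values()) for b, n in row.items()}
        for a, row in counts.items()
    }


runs, known = adaptive_sampling([0, 15])
probabilities = transition_probabilities(runs)
print(f"{len(runs)} Runs explored {len(known)} conformations")
```

A real MSM is built from far richer data, with conformations identified by clustering the trajectories, but the loop has the same shape: simulate, look for new states, start new Runs from them, and count transitions.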

Aren’t these the PRCG numbers?

Yes. Work Units are labeled with four distinct numbers in the format: Project (Run, Clone, Generation). We just described the first three: a Project is the protein under study, a Run is a set of simulations started from a particular conformation, and each Run contains many Clones that differ in their initial velocities. Although Folding@home processes many different Projects, Runs, and Clones at the same time, the Clones themselves are serial in nature. Each one has to be simulated from start to finish, and it would be impractical for one computer to complete a whole Clone by itself. Instead, your computer is given a piece of a Clone. We identify the piece using the Generation (Gen) number. One computer starts with Generation 0, and when it finishes, another computer is given Generation 1, and so on. We cannot start Gen 1 until Gen 0 finishes, and there may be hundreds of Gens. This is why Work Units have deadlines, and why speed is so important to us.
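As a rough illustration of why Generations are serial, here is a minimal sketch of how a single Clone could be tracked. The `WorkUnit` and `CloneTracker` names and the project number are invented for this example and do not reflect the real Folding@home server software. The tracker hands out only one Generation at a time and refuses to assign the next one until the current one has been returned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkUnit:
    """One piece of a Clone, identified by its PRCG numbers."""
    project: int
    run: int
    clone: int
    gen: int

    def label(self):
        return f"Project {self.project} (Run {self.run}, Clone {self.clone}, Gen {self.gen})"


class CloneTracker:
    """Hypothetical bookkeeping for one Clone: Gens are assigned strictly in order."""

    def __init__(self, project, run, clone, total_gens):
        self.project = project
        self.run = run
        self.clone = clone
        self.total_gens = total_gens
        self.completed = 0       # number of Gens finished so far
        self.outstanding = None  # the Work Unit currently assigned, if any

    def assign(self):
        """Return the next Work Unit, or None if one is still in progress or the Clone is done."""
        if self.outstanding is not None or self.completed >= self.total_gens:
            return None
        self.outstanding = WorkUnit(self.project, self.run, self.clone, self.completed)
        return self.outstanding

    def finish(self, work_unit):
        """Record a returned Work Unit so that the next Gen can be assigned."""
        if work_unit != self.outstanding:
            raise ValueError("this Work Unit is not the one currently assigned")
        self.outstanding = None
        self.completed += 1


tracker = CloneTracker(project=1234, run=5, clone=7, total_gens=3)  # made-up numbers
gen0 = tracker.assign()
print(gen0.label())
blocked = tracker.assign()  # None: Gen 1 cannot start until Gen 0 is returned
tracker.finish(gen0)
gen1 = tracker.assign()
print(gen1.label())
```

Because each Clone can only advance one Gen at a time, a slow or missing Work Unit stalls everything behind it in that Clone, which is the practical reason for deadlines.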