What are Markov State Models?

Protein folding is statistical in nature, so a protein can fold in many ways. We need a map to be able to see the bigger picture. Markov State Models (MSMs) are a way of describing all the conformations (shapes) a protein – or other biomolecule for that matter – explores as a set of states (i.e. distinct structures) and the transition rates between them. They also map out the protein’s motion and energy properties as it folds from one shape to another. Once we have all this information, we can observe the factors that influenced folding, which is especially important if the protein misfolds. Much of the theory underlying these methods is quite old but their use has been limited by the challenges inherent to identifying a reasonable set of states.

MSMs are particularly useful to us as they facilitate parallelization across many computer processors by allowing for the statistical aggregation of short, independent simulation trajectories. This replaces the need for single long trajectories, and thus has been widely employed by distributed computing networks such as Folding@home and GPUGRID. Further, through adaptive sampling, MSMs provide a way to increase the efficiency of simulation without introducing artificial biases or approximations. We’ve been making a lot of progress with developing Markov state model (MSM) methods for analyzing the data we generate with the help of the FAH community. Several Pande Group members include Drs. Xuhui Huang and Gregory Bowman have developed MSMBuilder, an open-source software package used to build, analyze, and visualize MSMs. Since its release in 2009, it’s been download over 1,600 times across five continents and has been used in at least 40 publications to date.

Formally, MSMs are a specific application of discrete-space master equations parameterized from simulation. They consist of two parts: a state space partitioning X, typically chosen to divide the system into a set of metastable states; and a master equation describing kinetics on X, represented by either a transition matrix T or rate matrix. Both the state space and master equation are found from molecular simulation. The precise manner in which this is done varies widely.