Protein-Protein Docking

Protein-protein interactions play a central role in various aspects of the structural and functional organization of the cell, and their elucidation is crucial for a better understanding of processes such as metabolic control, signal transduction, and gene regulation. Genome-wide proteomics studies, primarily yeast two-hybrid assays, will provide an increasing list of interacting proteins, but only a small fraction of the potential complexes will be amenable to direct experimental analysis. Thus, it is important to develop docking methods that can elucidate the details of specific interactions at the atomic level.

Our multistage approach to protein-protein docking

Our procedure starts with rigid body global search based on the Fast Fourier Transform (FFT) correlation approach that evaluates the energies of billions of docked conformations on a grid. In ClusPro 1.0 we use the docking programs DOT and ZDOCK, but in ClusPro 2.0 we have changed to our new program PIPER[1]. PIPER is also FFT-based, but the method is extended to be used with pairwise interaction potentials. With DOT and ZDOCK we retain 20,000 and 2,000 conformations, respectively. The number of structures is reduced by rigid body filters based on empirical potentials and electrostatics calculations. Due to the use of the more accurate pairwise potential in PIPER it is enough to retain 1000 structures, and we do not need the filtering step, The retained structures are clustered using the pairwise RMSD as the distance measure and a fixed or variable clustering radius[2]. We have shown that the 30 largest clusters contain at least one near-native structure (defined as having less that 10 Å RMSD from the ligand in the x-ray structure, calculated for ligand atoms that are within 10 Å of the fixed receptor) for 93 % of the complexes[3] in the protein docking benchmark set. The structures in these clusters are refined by a novel medium-range optimization method called SDU (Semi-Definite programming based Underestimation)[4] which has been developed to locate the global energy minima within the regions of the conformational space defined by the separate clusters. The procedure was used in the last rounds of CAPRI with very good results.

PIPER: FFT-based docking with pairwise potentials

PIPER performs exhaustive evaluation of an energy function in discretized 6D space of mutual orientations of two proteins. We sample 70,000 rotations which approximately correspond to sampling at every 5 degrees in the space of Euler angles. In the translational space the sampling is defined by the 1.2 Å grid cell size. The energy-like scoring function describing the receptor-ligand interactions is defined on this grid and is efficiently calculated using Fast Fourier transforms. Results are clustered with a 10 Å cube size, and one or several lowest energy translations for the given rotation are retained. Finally, results from different rotations are collected and sorted. The novelty of the PIPER algorithm is that the scoring function includes an energy term of the form Epair = ΣiΣjεij, where εij is a pairwise interaction potential between atoms i and j. The key to the efficient use of this potential within the FFT framework is the eigenvalue-eigenvector decomposition of the interaction matrix. The complete scoring function is given as the sum of terms representing shape complementarity, electrostatic, and desolvation contributions, the latter described by the pairwise potential. We have shown that PIPER increases the number of near-native conformations in the top 1000 or 2000 structures relative to other FFT-based docking programs[1]

PIPER can be easily tested using our new server ClusPro 2.0, and it is freely available for noncommercial applications.

Clustering of low energy docked conformations

We cluster the retained 1000 conformations using pairwise ligand RMSD as the distance measure. Our goal is finding large clusters of structures below a certain energy level, indicating minima that are both deep and have a broad region of attraction. We use a simple greedy algorithm to find the structures with the largest number of neighbors within a clustering radius rc. The value of rc depends on a clustering parameter 0 ≤ Δ ≤ 1, which is based on the histogram of pairwise RMSD values, and measures the depth of the separation between clusters[2]. We generally retain 30 clusters, each of them indicating a region of attraction around a local energy minimum.

We have recently reduced the number of retained clusters by testing the stability of local minima[5]. Since structures at narrow minima loose more entropy, some of the non-native states can be detected by determining whether or not a local minimum is surrounded by a broad region of attraction on the energy surface. The analysis is based on starting Monte Carlo Minimization (MCM) runs from random points around each minimum, and observing whether a certain fraction of trajectories converge to a small region within the cluster. The cluster is considered stable if such a strong attractor exists, has at least 10 convergent trajectories, is relatively close to the original cluster center, and contains a low energy structure. We studied the stability of clusters for enzyme-inhibitor and antibody-antigen complexes. All clusters that are close to the native structure are stable. Restricting considerations to stable clusters eliminates around half of the false positives, i.e., solutions that are low in energy but far from the native structure of the complex.

DARS (Decoys As the Reference State) potentials

The PIPER program was used with a new class of structure-based potentials called DARS (Decoys As the Reference State), based on the inverse Boltzmann approach (Kozakov et al., 2006). A statistical potential between two atoms ai and aj of types I and J, respectively, and located within a certain cutoff distance D, is defined by εIJ = -RT ln (pIJ), where R is the gas constant, T is the temperature, and pIJ denotes the probability of two atoms of types I and J interacting. This probability is approximated by the frequency pIJ = νobs(I,J) / νref(I,J) where νobs(I,J) is the observed number of interacting atom pairs of types I and J, and νref(I,J) is the expected number of interacting atom pairs of types I and J assuming an appropriate reference state. While νobs(I,J) can be directly determined by counting the number of intermolecular interactions between atoms of types I and J in a database of protein-protein complexes, the selection of the reference state remains a critical feature. The general assumption in the reference state is that the specific interactions determining the distribution of interaction sites are removed as much as possible. The most frequently used reference state is defined by νref(I,J) = νobs XI XJ where νobs is the total number of interacting pairs with the distance constraints d < ri,j < D, and XI and XJ are the mole fractions of atom types I and J, respectively.

The novelty of the DARS approach is that we generate a large decoy set of docked conformations using only shape complementarity as the scoring function (i.e., without any account for the atom types), and use the frequency of interacting atom pairs in these decoy structures as the reference state. Thus, developing the potential we compare the frequency of contacts between two specific atom types in X-ray structures of protein complexes to the frequency of contacts in the decoys that are devoid of specific interactions. Since the goal is finding complex conformations close to the native among the many structures that all have good shape complementarity, this scoring scheme is very natural, as it rewards the occurrence in the interface of the atom pairs that are frequently seen to interact in the native complexes. Due to the use of the more accurate DARS potential in PIPER, we do not need further filtering, and the top 1000 structures are retained for clustering.

Although the current version of the DARS potential improves docking results for almost all complexes, it works best for enzyme-inhibitor. We are currently developing special DARS potential for antigen-antibody complexes.

SDU: Docking refinement by semi-definite underestimation

SDU (Semi-Definite programming based Underestimation) is a stochastic global optimization method based on the assumption that the free energy, ΔG, is a funnel-like function within the region defined by each cluster[4]. Given a set of K local minima x1,…,xK of ΔG, we construct a convex quadratic function U(x) = x′Qx + b′x + c that underestimates ΔG at all local minima i.e., U(xi) ≤ ΔG(xi), i = 1, . . . ,K. The underestimator U(x) can be found by solving a semi-definite programming problem. Assuming that U(x) reflects the funnel-like behavior of free energy function, we call the global minimum xP of U the “predictive conformation”, where xP = −(1/2)Q-1b. We bias sampling toward xP by selecting points from the probability density function proportional to –U. The new points are used as starting states in local energy minimization, and then added to the set of local minima. High energy samples or those far from xP are discarded. If there is no progress in reducing the minimum energy over several iterations then the algorithms is terminated.

In spite of the success of SDU as an optimizer for functions with funnel-shaped basins, its application to docking turned out to be more difficult than expected. SDU is effective for selecting moves in either translational or rotational subspaces (Paschalidis et al., 2006). However, the direct application of SDU in the space of rotations and translations fails to yield useful underestimators. Alternating searches in rotational and translational subspaces yields a feasible but inefficient algorithm. We have substantially improved performance by separately optimizing the center-to-center distance and describing SE(3) in terms of five angles. It is potentially important that this strategy samples encounter complexes, and hence it is reminiscent of the model of molecular association through a series of micro-collisions.

ClusPro: Fully automated docking and discrimination servers

In 2004 we have implemented the first three steps of our multistage algorithm in ClusPro, the first fully automated protein-protein docking and discrimination server. ClusPro 1.0 includes (1) rigid body docking by DOT or ZDOCK; (2) selection of 2000 docked structures with favorable desolvation and electrostatics, (3) clustering of the retained complexes using a pairwise RMSD criterion, and reporting the centers of the 30 largest clusters. Since the server does not provide further discrimination, the most reasonable use of docking at this point is to consider the top predictions as putative structural models, and employ low resolution or non-structural experiments such as cross-linking or site-directed mutagenesis for selecting the correct model.

By September 2007, the server performed 13,915 docking calculations for 1,824 outside users. We have found over 35 research papers that used ClusPro as an important part of the methodology.

We have recently released ClusPro 2.0. The main change is that the new server docks the two proteins using the improved docking program PIPER.

Our results at CAPRI

CAPRI (Critical Assessment of Predicted Interactions), the first communitywide experiment devoted to protein docking (Janin et al., 2003). For each target, predictor groups may submit up to ten models to independent assessors, who compare the predictions with newly obtained X-ray structures of the complexes, which crystallographers make available for the evaluation. So far, CAPRI results were evaluated for 28 targets in Rounds 1-12. Two targets (22 and 23) were cancelled due to the early release of X-ray structures, and no acceptable solutions were submitted for three targets (4, 5, and 28), leaving 23 valid targets. The table here shows the summary of results for all groups that submitted at acceptable or better solutions for 10 or more targets in Rounds 1-12 of CAPRI. The table was obtained by summing the results for each group from the three separate CAPRI evaluation meetings. We have participated both as a predictor group (in Rounds 1-5 under the name of Camacho, who was a research assistant professor in the PI’s lab at that time) and as the ClusPro server (starting with Round 3). As shown, we have submitted acceptable or better predictions for 12 of the 23 valid targets, which amounts to the second best overall performance in CAPRI Rounds 1-12. The table also shows the performance of the ClusPro server, which had to submit predictions within 24 hours and without the use of any biological information.

Best performing predictor groups in Rounds 1-12
Predictor group * ** ***
Weng (Boston University) 14 7 3
Camacho/Vajda (Boston U.) 12 5 4
Abagyan (Scripps) 11 6 3
Baker (Univ. of Washington) 11 4 5
Wolfson/Nussinov (Tel Aviv U.) 11 3 1
Eisenstein (Weizmann Inst.) 10 1 4
ClusPro (Boston University) 7 2 1
*Acceptable, ** Medium, *** High accuracy prediction as defined by Mendez et al (2003)