Abstract
Background and Literature Survey
Conventional prediction uses rigid docking algorithms, but the computational results have been unsatisfactory. This is largely due to the fact that protein structures are flexible in live environments. First, the protein backbone is subject to changes during docking, and second, the involving motifs, side chains, in the bound state can be differently conformed in contrast to the unbound state. Though the conformational deformations of the side chains are not as significant as the backbone movements, they do play an important role in forming the close fit between two docking proteins. Therefore, such flexibility in motifs should be taken into account in the docking algorithms.
In the literature, a divide-and-conquer strategy is adopted for docking algorithms, with an initial-stage of candidates (hits) followed by a refinement step. 1 The refinement step aims to build an effective scoring function 2 to rank the near-native docking candidates in the top of the list. To deal with the side-chain flexibility in motifs, Cherfils et al 3 described the side chains with a crude low-resolution model to account for flexibility. Jackson et al 4 introduced side-chain flexibility into a two-stage process, which is as follows: (i) to apply the self-consistent mean field algorithm to find out the best conformation of each side chain from its rotamer library, taking solvation into account, and (ii) to perform rigid-body minimization of the intermolecular interaction energy on the interface region only, during which the larger protein is fixed. Zacharias 5 introduced side-chain flexibility in ATTRACT by using a rotamer library that contained no more than three pseudo atoms for each amino acid. 6 In RosettaDock, Wang et al 7 used a discrete rotamer library that was complemented by the side-chain conformation of the unbound state and consequently executed continuous optimization of side chains in the vicinity of the rotamers.
Prediction of the side chain can be converted into an optimization problem that aims to discover the combination of rotamers of all residues in the involving motifs to achieve the global minimum. 8 This optimization problem has been found to be NP-hard. Though many algorithms adopt the strategy of restraining the topologies of the residues to simplify and resolve this problem, it is still difficult to get a unique solution. Many factors can affect the stability of 3D structure of a protein complex, to name a few, physiochemical complementarity such as electrostatic complementarity, Coulomb potential, hydrophobicity (HP), Lennard-Jones potential, and so on. 9 Meanwhile, from the perspective of geometry, the two docking proteins usually exhibit geometric complementarity, with the surfaces of the involving motifs matching each other tightly. Therefore, geometric complementarity should be used in this regard.
In this study, we sample the possible side-chain conformations by using the rotamer library to improve local geometric complementarity. Our unique strategy includes ensembles of structures for the flexible patches associated with the motifs of the receptor and ligand considered and a fast docking method for achieving a list of docking candidates based on their ensembles.
Side-Chain Flexibility in Motifs and the Algorithm
Applying Morse theory to flexible protein surface segmentation
Structural information of a protein molecule is available in the database Protein Data Bank (PDB), especially the 3D coordinates of all the atoms that constitute the protein. We base the geometric complementarity on atomic representation of the residues lying on the surface. A mathematical model has been developed to describe the protein surface with a sparse distribution of the atoms on the protein surface. We start with the solvent-excluded surface (SES) extraction, as shown in Figure 1.

SES extraction using a probe sphere: the atoms are represented with the spheres of different van der Waals radii; the original protein molecule is made up of a list of overlapping spheres.
In the model, the atoms are represented with the spheres of different van der Waals radii. The original protein molecule is assumed as a set
An SES is continuous but may contain singularities. For the segmentation purpose, we simplify the representation to a triangular mesh. This algorithm has been implemented, as shown in Figure 2, where the surface is extracted and triangulated over the macromolecule 4PTI.pdb.

SES is extracted from
Upon acquisition of the triangular mesh of the SES, it is segmented into patches in order to do geometric matching between the protein surfaces. Let
To apply the Morse theory, let
An initial segmentation will decompose the surface coarsely into three types of surface regions, namely, concave, convex, and flat. Both the Gaussian and mean curvatures are calculated for all vertices
The three different types of regions are further decomposed into smaller patches. During this procedure, we extract a list of critical points and then around each critical point create a surface patch that contains all the surface points whose geodesic distance from the critical point is less than the experimental value 16 Å.
A local coordinate space for each vertex
Upon the Gaussian and mean curvatures computed on the mesh, we utilize the criteria proposed by Besl and Jain 13 to label each triangle on the mesh. concave, convex, or flat. 14 The peak, ridge, and saddle ridge are considered as the convex type; the flat and minimal surfaces are contained in the flat type, and finally, the concave type consists of pit, valley, and saddle valley.
Finally, we have developed a region-growing algorithm to decompose the entire triangular mesh into convex, concave, and flat regions. Each triangle and its neighboring triangles are segmented into the same surface type. The algorithm is described in the following pseudocodes.
Algorithm: Region Growing in the Three Types of Patch
Initializing the segment number seg_id of each triangle in TriArray to −1, assign the current segment number id=0;
Mark the surface type for all the triangles in TriArray;
In the procedure, the region is expanded in all directions around a critical point 15 until it reaches the region contour of the surface, resulting in an elementary surface patch. To show the result using the protein 1CGI_ligand.pdb, in Figure 3A, the yellow points represent the critical points on the molecular surface and in Figure 3B, the protein surface regions are segmented into three types of patches. convex (in green), flat (in yellow), and concave (in pink).

Elementary surface patch using the protein 1CGI_ligand.pdb:
Extraction of involving motifs associated flexible interaction sites
Based on the segmented surface, we can now describe the protein with an ensemble of conformations that incorporate the flexibility of interface side chains and are sampled using rotamers. As will be further discussed in the next section, the highest geometric matching score is assumed to correspond to the closest conformation in the actual bound state. Therefore, we focused on reduction of the number of possible side-chain conformations so as to get a sample of rational size. The main processes are as follows:
As protein docking is dominated by short sequences of amino acids, we use these sequences to control the conformational deformations during binding. The atoms in the short sequences constitute the contact surfaces between the two docking proteins. Our algorithm identifies the flexible contact surface as well as the amino acids on the contact surface, referred as interaction sites.
In contrast to the other amino acids in the protein surfaces, the interaction sites have some unique properties that facilitate binding the two proteins together. In our algorithm, the priority in identification of the interaction sites is for the physicochemical and geometric complementarity, such as the cavities of the surface. Hence, instead of a full computation of the conformation of the whole molecular surface, we mainly compute the complementarity of the side chains on the interface sites between the two docking proteins.
Next, as the interface region is small, so it is sufficient to describe the conformational deformations within the local surface regions. We divide the interface sites into small molecular surface patches that contain eight or nine residues in practices. During the docking process, a set of patches is selected to lay over the entire molecular surface. We compare our sampled conformations with the side-chain conformations in the bound complex.
To incorporate the flexibility into the interaction sites, various conformations of the side chains of the residues contained in the patch are generated. To estimate the conformations of side chains, we also consider the conformational space of the side chains as discrete, and assume each discrete conformation as a rotamer. The rotamers are then grouped into the rotamer library.
Furthermore, the algorithm identifies the best combination of rotamers that corresponds to the lowest-energy state. A through listing of all possible rotamer combinations of the amino acids included in one surface patch eventually results in a few flexible conformations.
Finally, to avoid only adopting the 3D conformations with the lowest energy in the combination, we also consider a better sampling of the conformational space approachable by the patch. Our experiments show that conformations with similar low energies show few differences around the same local minimum. Hence, in order to achieve a broader sampling, we subdivide the conformational space of each residue into three parts based on its three common c1 torsion angles (60°, 180 °, and −60°).
Figure 4 presents the key steps in extraction of the flexible interaction site for the 1CGI complex obtained from PDB. Figure 4A gives an overview of the conformation, and Figure 4B reveals the interaction sites (ribbon rendering) in the midst of the running algorithm. Figure 4C shows the extracted ligand patch (surface shading), and Figure 4D gives the identified interaction site from the ligand. In this example, we assume that the patch includes one tryptophan that has 6 rotamers, two aspartic acids that each has 3 rotamers, one glutamine that has 28 rotamers, one lysine that has 8 rotamers, one arginine that has 6 rotamers, and two serines that each has 3 rotamers. Therefore, the flexible surface has a total of 6 × 3 × 3 × 28 × 8 × 6 × 3 × 3 = 653,184 possible conformations.

Extraction of the interface sites for the 1CGI complex:
In practice, we can sample configuration of every possible torsion angle cl of the patch residues. As mentioned in the algorithm descriptions, we assume that there are about eight or nine residues on each patch. For example, take the patch that has nine residues. Each residue has no more than three cl angle possibilities. The amount of cl configurations reaches to 39 = 20,000. Rather than fetching the lowest-energy samples out from all the conformations, we pick out the lowest-energy structure of each of the about 20,000 cl configurations as follows:
Fix the cl configurations and choose the torsion angles c2 and c3 in the patch by energy minimization. The energy function involves the van der Waals energy, Coulomb energy, and so on.
Apply the clustering methods, such as
Select one configuration that has the lowest-energy conformation from each cluster, and put it into the final sample for the patch.
Spherical harmonic descriptor (SHD)-based surface matching and alignment
With the flexible surface representation, we are able to get all the possible relative positions of the two docking monomers. The 3D conformations of protein complexes indicate a close geometric match on the interface sites. Hence, geometric matching plays a significant role in the docking. As described in the previous sections, our flexible docking method is based on local-shape feature matching. Surface complementarity is largely the geometric similarity matching. The key is how to extract and describe the shape features for the similarity matching, considering that proteins can have the same representation with their conformation after a geometric transformation.
Two methods can be considered. 3D protein structures are normalized by applying a canonical transformation, or the 3D structures are described by a transformation invariant descriptor. In general, protein structure can be normalized by moving its mass center to the origin of the coordinate system for translation and by setting main axes for rotation. Existing methods are robust for translation normalization but not rotation normalization. Therefore, docking is better to be based on local-shape feature matching by a 3D shape descriptor. Based on the concept of the SHD, 16 we propose a rotation invariant protein structure descriptor. The main processes are as follows.
First, the surface patch is represented with a three-dimensional matrix
where
The surface patch is translated so that the centroid is located at the point (0, 0, 0), and thus, the radius of the bounding sphere is
Then, the Cartesian coordinates (
With different values of radii, a set of spherical functions {
Assuming
By separating the variables, two differential equations can be calculated by imposing Laplace's equation:
We can now obtain the spherical harmonic function by further separating the variables:
Based on the spherical harmonics, the spherical function can be represented with the amount of energy it has at different frequencies. We first decompose the spherical function into its spherical harmonics, then summarize the harmonics within each frequency, and finally, compute the norm of each frequency component as follows:
Let υ1 be the subspace of the spherical harmonics. Then, we have
For any
For a spherical rotation, we have the property:
The energy at each frequency I is rotation invariant:
From the properties of the spherical harmonics, we know that the
The above expression is independent of the orientation of the spherical function, so we get
With the above derivation, we get the rotation invariant feature for each surface patch, which is given as
In this way, to measure the similarity between two surface patches, we calculate the Euclidean distance between the two corresponding SHDs, which is as follows:
In our study, a protein surface is segmented into three types of patches: convex, concave, and flat. We use the derived rotation invariant 3D shape descriptor to get the signature for each patch. By comparing the signatures, we are able to filter out the patch pairs with the most similarities. As illustrated in Figure 5, each convex patch from the receptor will be matched with all the concave patches of the ligand and vice versa.

Pairwise similarity between convex patches of ligand with concave patches of receptor and vice versa: the derived rotation invariant 3D shape descriptor is used to get the signature for each patch.
During the process of alignment, a transformation matrix for the ligand is computed by using the iterative closest point (ICP) algorithm. As shown in Figure 6, the red points and blue points, respectively, belong to the point sets A and B for registration. As each surface patch is composed of a set of points, ICP is used for geometric alignment of the two 3D protein surface patches, one from the receptor that is fixed and the other from the ligand that is transformed. In our experiments, as the area of the surface patch is fairly small, we set the iterative parameter as 10. The final transformation matrix is applied to transform the ligand protein, which, in turn, binds the receptor protein to form the candidate structure.

Registration of two different point sets using the ICP algorithm: the red points and blue points, respectively, belong to the point sets A and B for registration. ICP is used for geometric alignment of the two 3D protein surface patches, one from the receptor that is fixed and the other from the ligand that is transformed.
Scoring of complementarity
Upon complementarity matching, a list of candidate complex conformations will be generated. We have to select the near-native complex conformations from the docking candidates, known as scoring or ranking in protein-protein docking.
Owing to the fact that most of the energy obtained upon complex formation is derived from the hydrophobic effect and the hydrophobic forces are short ranged, the two docking proteins should have short distances between interaction sites. This is in correspondence with the premise that the two docking monomers should exhibit corresponding radii of curvature on macroscopic and microscopic scales on their surfaces. We can accordingly build the complementarity of the confrontation of concavities on one side and convexities on the other side in the interaction sites.
The scoring method used in our work is to partition the receptor into several shells based on the distance from the protein surface. Each shell is determined by a range of distances in the 3D distance grid. Each conformation of the protein–protein docking candidates is scored by a weighted function of the number of ligand surface points in each shell. Briefly, the scoring function is computed after transformation and alignment, while the SES of the ligand enters into the 3D distance grid of the SES of the receptor. Each surface point of the ligand is assigned with a value according to its distance from the surface of the receptor. The scoring function is defined as
In this equation,
While this method can give an accurate geometric scoring for the 3D structure of the docking candidates, its computational time grows fast with the increment of size and resolution of the surface of the ligand. Multiresolution can be considered. In our scoring system, a low-level mesh resolution with the point density of one point per angstrom and a high-level mesh resolution with the point density of four points per angstrom have been utilized in the scoring function. First, the low-resolution mesh surface is applied to all the docking candidates, and only the 3D conformations with the highest scores are extracted to be further filtered by using their high-resolution mesh surface.
In the final step, we calculate the RMSD between each candidate structure and native structure of the corresponding complex. The results are filtered into high accuracy (RMSD < 2.5 Å) and medium accuracy (RMSD < 5 Å). In our experiments, there are two thresholds 2.5 Å and 5 Å. If the distance is less than the threshold, we call the candidate structure a hit.
Computational Experiment Results
To assess the performance of our flexible docking system, FlexDock, a public dataset
The experimental results of our FlexDock are compared with those of the other two docking systems. PatchDock is local patch-based searching method where the SES is generated, 18 while ZDock is a fast Fourier transform-based global searching method where the surface residues are extracted. 19 As in the ZDock system, when the sampling angular is set to 15°, the total amount of conformations predicted is limited to 3600. For a fair comparison, only the top 3600 docking candidates were considered for all the three systems.
In the experiments, we set the threshold to 2.5 Å. For each system, the rank of the first hit within the top 3600 candidates is described for all the test cases in the Benchmark v2.4. Owing to the space constraint, only the top 84 hits of this large table are given in Table 1. These hits do not necessarily correspond to the smallest RMSD value. We found that our FlexDock identified 70 out of 84 hits in the top list of the database, which is much more than 45 and 57, respectively, from the other two systems.
Comparisons: rank gives the first hit within the top 3600 predictions with a threshold of 2.5°Å (only top 84 are shown).
To assess the execution performance, we looked into the different stages of the docking tasks. In the first process, the molecular surface of the receptor and ligand is first extracted and segmented into convex and concave patches, respectively. We can see that the number of atoms and the corresponding segmented surface patches for each complex in the Benchmark v2.4 increase proportionally to the increment in the atom number of the two docking proteins. The average running time for preprocessing and other steps during docking is described in Table 2. The experiments are conducted using a CPU with a quad-core 3.20 GHz processor and 4 GB RAM.
The average running time for each step.
The first process consists of extracting and segmenting the protein surface. The average time cost in this step is 29.06 seconds for each pair of receptor and ligand. This task can be completed offline.
The time to calculate the descriptor for each patch is 0.49 seconds. In the experiment, the average patch number for each pair of the interacting proteins is 2673. Therefore, the average descriptor extraction time for the two docking proteins is 1309 seconds.
The comparison of descriptors is to calculate the distance between the two signatures of each patch pair as a value in the range [0.0, 2.0], with small values corresponding to two similar patches. The patch pairs are sorted based on the distance value from smallest to largest. Only the first several patch pairs will be used for aligning the receptor and ligand. The average time for comparison of descriptors between each pair of patches is 0.001 seconds, which is nearly 100 times faster than the scoring task. Therefore, the comparison of descriptors can be used as a prescoring and filtering tool. This can help to accelerate the docking procedure.
Overall, the average running time of the three systems for all the test cases in Benchmark v2.4 is given in Table 3. While our FlexDock is able to identify a hit for more cases than the other two docking systems, it runs faster than ZDock but slower than PatchDock. This is mainly because, for flexible surface docking, the number of segmented patches for each pair of receptor and ligand is much larger than those of the other systems.
The average running time of each system.
Discussions and Conclusions
The influence of the side-chain flexibility in motifs has been analyzed, and it has been added into the local surface patches for a soft surface representation. The flexibility of the side chains has also been implemented by the rotamers. It is helpful to improve the accuracy of the docking algorithm. We summarize the procedure in our proposed algorithm as follows:
Applying the Morse theory to the flexible protein surface segmentation: The region is expanded in all directions around a critical point until it reaches the region contour of the surface, resulting in an elementary surface patch. The surface regions are segmented into three types of patches: convex, flat, and concave.
Extraction of the involving motifs: Extraction of the flexible interaction site includes first the conformation, the progressive interaction sites, the extracted ligand patch, and the identified interaction site from the ligand.
Surface matching and alignment: All the possible relative positions of the two docking monomers are acquired. Based on the SHD, a rotation invariant protein structure descriptor is utilized. During the process of alignment, a transformation matrix for the ligand is computed by using the ICP algorithm. The final transformation matrix is applied to transform the ligand protein and to bind the receptor protein to form the candidate structure.
Scoring of complementarity: A list of candidate complex conformations is generated to select the near-native complex conformations from the docking candidates. A low-level mesh resolution with the point density of one point per angstrom and a high-level mesh resolution with the point density of four points per angstrom are utilized in the scoring function. If the distance is less than the threshold, the candidate structure is a hit.
Innovation in our algorithm designs benefits protein docking. First, in our model, we extract the SES of the protein molecules and segment it into concave, convex, and flat patches. The atoms are represented with the spheres of different van der Waals radii, and the original protein molecule can be represented in a list of overlapping spheres. The Morse theory is applied for analyzing the topology of the protein molecule surface by studying the differentiable functions on the surface. We are able to describe the protein with an ensemble of conformations. The flexibility of interface side chains is incorporated by sampling their conformations using rotamers, in which the matching scores correspond to the conformation in the actual bound state. Eventually, we are able to reduce the number of possible side-chain conformations so as to get a sample of rational size.
For the surface patch alignment, we propose a transformation matrix for computing the ligand. With each surface patch being composed of a set of points, ICP shows its usefulness in geometric alignment of the two 3D protein surface patches. The final transformation matrix can be applied to transform the ligand protein, which, in turn, binds to the receptor protein to form the candidate structure.
In candidate complex conformations, because the energy is most derived from the hydrophobic effects and the forces are short ranged, we are able to build the complementarity with concavity on one side and convexity on the other side.
While our proposed flexible docking method exhibits its advantages in identification of complementary candidates, we have not taken into considerations physicochemical factors during the scoring process. In the future, we will combine the geometrical and physiochemical factors, such as sequence conservation, HP, interface residue propensity, electrostatic potential and so on, to filter out the surface patches with binding sites.20,21
