PROTEINS: Structure, Function, and Genetics, Suppl. 1:68–73 (1997)

Analysis of Comparative Modeling Predictions for CASP2 Targets 1, 3, 9, and 17

Robert W. Harrison, Charles C. Reed, and Irene T. Weber*
Department of Microbiology and Immunology, Kimmel Cancer Center, Jefferson Medical College, Philadelphia, Pennsylvania

ABSTRACT Comparative modeling targets 1, 3, 9, and 17 were predicted by alignment of multiple sequences and structures, when available, followed by minimization using the program AMMP. The minimization used improved potentials and distance restraints for regions of common structure. New prediction procedures were evaluated. Three tested solvent corrections did not significantly improve the predictions. Target 17 had 85.3% sequence identity with the parent and no insertions or deletions. The prediction had a root-mean-square deviation from target 17 of 0.56 Å on Cα atoms, and 0.59 Å for the ligand atoms, which verified the accuracy of the minimization. Targets 1, 3, and 9 had 36.4%, 46.7%, and 33.3% identity with the parent sequences, and predictions resulted in root-mean-square deviations for 79–85% of Cα atoms of 1.49, 1.11, and 1.24 Å, respectively. Conformational differences between parent and target crystal structures were difficult to predict. The use of distance restraints and multiple structures improved the positioning of gaps in sequence alignment. Distance restraints did not overcome errors in sequence alignment or ambiguities due to conformational variation in proteins. Predictions for targets 3 and 9 successfully reduced large deviations between parent and target structures. Proteins, Suppl. 1:68–73, 1997. © 1998 Wiley-Liss, Inc.
Key words: homology modeling; energy minimization; distance restraints; protein structure; prediction errors

INTRODUCTION

Comparative modeling to predict the structure of one protein from sequence similarities with a related protein of known structure can be used to design protein engineering experiments, to predict how ligands bind to proteins, and to help solve experimental structures in solution or in the crystal. The Critical Assessment of Structure Prediction Methods 2 (CASP2) was valuable for objective evaluation of new prediction procedures by comparison of predicted and target structures. The results of comparative modeling in the 1994 CASP experiment and previous comparisons have shown that it is very difficult to predict the structure of regions with low sequence similarity, insertions, or deletions.1–4 However, functionally conserved ligand binding sites can be predicted with relatively high accuracy.2,4

For the CASP2 experiment, we incorporated the improved potentials described in ref. 5, and tested three simple solvent corrections, the use of multiple crystal structures for sequence alignment, and application of interatomic distance restraints during minimization. The proteins that were predicted by comparative modeling were target 1, dihydrofolate reductase from Haloferax volcanii; target 3, phosphotransferase enzyme IIA domain from Mycoplasma capricolum; target 9, cucumber stellacyanin; and target 17, rat liver glutathione transferase. Targets 1, 3, 9, and 17 had sequence identities of 36.4, 46.7, 33.3, and 85.3% with parent crystal structures 1dyh, 1gpr, 2cbp, and 2gst from the Protein Data Bank,6 respectively. Several related structures were available for targets 1 and 3, and these structures were used to provide distance restraints for a “core structure” of distances predicted to be conserved.
Target 9 was predicted from just Cα atoms as a test using the 1cbp parent structure and from all atoms in the 2cbp crystal structure. Target 17 was predicted in complex with the substrate glutathione.

Differences between predicted and target structures will include both errors in the potentials and modeling procedures and errors in the protein crystal structures. A prediction cannot be more accurate than average crystallographic precision, where recent comparisons of different crystal structures of identical proteins have found a mean value of 0.40 Å for the root-mean-square deviation (RMSD) of Cα atoms7 and a range of 0.16–0.79 Å for main chain atoms.8

*Correspondence to: Dr. Irene T. Weber, Department of Microbiology and Immunology, Kimmel Cancer Center, Jefferson Medical College, Philadelphia, PA 19107.
Received 1 May 1997; Accepted 25 August 1997

TABLE I. Summary of Predictions: RMS Differences for Conserved Regions (Å)*

Target 1†     A       B       C
Submission    CM123   CM124   CM122
Cα            1.49    1.54    1.49
All atoms     2.30    2.33    2.22

Target 3†     A       B       C       D
Submission    CM144   CM145   CM146   CM176
Cα            1.11    1.13    1.09    1.10
All atoms     1.77    1.91    1.82    1.90

Target 9‡     A       B
Submission    CM325   CM322
Cα            1.24    2.47
All atoms     1.87    4.07

Target 17§    A       B
Submission    CM519   CM525
Cα            0.56    0.52
All atoms     1.03    0.98
Ligand        0.59

*Conserved regions were defined by using a cutoff of 3.0 Å for target 1, 2.5 Å for targets 3 and 9, and no cutoff for target 17. Analysis of all atoms is presented in ref. 14.
†A, B, C, D test different solvent corrections for targets 1 and 3. Prediction A has no solvent correction, B used an increased van der Waals radius for polar side chains, C used a single-shell discrete water model for polar residues, and D used a two-shell discrete water model for polar residues.
‡For target 9, A is the prediction from all atoms (2cbp) and B is from Cα atoms only (1cbp).
§For target 17, A and B are two subunits of the predicted dimer.

TABLE II. Comparison of Predicted, Parent, and Target Structures

Crystal structures*                                RMSD Å (%Cα)†
Target  #Res  Parent  %ID   #Ins  Parent/target‡  Parent/predict§  Predict/target¶
1       162   1dyh    36.4  4     1.25 (82.3)     0.91 (95.7)      1.49 (84.8)
3       154   1gpr    46.7  2–3   1.07 (82.5)     0.67 (98.7)      1.11 (85.1)
9       108   2cbp    33.3  5–7   0.93 (81.4)     0.74 (78.7)      1.24 (79.3)
17      217   2gst    85.3  0     0.46 (100)      0.40 (100)       0.56 (100)

*#Res is the number of residues in the target protein. The PDB entry for the parent crystal structure is given with the percentage of identical residues (%ID) compared to the target sequence and the number of insertions or deletions (#Ins). Structures 1dyh, 1dhf, 1dr1, 3dfr, and 5dfr were used for sequence alignment and distance restraints during minimization of target 1, and 1gpr, 1gla, and 1f3g were used for target 3.
†RMSD is the root-mean-square deviation for the percentage of Cα atoms in parentheses using cutoff values as in Table I.
‡Parent/target compares the parent and target crystal structures.
§Parent/predict compares the parent and predicted structures.
¶Predict/target compares the predicted and target crystal structures.

The errors in the predicted structure will not have a normal distribution. This nonnormality will be especially pronounced when a robust optimizer is used. Where the distance terms correctly describe the geometry, the errors should be normal, with a variance reflecting the average error in the restraints (this follows from the central limit theorem of statistics applied to an expectation value). However, where the distance restraints and potentials are ill posed, due to an alignment error or where insufficient structural data are available at a loop, the results will depend on the specific error, resulting in a nonnormal distribution.
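The RMSD values in Tables I and II are reported over the Cα atoms that fall within a distance cutoff, together with the percentage of atoms retained, so that a few badly placed regions do not dominate the statistic. A minimal sketch of such a calculation follows. It assumes the two structures are already superimposed and that equivalent atoms are listed in the same order; the function name `rmsd_within_cutoff` is hypothetical, and the exact superposition and selection procedure used for the CASP2 evaluation is not specified here.

```python
import math


def rmsd_within_cutoff(coords_a, coords_b, cutoff=None):
    """Return (rmsd, percent_kept) for paired, pre-superimposed atoms.

    coords_a, coords_b: equal-length lists of (x, y, z) tuples in Å.
    cutoff: keep only atom pairs whose deviation is <= cutoff (Å);
            None keeps every pair (as for target 17 in Table I).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty, equal-length coordinate lists")

    # Per-atom deviation between equivalent atoms.
    deviations = [math.dist(a, b) for a, b in zip(coords_a, coords_b)]
    kept = [d for d in deviations if cutoff is None or d <= cutoff]
    if not kept:
        raise ValueError("no atom pairs fall within the cutoff")

    rmsd = math.sqrt(sum(d * d for d in kept) / len(kept))
    percent_kept = 100.0 * len(kept) / len(deviations)
    return rmsd, percent_kept


# Toy example: three atoms deviate by 1 Å, one loop atom by 10 Å.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
target = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0), (3.0, 0.0, 10.0)]
core_rmsd, core_percent = rmsd_within_cutoff(predicted, target, cutoff=3.0)
```

Reporting the retained percentage next to the RMSD, as Table II does, matters because a low RMSD over a small fraction of atoms is a weaker result than the same RMSD over most of the chain.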
When the distance terms are only locally valid, as in loops, the local superposition of structures will have low variance, but the global superposition will be poor. Therefore, the total distribution of errors will have at least three components: the normally distributed error in good regions, errors distributed around a different mean in locally good regions, and systematically incorrect regions.

METHODS

The protein sequences were aligned by using the sequence analysis package of the Genetics Computer Group. Multiple sequence alignments were made for targets 1 and 3. The initial alignment was adjusted manually to position insertions and deletions at surface turns between elements of secondary structure. The parent structure was chosen to have the highest identity and the fewest gaps relative to the target sequence. The initial model for the target structure was obtained by combining the sequence with the parent structure as described previously.2 All atoms of identical residues and identical atoms in different residues were kept. Solvent and common ligands from the parent structure were included because these are often structurally important.

Distance Restraints

Interatomic distances between 3 and 8 Å were used as restraints for the atoms expected to be identical in parent and target structures, as in Weber and coworkers.9 Distance restraints were implemented with a split harmonic potential similar to that used for nuclear Overhauser effect data. The upper and lower bounds were determined from the observed range of distances in multiple crystal structures or set to the distance ±0.5 Å. Peptide geometry was restrained toward trans with a 3.148 Å restraint on the O–HN distance. For multiple crystal structures, an intersection set of distance restraints was generated by using interatomic distances found in all structures. A union set of all distances was generated and used with a lower weight.
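The restraint scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the AMMP implementation: the penalty is assumed to be zero between the bounds and harmonic outside them, the force constants are placeholders, and the function names (`split_harmonic`, `restraint_bounds`, `restraint_sets`) are hypothetical. The intersection and union sets are interpreted here as the atom pairs whose distance lies in the 3–8 Å window in all structures or in any structure, respectively.

```python
def split_harmonic(distance, lower, upper, k_lower=1.0, k_upper=1.0):
    """Restraint energy that is zero inside [lower, upper] and harmonic
    outside, with separate (assumed) force constants on each side."""
    if distance < lower:
        return k_lower * (distance - lower) ** 2
    if distance > upper:
        return k_upper * (distance - upper) ** 2
    return 0.0


def restraint_bounds(observed, tolerance=0.5):
    """Bounds for one atom pair from its distances in the parent structures.

    With several structures, use the observed range; with a single
    structure, use the observed distance +/- tolerance (0.5 Å in the text).
    """
    if len(observed) == 1:
        d = observed[0]
        return d - tolerance, d + tolerance
    return min(observed), max(observed)


def restraint_sets(structures, d_min=3.0, d_max=8.0):
    """structures: list of dicts mapping an atom-pair key to a distance (Å).

    Returns (intersection, union): pairs in the distance window in every
    structure, and pairs in the window in at least one structure.
    """
    in_window = [
        {pair for pair, d in s.items() if d_min <= d <= d_max}
        for s in structures
    ]
    return set.intersection(*in_window), set.union(*in_window)


# Toy example with two parent structures and two atom pairs.
parents = [
    {("A10:CA", "A20:CA"): 4.0, ("A10:CA", "A30:CA"): 7.5},
    {("A10:CA", "A20:CA"): 4.2, ("A10:CA", "A30:CA"): 8.4},
]
core_pairs, all_pairs = restraint_sets(parents)
lo, hi = restraint_bounds([4.0, 4.2])
energy = split_harmonic(5.2, lo, hi)  # penalized: 5.2 Å is above the upper bound
```

In this reading, the intersection set carries the restraints most likely to be conserved in the target, which is why the text applies the union set only with a lower weight.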
Solvent Corrections

Models include all solvent molecules available for the parent crystal structure. Three different solvent corrections were tested for prediction of targets 1 and 3: an increased van der Waals radius for polar side chains, and discrete water molecules in one or two shells around polar amino acids. Since water is present in high molar excess to any individual amino acid in the protein, the ability of a charged amino acid to make salt bridges is reduced by competition with bound water. The van der Waals radius for hydrogen bond acceptors and donors was increased by 3 Å to reflect the average presence of one water. Discrete models were also generated by attaching one shell and two shells of water to individual hydrogen bond donors and acceptors in amino acid side chains. These models for solvation improved local geometry and precluded spurious buried residues in test systems, but had little effect on overall error.

Minimization

The new atoms and all hydrogen atoms were built using AMMP10 minimization, while keeping the identical atoms fixed, as described.2 AMMP uses all atoms and all nonbond and electrostatic terms, and an improved potential set.5 The entire structure was minimized with distance restraints for targets 1, 3, and 9. Distance restraints for multiple crystal structures were applied for targets 1 and 3. Manual adjustment was used to place residues F21 and Y22 internally in one insertion of target 9. No distance restraints were used for target 17. The robust optimization technique of four-dimensional (4D) embedding was used in order to fully test the effects of distance restraints. If a local technique, like conjugate gradients, were used, then it would be impossible to differentiate between errors due to improper convergence and errors due to improper distance terms.
4D embedding was described for CASP.2 It was first proposed for solving nuclear magnetic resonance (NMR) distance restraint problems,11 and the implementation in AMMP readily solves these problems.

RESULTS AND DISCUSSIONS

The predictions for targets 1, 3, 9, and 17 are listed in Tables I and II. New methods were tested by submitting several predictions for the same target. The accuracy of minimization was tested for target 17. Solvent corrections were tested for targets 1 and 3. Prediction starting from just Cα atoms was tested for target 9. Several related crystal structures were used in multiple sequence alignment to aid in positioning gaps, and to provide interatomic distance restraints for predictions of targets 1 and 3. The predictions were compared with both the target and parent crystal structures for the overall RMSDs.

Target 17 was very similar to the parent 2gst with 85.3% sequence identity and no insertions or deletions. The predictions had RMSDs from target 17 of 0.52 and 0.56 Å on Cα atoms. Targets 1, 3, and 9 had 36.4, 46.7, and 33.3% identity with the parent sequences, and the predictions resulted in RMSDs for 79–85% of Cα atoms of 1.49, 1.11, and 1.24 Å, respectively (Table II). Interestingly, although the parents and targets are related by 33–47% sequence identity, the pairs of related crystal structures all have 81–83% structurally conserved residues with RMS differences of 0.93–1.25 Å. The structures are more highly conserved than the sequences, with differences in the observed range for the sequence identity.7 However, target 9 is the most similar structurally to the parent 2cbp, although the sequence identity of 33% is the lowest. The tested procedures were evaluated for their success in reproducing the target structures in both conserved and variable regions.

Effect of Minimization

The prediction for target 17 was the most straightforward, since the target and parent sequences had a high sequence identity of 85% with no gaps.
The prediction tested the standard method with all atoms, solvent, and ligand from the parent structure2 and improved potentials from ref. 5. Because it was not known if inhibitor was bound to target 17, the glutathione substrate was modeled in the binding site, as described for nonprotein ligands.2 The two subunits in the dimer were entered as submissions CM519 and CM525. This prediction showed the accuracy of the procedure when the target is almost identical to the parent structure, with an RMSD of 0.46 Å for Cα atoms (Table II). The RMS differences between predicted and target structures were 0.52 and 0.56 Å for Cα atoms, 0.57 and 0.60 Å for main chain atoms, and 0.98 and 1.03 Å for all atoms, for the two subunits, respectively. These differences are within the experimental errors observed for different crystal structures of identical proteins, where recent comparisons have found a mean value of 0.40 Å for the RMSD of Cα atoms7 and a range of 0.16–0.79 Å for main-chain atoms.8 The deviations of side chain atoms were more variable; RMSDs of 1.32–1.68 Å were reported for different crystal forms of bovine pancreatic trypsin inhibitor.12 The common atoms of the ligand had an RMS difference of 0.59 Å and the ligand binding site had differences of 0.44 Å for Cα atoms and 0.76 Å for all atoms, which suggests that ligands and their binding sites can be predicted with relatively high accuracy, as previously noted.2,4 The close agreement between prediction and target, and success in the docking predictions,13 verify the accuracy of the potentials and minimization procedure in reproducing closely similar protein structures.

Fig. 1. Structural alignment of C-terminal sequences of target 3, prediction 3, parent 1gpr, 1f3g, and 1gla structures. Residues 142–154 of target 3 are shown. Asterisks indicate gaps.

Fig. 2. Comparison of prediction (green), 1gpr parent (yellow), and target 3 (red) in region of termini. The Cα atoms of residues 11–16, 139–145, and 153–159 of target 3 are shown with corresponding regions of 1gpr and the prediction. The predicted structure has moved away from the parent and toward the target structure for these 3 adjacent strands.

Effects of Solvent Corrections

Solvent corrections were tested for targets 1 and 3 to ensure that polar side chains were directed to the surface of the protein. Inappropriate placement of polar side chains within the protein was observed in the previous CASP experiment. The models for solvation improved local geometry and precluded spurious buried residues in test systems, but had little effect on overall error. Three predictions were submitted to test two solvent corrections for target 1. CM123 had no correction, CM124 used an increased van der Waals radius for polar side chains, and CM122 used a single-shell discrete water model for polar residues. Four predictions were submitted for target 3 to test three different solvent corrections. CM144 had no correction, CM145 used an increased van der Waals radius for polar side chains, CM146 used a single-shell discrete water model for polar residues, and CM176 used a two-shell discrete water model. For both targets, the predictions were very similar to each other for Cα and all atoms, and there was no significant improvement in the agreement with the target structure (Table I).

Prediction From Only Cα Atoms

Target 9 had 33% sequence identity with a single parent represented initially by Cα atoms only (1cbp) and later all atoms (2cbp). Predictions were made from both 1cbp and 2cbp to test the effects of starting from Cα atoms only.
Not unexpectedly, prediction from Cα atoms was significantly worse at 2.5 Å RMSD from the target Cα atoms, compared to 1.2 Å for the prediction using all atoms (Table I). The all-atom RMSD likewise increased from 1.9 Å for the prediction from all atoms to 4.1 Å for the prediction from Cα atoms only.

Errors in Sequence Alignment

Several structures were superimposed to give the initial sequence alignment for targets 1 and 3. The insertions and deletions in the sequence alignment for target 1 were correctly positioned by using structures 1dyh, 1dhf, 1dr1, 3dfr, and 5dfr, except for a misplacement of the C terminus. The predicted two-residue deletion was actually a longer seven-residue deletion between the target and parent, which was not obvious from the aligned sequences. Similarly for target 3, the superposition of the structures of 1gpr, 1f3g, and 1gla allowed correct positioning of the gaps, except near the C terminus. The predicted C terminus was mispositioned by one residue compared to the correct structure. The correct position cannot easily be deduced from the multiple alignment (Fig. 1). Target 9 was predicted from one parent structure, and there were errors in the predicted positions of gaps. One stretch of 14 residues was misplaced by one residue due to the presence of only two identical residues in both the predicted and actual alignments. These errors would probably not have occurred if several related structures were used. Therefore, the superposition of several related structures helped to position insertions and deletions, but did not always result in the correct alignment.

TABLE III. Analysis of Insertions, Deletions, and Variable Regions*

                                          RMS differences (Å)‡
Target  Region type†      Residues    Overall       Local
1       Variable          16–18       5.00          0.19
1       Insert 24         22–24       3.06          0.48
1       Variable          47–50       2.36          0.11
1       Variable          54–58       4.58          1.14
1       Insert 73–74      67–75       5.26          2.98
1       Insert 92–93      90–93       3.85          0.17
3       Variable          1–9         6.96          3.47
3       Delete 132/133    131–138     4.58          1.94
3       Delete 144/145                Not flagged
3       Delete 145/146                Not flagged
3       Variable          152–154     4.33          0.05
9       Insert 20–24      15–24       3.48          3.47
9       Insert 64, 66     63–71       6.99          3.45

*Regions of dissimilar structure that were flagged in the automatic comparisons for CASP2.
†Insertions (Insert), deletions (Delete), and Variable regions are indicated.
‡RMSDs on Cα atoms are noted for the superposition of the whole structure and for a local superposition of the region.

Fig. 3. Difference distance plot for target 3. The difference is the distance between pairs of Cα atoms in the prediction and target minus the distance in parent and target. This difference is plotted against the distance between the parent and target for the same atoms. Negative values indicate that the prediction is closer to the target than is the parent. Three effects are seen in this plot. The normally distributed errors in the structurally conserved core are clustered around 1 Å. Many of the large distances between the parent and the target are shortened, but a streak of small differences (around −0.5) extending to 12 Å along the x-axis reflects the effect of incorrect distance restraints. This “streak” was absent from the equivalent plot for target 9, where only a single parent structure was available.

Comparison of Prediction, Parent, and Target Structures

One measure of success is whether the prediction has reduced the conformational distance between parent and target crystal structures. The predictions without solvent correction were analyzed in comparison with parent and target structures (Table II). Overall the predicted structures were close to the parent structures with RMSD of 0.40–0.91 Å. However, targets 1 and 3 included 3% more atoms in the superposition with the predicted structure than with the parent structure. This increase suggested that the overall agreement between predicted and target structures was improved.
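One way to check whether a prediction has reduced the distance between parent and target, following the definition given for the difference distance plot (Fig. 3), is sketched below. It assumes that the predicted and parent structures have each been superimposed on the target and that equivalent Cα atoms appear in the same order in all three lists; the function name `difference_distance` is hypothetical, and the per-atom reading of the caption is an interpretation.

```python
import math


def difference_distance(predicted, parent, target):
    """Return one (x, y) point per equivalent Cα atom.

    x = distance between parent and target atoms (Å)
    y = (prediction-target distance) - (parent-target distance) (Å)
    Negative y means the prediction is closer to the target than the parent.
    """
    if not (len(predicted) == len(parent) == len(target)):
        raise ValueError("all three coordinate lists must have equal length")

    points = []
    for p, q, t in zip(predicted, parent, target):
        d_parent = math.dist(q, t)   # how far the parent is from the target
        d_pred = math.dist(p, t)     # how far the prediction is from the target
        points.append((d_parent, d_pred - d_parent))
    return points


# Toy example: one atom improved by 4 Å, one core atom essentially unchanged.
target = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
parent = [(6.0, 0.0, 0.0), (5.0, 0.5, 0.0)]
predicted = [(2.0, 0.0, 0.0), (5.0, 0.6, 0.0)]
points = difference_distance(predicted, parent, target)
improved = [pt for pt in points if pt[1] < 0.0]
```

Plotting y against x separates the conserved core (small x, y near zero) from the regions where the parent was far from the target, which is where any real improvement has to show up.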
For target 3, concerted changes in position of the two termini and strands 139–145 showed that the prediction had moved closer to the target in regions of larger discrepancies (Fig. 2). Therefore, the distances between pairs of Cα atoms were plotted to show if the prediction was closer than the parent to the target structure. The differences in separation of Cα atoms for prediction–target and target–parent were plotted against the target–parent differences for the same atoms (Fig. 3). The negative values clearly confirm that the prediction is closer to target 3 than is the parent. Improvement is seen especially for atoms with differences of 4–8 Å between parent and target crystal structures. These atoms were predicted up to 5 Å closer to the target, which is a significant improvement over the parent structure. This success was partly due to the use of distance restraints from multiple structures. Similar improvements were observed in the difference plot for prediction of target 9, but these must arise from minimization because only a single parent structure was used. By contrast, the prediction for target 1 showed no improvement over the parent structure, despite the use of distance restraints from five crystal structures.

Analysis of Loops

The regions with larger differences between prediction and target were analyzed by the automated procedure for CASP2, which performed an overall superposition and a local superposition of residues in “loops.” Results for targets 1, 3, and 9 are summarized in Table III. In the prediction for target 9, a five-residue insertion (20–24) was deduced to be a helix, and distance restraints were used to enforce helical conformation. Unfortunately, the predicted helix was left-handed because the distance restraints did not distinguish the handedness (the residues were correctly L-amino acids).
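The left-handed helix follows from a general property: interatomic distances are unchanged by mirror reflection, so a set of distance restraints alone is satisfied equally well by a right-handed and a left-handed helix. The sketch below illustrates this with an idealized Cα trace. The helix parameters (2.3 Å radius, 1.5 Å rise, 100° per residue) are textbook approximations chosen for illustration, and all function names are hypothetical.

```python
import math


def helix(n, handedness=1, radius=2.3, rise=1.5, turn_deg=100.0):
    """Idealized Cα trace; handedness=+1 is right-handed, -1 left-handed."""
    coords = []
    for i in range(n):
        angle = handedness * math.radians(turn_deg * i)
        coords.append((radius * math.cos(angle), radius * math.sin(angle), rise * i))
    return coords


def distance_matrix(coords):
    """All pairwise distances, the only quantity a distance restraint sees."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def signed_volume(a, b, c, d):
    """Triple product (b-a)·((c-a)×(d-a)); its sign flips under reflection."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    w = [d[i] - a[i] for i in range(3)]
    cross = (
        v[1] * w[2] - v[2] * w[1],
        v[2] * w[0] - v[0] * w[2],
        v[0] * w[1] - v[1] * w[0],
    )
    return sum(u[i] * cross[i] for i in range(3))


right = helix(5, handedness=1)
left = helix(5, handedness=-1)

# Every pairwise distance agrees, so distance restraints cannot tell them apart.
same_distances = all(
    math.isclose(dr, dl, abs_tol=1e-9)
    for row_r, row_l in zip(distance_matrix(right), distance_matrix(left))
    for dr, dl in zip(row_r, row_l)
)

# A chirality-sensitive term, such as a signed volume, does distinguish them.
vol_right = signed_volume(*right[:4])
vol_left = signed_volume(*left[:4])
```

A restraint that depends on a signed quantity, such as this volume or a torsion angle, is one way to fix the handedness that distances leave undetermined.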
For targets 1 and 3, the low RMSDs of 0.05–0.48 Å for the local superpositions of most loops suggested that the correct conformation had been predicted. The residues 144–146 of target 3 that were modeled with two single-residue deletions were not flagged as conformationally different, suggesting successful prediction. In both targets 1 and 3, more than half of the flagged “loop” regions did not involve insertions or deletions. Therefore, differences at gaps were indistinguishable from conformational differences in variable regions.

Effects of Distance Restraints

Two fundamental problems with distance restraints were revealed. First, because they are entirely empirical, distance restraints bring no new information to the problem. Distance restraints improved the prediction when similar distances were present in the target. They did not overcome errors in sequence alignment or ambiguities arising from local variations in protein structures. Second, while distance restraints are good at preserving correct features of the model, they are equally good at preserving incorrect features. Incorrect features result from sequence alignment errors, or when conserved secondary structure elements are shifted in space with respect to each other. Therefore, careful choice of distance restraints is important in regions with low homology. Distance restraints should not be used in regions with little sequence similarity or insertions and deletions. Distance restraints, while useful, are no replacement for a fundamentally better treatment of a solvated charged protein.

CONCLUSIONS

1. The AMMP potentials and minimization procedure result in predictions that agree with the target within the experimental errors observed for protein crystal structures when the sequence identity is high.
2. The positioning of insertions and deletions is improved when several related structures are superimposed to obtain the best alignment with the target sequence. However, multiple sequence alignment does not always result in the correct positioning of the target sequence.
3. The simple solvent corrections that were tested did not significantly improve the predictions.
4. Distance restraints can improve the prediction for structurally conserved regions, but should not be used in regions with little sequence similarity or insertions and deletions.
5. Two predictions have reduced the large deviations between parent and target structures.

REFERENCES

1. Mosimann, S., Meleshko, R., James, M.N.G. A critical assessment of comparative modeling of tertiary structures of proteins. Proteins 23:301–317, 1995.
2. Harrison, R.W., Chatterjee, D., Weber, I.T. Analysis of six protein structures predicted by comparative modeling techniques. Proteins 23:463–471, 1995.
3. Greer, J. Comparative modeling methods: Application to the family of mammalian serine proteases. Proteins 7:317–334, 1990.
4. Weber, I.T. Evaluation of homology modeling of HIV protease. Proteins 7:172–184, 1990.
5. Weber, I.T., Harrison, R.W. Molecular mechanics calculations on HIV-1 protease with peptide substrates correlate with experimental data. Protein Eng. 9:679–690, 1996.
6. Bernstein, F.C., Koetzle, T.F., Williams, G.J.B., et al. The Protein Data Bank: A computer-based archival file for macromolecular structures. J. Mol. Biol. 112:535–542, 1977.
7. Flores, T.P., Orengo, C.A., Moss, D.S., Thornton, J.M. Comparison of conformational characteristics in structurally similar protein pairs. Protein Sci. 2:1811–1826, 1993.
8. Zegers, I., Maes, D., Dao-Thi, M.-H., Poortmans, F., Palmer, R., Wyns, L. The structures of RNase A complexed with 3′CMP and d(CpA): Active site conformation and conserved water molecules. Protein Sci. 3:2322–2339, 1994.
9. Weber, I.T., Harrison, R.W., Iozzo, R.V. Model structure of Decorin and implications for collagen fibrillogenesis. J. Biol. Chem. 271:31767–31770, 1996.
10. Harrison, R.W.
Stiffness and energy conservation in the molecular dynamics: An improved integrator. J. Comp. Chem. 14:1112–1122, 1993.
11. Beutler, T.C., van Gunsteren, W.F. Molecular dynamics free energy calculation in four dimensions. J. Chem. Phys. 101:1417–1422, 1994.
12. Wlodawer, A., Nachman, J., Gilliland, G.L., Gallagher, W., Woodward, C. Structure of form III crystals of bovine pancreatic trypsin inhibitor. J. Mol. Biol. 198:469–480, 1987.
13. Dixon, J.S. Evaluation of the CASP2 docking section. Proteins Suppl. 1:198–204, 1997.
14. Martin, A.C.R., MacArthur, M.W., Thornton, J.M. Assessment of comparative modeling in CASP2. Proteins Suppl. 1:14–28, 1997.