STRALCP - Structure Alignment-based Clustering of Proteins
Adam Zemla (1), Brian Geisbrecht (2), Jason Smith (1), Marisa Lam (1), Bonnie Kirkpatrick (1), Mark Wagner (1), Tom Slezak (1) and Carol Ecale Zhou (1)
(1) Computing Applications and Research, Lawrence Livermore National Laboratory, Livermore, CA 94550 USA and (2) Division of Cell Biology and Biophysics, University of Missouri-Kansas City, Kansas City, MO 64110 USA
Protein structural annotation and classification is an important and challenging problem in bioinformatics. Research towards analysis of sequence–structure correspondences is critical for better understanding of a protein's structure, function, and its interaction with other molecules. Clustering of protein domains based on their structural similarities provides valuable information for protein classification schemes. In this article, we attempt to determine whether structure information alone is sufficient to adequately classify protein structures. We present an algorithm that identifies regions of structural similarity within a given set of protein structures, and uses those regions for clustering. In our approach, called STRALCP (STRucture ALignment-based Clustering of Proteins), we generate detailed information about global and local similarities between pairs of protein structures, identify fragments (spans) that are structurally conserved among proteins, and use these spans to group the structures accordingly. We also provide a web server at http://as2ts.llnl.gov/AS2TS/STRALCP/ for selecting protein structures, calculating structurally conserved regions and performing automated clustering.
Benchmark results (supplemental results posted on STRALCP web page)
In our benchmark test, a comparison with the SCOP classification, we performed STRALCP calculations for 4,620 SCOP domains classified within 25 different folds (see Table 1). Our clustering method is robust in that it detects relationships at the family level in good agreement with the manually maintained SCOP database.
To evaluate the accuracy of our clustering approach, we estimated the differences between SCOP (ver. 1.71) and STRALCP clustering (for example, at the level of SCOP families) by introducing the following measure. Let:
· Nc be the number of created clusters,
· Cf(i) be the number of different families clustered together within the i-th cluster.
The score indicating the misclustering effect MC (when domains from different SCOP families are grouped together) can be calculated from the formula:
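The expression is not reproduced in this text. The form below is a reconstruction from the two definitions above and should be read as an assumption, not as the authors' exact notation. It was chosen because it agrees with the MC values listed in Table 1:

```latex
MC = \frac{100}{N_c} \sum_{i=1}^{N_c} \frac{C_f(i) - 1}{C_f(i)}
```

Under this form, a cluster holding a single SCOP family contributes 0, a cluster merging two families contributes 1/2, one merging three families contributes 2/3, and so on. MC is therefore 0 when no cluster mixes families. As a consistency check, the value 17.65 reported for fold b.42 with Nc = 17 corresponds to a summed contribution of 3.0, since 100 × 3.0 / 17 ≈ 17.65.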
| SCOP Fold | Nd | Ns | Nf | Nc | MC |
|---|---|---|---|---|---|
| Fold a.5 | 79 | 9 | 13 | 28 | 7.74 |
| Fold a.7 | 126 | 13 | 14 | 26 | 1.92 |
| Fold a.8 | 65 | 8 | 11 | 14 | 4.76 |
| Fold a.24 | 361 | 24 | 32 | 48 | 3.12 |
| Fold a.29 | 79 | 6 | 7 | 12 | 4.17 |
| Fold a.137 | 65 | 10 | 10 | 10 | 0.00 |
| Fold b.2 | 293 | 10 | 23 | 41 | 0.00 |
| Fold b.42 | 291 | 8 | 13 | 17 | 17.65 |
| Fold b.43 | 225 | 4 | 8 | 17 | 2.94 |
| Fold b.68 | 209 | 11 | 13 | 19 | 2.63 |
| Fold b.80 | 122 | 7 | 17 | 20 | 0.00 |
| Fold b.85 | 210 | 7 | 9 | 12 | 0.00 |
| Fold c.8 | 302 | 9 | 11 | 13 | 0.00 |
| Fold c.51 | 114 | 4 | 7 | 7 | 0.00 |
| Fold c.56 | 452 | 6 | 12 | 20 | 5.00 |
| Fold d.52 | 61 | 8 | 9 | 19 | 0.00 |
| Fold d.68 | 104 | 7 | 9 | 17 | 2.94 |
| Fold d.79 | 256 | 7 | 10 | 16 | 0.00 |
| Fold d.110 | 153 | 7 | 17 | 21 | 2.38 |
| Fold d.129 | 218 | 9 | 15 | 21 | 0.00 |
| Fold f.1 | 46 | 5 | 5 | 17 | 0.00 |
| Fold f.4 | 116 | 6 | 11 | 23 | 0.00 |
| Fold f.23 | 295 | 30 | 30 | 37 | 12.16 |
| Fold g.41 | 332 | 14 | 23 | 37 | 1.35 |
| Fold h.4 | 46 | 14 | 14 | 19 | 5.26 |
| Total in Folds | 4620 | 243 | 343 | 531 | 2.96 |
Table 1. Results from the evaluation of the differences between SCOP (ver. 1.71) and STRALCP clusters at the level of SCOP families. Nd: number of domains within the fold; Ns: number of superfamilies; Nf: number of families; Nc: number of created clusters; MC: misclustering effect calculated with the MC formula.
The MC measure allows different clustering schemes to be compared by how well they agree in separating proteins into different clusters. The goal of this measure is not to count how many domains are clustered differently, but to quantify how many of the created clusters are compromised, that is, merge proteins that are kept separate in the other clustering scheme. Table 1 shows the results of applying the MC formula to the clusters calculated for a selected set of 25 SCOP folds. Each selected fold was required to contain multiple superfamilies (at least 4). In total, this benchmark test analyzed Nd=4,620 domains from Nf=343 SCOP families and Ns=243 superfamilies.
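As an illustration of how such a per-cluster score behaves, the following minimal sketch computes MC from a list of Cf(i) values. It assumes MC = (100 / Nc) · Σ (Cf(i) − 1) / Cf(i), a form consistent with the values in Table 1 but not confirmed here as the authors' exact expression. The function name `misclustering_score` and the example inputs are hypothetical and are not part of STRALCP.

```python
from typing import Sequence


def misclustering_score(families_per_cluster: Sequence[int]) -> float:
    """Return MC (in percent) for one clustering.

    families_per_cluster[i] is Cf(i): the number of distinct reference
    families (e.g. SCOP families) found in cluster i.
    Assumed form: MC = 100 / Nc * sum((Cf(i) - 1) / Cf(i)).
    """
    if not families_per_cluster:
        raise ValueError("at least one cluster is required")
    if any(cf < 1 for cf in families_per_cluster):
        raise ValueError("each cluster must contain at least one family")

    nc = len(families_per_cluster)  # Nc: number of created clusters
    # A pure cluster (Cf = 1) adds 0; a cluster mixing k families adds (k - 1) / k.
    penalty = sum((cf - 1) / cf for cf in families_per_cluster)
    return 100.0 * penalty / nc


# Hypothetical toy clustering: two pure clusters, one mixing 2 families,
# one mixing 3 families.
toy = [1, 1, 2, 3]
print(round(misclustering_score(toy), 2))  # 29.17

# Hypothetical configuration with 17 clusters, six of which mix two families.
# It yields the MC value listed for fold b.42, although the actual cluster
# composition for that fold is not given in this text.
print(round(misclustering_score([2] * 6 + [1] * 11), 2))  # 17.65
```

Because each cluster contributes at most a value below 1 regardless of its size, the score reflects how many clusters are compromised and not how many domains are misplaced, which matches the stated goal of the measure.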
Supplemental results for each of the 25 folds listed in Table 1 (List, LGA_summary, Clustering) are posted on the STRALCP web page.