Nucleic Acids Research 2007; doi: 10.1093/nar/gkm1049 [MEDLINE]

STRALCP - Structure Alignment-based Clustering of Proteins

Adam Zemla (1), Brian Geisbrecht (2), Jason Smith (1), Marisa Lam (1), Bonnie Kirkpatrick (1), Mark Wagner (1), Tom Slezak (1) and Carol Ecale Zhou (1)

(1) Computing Applications and Research, Lawrence Livermore National Laboratory, Livermore, CA 94550 USA and (2) Division of Cell Biology and Biophysics, University of Missouri-Kansas City, Kansas City, MO 64110 USA

 

ABSTRACT

Protein structural annotation and classification is an important and challenging problem in bioinformatics. Research towards analysis of sequence–structure correspondences is critical for better understanding of a protein's structure, function, and its interaction with other molecules. Clustering of protein domains based on their structural similarities provides valuable information for protein classification schemes. In this article, we attempt to determine whether structure information alone is sufficient to adequately classify protein structures. We present an algorithm that identifies regions of structural similarity within a given set of protein structures, and uses those regions for clustering. In our approach, called STRALCP (STRucture ALignment-based Clustering of Proteins), we generate detailed information about global and local similarities between pairs of protein structures, identify fragments (spans) that are structurally conserved among proteins, and use these spans to group the structures accordingly. We also provide a web server at http://as2ts.llnl.gov/AS2TS/STRALCP/ for selecting protein structures, calculating structurally conserved regions and performing automated clustering.

 

Benchmark results
(supplemental results posted on the STRALCP web page)

 

 

In our benchmark test, a comparison with the SCOP classification, we performed STRALCP calculations for 4,620 SCOP domains classified within 25 different folds (see Table 1). The clustering method detects relationships at the family level in good agreement with the manually maintained SCOP database: the misclustering score MC defined below is 0.00 for 11 of the 25 folds and 2.96 over all 25 folds.

To evaluate the accuracy of our clustering approach, we estimated the differences between the SCOP (ver. 1.71) and STRALCP clusterings (for example, at the level of SCOP families) by introducing the following measure. Let:

·  Nc be the number of created clusters,

·  Cf(i) be the number of different families clustered together within the i-th cluster.

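The score MC, which indicates the misclustering effect (domains from different SCOP families grouped together), can be calculated from the formula:

$$\mathrm{MC} = 100 \left( 1 - \frac{1}{N_c} \sum_{i=1}^{N_c} \frac{1}{C_f(i)} \right)$$

The range of this measure is 0.0 <= MC < 100.0, where 0.0 indicates no misclustering (i.e., every cluster contains domains from a single SCOP family, in agreement with the SCOP family separation).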
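The measure can be sketched in a few lines of Python. The sketch assumes that MC is 100 times the mean over clusters of 1 - 1/Cf(i), a form that matches the stated range and the values reported in Table 1. The function names and the toy domain, family, and cluster labels are hypothetical and serve only as illustration.

```python
from collections import defaultdict


def families_per_cluster(cluster_of, family_of):
    """Return Cf(i): the number of distinct families found in each cluster."""
    members = defaultdict(set)
    for domain, cluster in cluster_of.items():
        members[cluster].add(family_of[domain])
    return [len(families) for families in members.values()]


def misclustering(cf):
    """Assumed form of the measure: MC = 100 * (1 - mean(1 / Cf(i)))."""
    if not cf:
        raise ValueError("at least one cluster is required")
    return 100.0 * (1.0 - sum(1.0 / c for c in cf) / len(cf))


# Hypothetical toy data: cluster "c1" is pure, cluster "c2" merges two families.
family_of = {"d1": "famA", "d2": "famA", "d3": "famB", "d4": "famC"}
cluster_of = {"d1": "c1", "d2": "c1", "d3": "c2", "d4": "c2"}

cf = families_per_cluster(cluster_of, family_of)
mc = misclustering(cf)
print(sorted(cf), mc)
```

Under this assumed form, a clustering of 26 clusters in which exactly one cluster merges two families scores 100 * 0.5 / 26, or about 1.92. The score depends only on how many clusters are mixed and how many families each mixes, not on how many domains those clusters contain.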

SCOP Fold           Nd     Ns     Nf     Nc       MC
Fold a.5            79      9     13     28     7.74
Fold a.7           126     13     14     26     1.92
Fold a.8            65      8     11     14     4.76
Fold a.24          361     24     32     48     3.12
Fold a.29           79      6      7     12     4.17
Fold a.137          65     10     10     10     0.00
Fold b.2           293     10     23     41     0.00
Fold b.42          291      8     13     17    17.65
Fold b.43          225      4      8     17     2.94
Fold b.68          209     11     13     19     2.63
Fold b.80          122      7     17     20     0.00
Fold b.85          210      7      9     12     0.00
Fold c.8           302      9     11     13     0.00
Fold c.51          114      4      7      7     0.00
Fold c.56          452      6     12     20     5.00
Fold d.52           61      8      9     19     0.00
Fold d.68          104      7      9     17     2.94
Fold d.79          256      7     10     16     0.00
Fold d.110         153      7     17     21     2.38
Fold d.129         218      9     15     21     0.00
Fold f.1            46      5      5     17     0.00
Fold f.4           116      6     11     23     0.00
Fold f.23          295     30     30     37    12.16
Fold g.41          332     14     23     37     1.35
Fold h.4            46     14     14     19     5.26
Total in Folds    4620    243    343    531     2.96

 

Table 1. Results from the evaluation of the differences between SCOP (ver. 1.71) and STRALCP clusters at the level of SCOP families. Nd is the number of domains within the fold, Ns the number of superfamilies, Nf the number of families, Nc the number of created clusters, and MC the misclustering effect calculated by the formula introduced above.

 

The MC measure allows different clustering schemes to be compared by how well they agree in separating proteins into different clusters. The goal of this measure is not to count how many domains are clustered differently, but to quantify how many of the created clusters are compromised, i.e., merge proteins that are separated in the other clustering scheme. Table 1 shows the results of applying the MC formula to the clusters calculated for a selected set of 25 SCOP folds. In selecting these folds, we required that each consist of multiple superfamilies (at least 4). In total, this benchmark test analyzed Nd=4,620 domains from Nf=343 SCOP families and Ns=243 superfamilies.
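The totals quoted above can be rechecked from the per-fold rows of Table 1. The sketch below sums the count columns, checks the superfamily selection criterion, and averages MC over the folds. Reading the "Total in Folds" MC as an unweighted mean over the 25 folds is an assumption, although it reproduces the reported 2.96.

```python
# Rows copied from Table 1: (fold, Nd, Ns, Nf, Nc, MC)
rows = [
    ("a.5", 79, 9, 13, 28, 7.74), ("a.7", 126, 13, 14, 26, 1.92),
    ("a.8", 65, 8, 11, 14, 4.76), ("a.24", 361, 24, 32, 48, 3.12),
    ("a.29", 79, 6, 7, 12, 4.17), ("a.137", 65, 10, 10, 10, 0.00),
    ("b.2", 293, 10, 23, 41, 0.00), ("b.42", 291, 8, 13, 17, 17.65),
    ("b.43", 225, 4, 8, 17, 2.94), ("b.68", 209, 11, 13, 19, 2.63),
    ("b.80", 122, 7, 17, 20, 0.00), ("b.85", 210, 7, 9, 12, 0.00),
    ("c.8", 302, 9, 11, 13, 0.00), ("c.51", 114, 4, 7, 7, 0.00),
    ("c.56", 452, 6, 12, 20, 5.00), ("d.52", 61, 8, 9, 19, 0.00),
    ("d.68", 104, 7, 9, 17, 2.94), ("d.79", 256, 7, 10, 16, 0.00),
    ("d.110", 153, 7, 17, 21, 2.38), ("d.129", 218, 9, 15, 21, 0.00),
    ("f.1", 46, 5, 5, 17, 0.00), ("f.4", 116, 6, 11, 23, 0.00),
    ("f.23", 295, 30, 30, 37, 12.16), ("g.41", 332, 14, 23, 37, 1.35),
    ("h.4", 46, 14, 14, 19, 5.26),
]

# Column sums for Nd, Ns, Nf and Nc.
totals = [sum(row[k] for row in rows) for k in range(1, 5)]

# Selection criterion: every fold has at least 4 superfamilies.
min_ns = min(row[2] for row in rows)

# Assumed reading of the Total row: unweighted mean of the per-fold MC values.
mean_mc = round(sum(row[5] for row in rows) / len(rows), 2)

print(len(rows), totals, min_ns, mean_mc)
```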


 

Results from STRALCP analysis of selected Folds from SCOP ver. 1.71

Fold: a.5      List      LGA_summary      Clustering
Fold: a.7      List      LGA_summary      Clustering
Fold: a.8      List      LGA_summary      Clustering
Fold: a.24     List      LGA_summary      Clustering
Fold: a.29     List      LGA_summary      Clustering
Fold: a.137    List      LGA_summary      Clustering
Fold: b.2      List      LGA_summary      Clustering
Fold: b.42     List      LGA_summary      Clustering
Fold: b.43     List      LGA_summary      Clustering
Fold: b.68     List      LGA_summary      Clustering
Fold: b.80     List      LGA_summary      Clustering
Fold: b.85     List      LGA_summary      Clustering
Fold: c.8      List      LGA_summary      Clustering
Fold: c.51     List      LGA_summary      Clustering
Fold: c.56     List      LGA_summary      Clustering
Fold: d.52     List      LGA_summary      Clustering
Fold: d.68     List      LGA_summary      Clustering
Fold: d.79     List      LGA_summary      Clustering
Fold: d.110    List      LGA_summary      Clustering
Fold: d.129    List      LGA_summary      Clustering
Fold: f.1      List      LGA_summary      Clustering
Fold: f.4      List      LGA_summary      Clustering
Fold: f.23     List      LGA_summary      Clustering
Fold: g.41     List      LGA_summary      Clustering
Fold: h.4      List      LGA_summary      Clustering