STRALCP - Structure Alignment-based Clustering of Proteins

Adam Zemla¹, Brian Geisbrecht², Jason Smith¹, Marisa Lam¹, Bonnie Kirkpatrick¹, Mark Wagner¹, Tom Slezak¹ and Carol Ecale Zhou¹

¹Computing Applications and Research, Lawrence Livermore National Laboratory, Livermore, CA 94550 USA and ²Division of Cell Biology and Biophysics, University of Missouri-Kansas City, Kansas City, MO 64110 USA

ABSTRACT

Protein structural annotation and classification is an important and challenging problem in bioinformatics. Research towards analysis of sequence–structure correspondences is critical for better understanding of a protein's structure, function, and its interaction with other molecules. Clustering of protein domains based on their structural similarities provides valuable information for protein classification schemes. In this article, we attempt to determine whether structure information alone is sufficient to adequately classify protein structures. We present an algorithm that identifies regions of structural similarity within a given set of protein structures, and uses those regions for clustering. In our approach, called STRALCP (STRucture ALignment-based Clustering of Proteins), we generate detailed information about global and local similarities between pairs of protein structures, identify fragments (spans) that are structurally conserved among proteins, and use these spans to group the structures accordingly. We also provide a web server at http://as2ts.llnl.gov/AS2TS/STRALCP/ for selecting protein structures, calculating structurally conserved regions and performing automated clustering.

Benchmark results
(supplemental results posted on STRALCP web page)

In our benchmark test, a comparison with the SCOP classification, we performed STRALCP calculations for 4,620 SCOP domains classified within 25 different folds (see Table 1). The clustering method detects relationships at the family level in good agreement with the manually maintained SCOP database.

To evaluate the accuracy of our clustering approach, we estimated the differences between SCOP (ver. 1.71) and STRALCP clustering (for example, at the level of SCOP families) by introducing the following measure. Let:

· Nc - the number of created clusters,

· Cf(i) - the number of different families clustered together within the i-th cluster.

The score MC, which indicates the misclustering effect (domains from different SCOP families grouped together), can be calculated from the formula:

$$ MC = \frac{100}{N_c} \sum_{i=1}^{N_c} \left( 1 - \frac{1}{C_f(i)} \right) $$

The range of this measure is 0.0 <= MC < 100.0, where 0.0 indicates no misclustering (i.e., agreement with SCOP families separation).
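
As an illustration, the measure can be sketched in a few lines of Python. This is a minimal sketch and not the STRALCP implementation: it assumes that MC averages the term 1 - 1/Cf(i) over all Nc clusters and scales the result by 100, a form consistent with the stated range and with the entries of Table 1. The function name, the dictionary inputs, and the domain and family labels are hypothetical.

```python
from collections import defaultdict


def misclustering_score(cluster_of, family_of):
    """Return MC for a clustering compared against reference families.

    cluster_of: dict mapping domain id -> cluster label (hypothetical input)
    family_of:  dict mapping domain id -> reference family label
    Assumed form: MC = 100/Nc * sum over clusters of (1 - 1/Cf(i)).
    """
    # Collect the set of distinct reference families seen in each cluster.
    families_in_cluster = defaultdict(set)
    for domain, cluster in cluster_of.items():
        families_in_cluster[cluster].add(family_of[domain])

    n_clusters = len(families_in_cluster)  # Nc
    if n_clusters == 0:
        raise ValueError("at least one cluster is required")

    # A cluster holding a single family contributes 0; mixed clusters
    # contribute 1 - 1/Cf(i), which is always below 1.
    penalty = sum(1.0 - 1.0 / len(fams) for fams in families_in_cluster.values())
    return 100.0 * penalty / n_clusters


# Toy example with made-up labels: four clusters, one of which mixes two families.
cluster_of = {"d1": 0, "d2": 0, "d3": 1, "d4": 2, "d5": 3}
family_of = {"d1": "famA", "d2": "famB", "d3": "famC", "d4": "famD", "d5": "famE"}
mc = misclustering_score(cluster_of, family_of)
print(f"MC = {mc:.2f}")  # MC = 12.50
```

Under this assumed form, a fold with 26 clusters of which exactly one merges two families scores 100 * 0.5 / 26, which rounds to 1.92.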

SCOP Fold	Nd	Ns	Nf	Nc	MC
Fold a.5	79	9	13	28	7.74
Fold a.7	126	13	14	26	1.92
Fold a.8	65	8	11	14	4.76
Fold a.24	361	24	32	48	3.12
Fold a.29	79	6	7	12	4.17
Fold a.137	65	10	10	10	0.00
Fold b.2	293	10	23	41	0.00
Fold b.42	291	8	13	17	17.65
Fold b.43	225	4	8	17	2.94
Fold b.68	209	11	13	19	2.63
Fold b.80	122	7	17	20	0.00
Fold b.85	210	7	9	12	0.00
Fold c.8	302	9	11	13	0.00
Fold c.51	114	4	7	7	0.00
Fold c.56	452	6	12	20	5.00
Fold d.52	61	8	9	19	0.00
Fold d.68	104	7	9	17	2.94
Fold d.79	256	7	10	16	0.00
Fold d.110	153	7	17	21	2.38
Fold d.129	218	9	15	21	0.00
Fold f.1	46	5	5	17	0.00
Fold f.4	116	6	11	23	0.00
Fold f.23	295	30	30	37	12.16
Fold g.41	332	14	23	37	1.35
Fold h.4	46	14	14	19	5.26
Total in Folds	4620	243	343	531	2.96

Table 1. Results from the evaluation of the differences between SCOP (ver. 1.71) and STRALCP clusters at the level of SCOP families. Here Nd – number of domains within the fold, Ns – number of superfamilies, Nf – number of families, Nc – number of created clusters, MC – misclustering effect calculated by the introduced formula.

The MC measure allows different clustering schemes to be compared by how well they agree in separating proteins into different clusters. Its goal is not to count how many domains are clustered differently, but to assess how many of the created clusters are compromised, that is, merge proteins that are kept separate in the other clustering scheme. Table 1 shows the results of applying the MC formula to the clusters calculated for a selected set of 25 SCOP folds. We required that each selected fold contain multiple superfamilies (at least 4). In total, this benchmark test analyzed Nd=4,620 domains from Nf=343 SCOP families and Ns=243 superfamilies.
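
The totals quoted above can be checked against the per-fold rows of Table 1 with a short script. The sketch below copies the table values verbatim. It also suggests that the MC entry in the "Total in Folds" row matches the unweighted mean of the 25 per-fold MC values; this is an observation from the tabulated numbers, not something the text states. The variable names are illustrative only.

```python
# Rows copied from Table 1: (fold, Nd, Ns, Nf, Nc, MC)
ROWS = [
    ("a.5", 79, 9, 13, 28, 7.74), ("a.7", 126, 13, 14, 26, 1.92),
    ("a.8", 65, 8, 11, 14, 4.76), ("a.24", 361, 24, 32, 48, 3.12),
    ("a.29", 79, 6, 7, 12, 4.17), ("a.137", 65, 10, 10, 10, 0.00),
    ("b.2", 293, 10, 23, 41, 0.00), ("b.42", 291, 8, 13, 17, 17.65),
    ("b.43", 225, 4, 8, 17, 2.94), ("b.68", 209, 11, 13, 19, 2.63),
    ("b.80", 122, 7, 17, 20, 0.00), ("b.85", 210, 7, 9, 12, 0.00),
    ("c.8", 302, 9, 11, 13, 0.00), ("c.51", 114, 4, 7, 7, 0.00),
    ("c.56", 452, 6, 12, 20, 5.00), ("d.52", 61, 8, 9, 19, 0.00),
    ("d.68", 104, 7, 9, 17, 2.94), ("d.79", 256, 7, 10, 16, 0.00),
    ("d.110", 153, 7, 17, 21, 2.38), ("d.129", 218, 9, 15, 21, 0.00),
    ("f.1", 46, 5, 5, 17, 0.00), ("f.4", 116, 6, 11, 23, 0.00),
    ("f.23", 295, 30, 30, 37, 12.16), ("g.41", 332, 14, 23, 37, 1.35),
    ("h.4", 46, 14, 14, 19, 5.26),
]

# Column sums for Nd, Ns, Nf and Nc (tuple positions 1 to 4).
total_nd, total_ns, total_nf, total_nc = (
    sum(row[col] for row in ROWS) for col in range(1, 5)
)

# Unweighted mean of the per-fold MC values (tuple position 5).
mean_mc = sum(row[5] for row in ROWS) / len(ROWS)

print(len(ROWS), total_nd, total_ns, total_nf, total_nc, round(mean_mc, 2))
```

Averaging per fold gives every fold equal weight regardless of how many clusters it contains, so the figure differs from what a single MC computed over all 531 clusters pooled together would give.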

Results from STRALCP analysis of selected Folds from SCOP ver. 1.71