Blast clusters
--------------

This directory contains the results of the weekly clustering of protein
chains in the PDB generated by blastclust. These clusters are used in
the "remove similar sequences" feature on the RCSB PDB website (see also
http://www.rcsb.org/pdb/statistics/clusterStatistics.do ).

Protein chains that contain fewer than 20 amino acids are excluded from
the clustering. Only known amino acids are counted in the length
calculation (amino acids designated as 'X' are excluded).

The file bc-30.out is obtained by running blastclust with parameters

    -c param_file.txt[-e 0.01] -p T -b T -S 30

for clustering at 30% sequence identity. The parameter file
param_file.txt sets the e-value to 0.01. Blastclust results at 40%, 50%,
70%, 90%, 95%, and 100% sequence identity are generated accordingly.

The file format is one line per cluster.

Since BLASTClust does not consider upper case and lower case
identifiers to be different, any lower case chain IDs in the PDB are
renamed (e.g. 1ABC:a would become 1ABC:aa) prior to running BLASTClust.

Basic local alignment search tool, S.F. Altschul, W. Gish, W. Miller,
E.W. Myers, & D.J. Lipman (1990) J. Mol. Biol. 215:403-410.
http://blast.ncbi.nlm.nih.gov/Blast.cgi

cd-hit clusters
---------------

For comparison, sequence clusters are also generated by cd-hit. The
files clusters50.txt, clusters70.txt, clusters90.txt and clusters95.txt
list the clusters at 50%, 70%, 90% and 95% sequence identity.

Protein chains that contain fewer than 20 amino acids are excluded from
the clustering. Only known amino acids are counted in the length
calculation (amino acids designated as 'X' are excluded).

The file format is:

    cluster# rank# chainID

Smaller rank numbers indicate higher (better) ranking. Chains with rank
number 1 are ranked as the best representative of their cluster.

Cd-hit: a fast program for clustering and comparing large sets of
protein or nucleotide sequences, Weizhong Li & Adam Godzik (2006)
Bioinformatics, 22:1658-9.
http://bioinformatics.oxfordjournals.org/cgi/content/full/22/13/1658

The file not_in_clusters.txt contains nucleic acid chains and short
polypeptides of fewer than 20 amino acids, which are not clustered.

The file XrayAndNmr.txt lists clusters that are likely to be examples of
the same protein having been solved by both X-ray and NMR.
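
The two cluster file formats described above could be read along the
lines of the following sketch. It is an illustration, not a tool shipped
with this directory. It assumes that the chain IDs on each blastclust
line are separated by whitespace, and that the three cd-hit columns
(cluster#, rank#, chainID) are separated by whitespace as well; neither
delimiter is stated in this README. The function names and the sample
chain IDs are made up for the example.

```python
from typing import Dict, List, Tuple


def parse_blastclust(text: str) -> List[List[str]]:
    """Parse bc-*.out style text: one cluster per line.

    Assumption: chain IDs within a line are whitespace-separated.
    """
    return [line.split() for line in text.splitlines() if line.strip()]


def parse_cdhit(text: str) -> Dict[int, List[Tuple[int, str]]]:
    """Parse clusters*.txt style text: 'cluster# rank# chainID' per line.

    Assumption: the three columns are whitespace-separated.
    Returns {cluster number: [(rank, chain ID), ...]} sorted by rank.
    """
    clusters: Dict[int, List[Tuple[int, str]]] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        cluster_id, rank, chain_id = int(fields[0]), int(fields[1]), fields[2]
        clusters.setdefault(cluster_id, []).append((rank, chain_id))
    for members in clusters.values():
        members.sort()  # smaller rank number = better ranking
    return clusters


def representatives(clusters: Dict[int, List[Tuple[int, str]]]) -> Dict[int, str]:
    """Pick the best-ranked (lowest rank number) chain of each cluster."""
    return {cid: members[0][1] for cid, members in clusters.items()}


if __name__ == "__main__":
    # Made-up sample data, not taken from the real files.
    bc_sample = "1ABC:A 1ABC:B 2XYZ:A\n3DEF:A\n"
    cdhit_sample = "1 2 1ABC:B\n1 1 1ABC:A\n2 1 3DEF:A\n"

    print(parse_blastclust(bc_sample))
    print(representatives(parse_cdhit(cdhit_sample)))
```

To use it on real data, read a file such as bc-30.out or clusters50.txt
into a string and pass it to the matching function. Note that lower case
chain IDs appear in their renamed form (e.g. 1ABC:aa) in the blastclust
files, so they will not match the original PDB chain IDs directly.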