VAST: Vector Alignment Search Tool
 
 
 
Non-redundant PDB chain set
 
  The non-redundant PDB chain set is a set of sequence-dissimilar PDB polypeptide chains. It is derived by clustering chains into groups according to their amino acid sequence similarity and selecting one representative from each group (see below for details of the method).

Four sets of chains with different levels of non-redundancy are available. They are based on clustering with four different sequence-similarity cutoffs: BLAST p-values of 10e-7, 10e-40, and 10e-80, and 100% sequence identity (see below). The set based on the p-value cutoff of 10e-7 is the least redundant. The set based on 100% sequence identity simply contains all chains with distinct sequences.

Below, you can browse the non-redundant set, or browse the clusters of sequence-similar chains; one chain from each cluster was selected for the non-redundant set. You can also download an ASCII text file that summarizes the non-redundant PDB chain set.

 
 

Representatives

This lists a set of sequence-dissimilar chains (non-redundant set):

  Non-redundancy Display

 
 

Cluster of Sequence-Similar Chains

This shows the cluster of sequence-similar chains to which the query PDB chain belongs, and from which a representative was selected for the non-redundant set:

Non-redundancy Query  
PDB code
chain id   

 
 

Download Summary Table

From the MMDB FTP site, you can download an ASCII text file that summarizes the non-redundant sets in the current MMDB release or in previous releases:

ftp://ftp.ncbi.nih.gov/mmdb/nrtable/
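
As a rough illustration, the directory above could be listed from a script before choosing a file to download. This is a minimal sketch using Python's standard `ftplib`; the helper names `split_ftp_url` and `list_nrtable_files` are made up for this example, anonymous login is assumed to be accepted, and nothing is assumed about the names or layout of the files in the directory.

```python
from ftplib import FTP
from urllib.parse import urlparse

# URL taken from this page.
NRTABLE_URL = "ftp://ftp.ncbi.nih.gov/mmdb/nrtable/"


def split_ftp_url(url):
    """Split an ftp:// URL into (host, directory). Hypothetical helper."""
    parts = urlparse(url)
    if parts.scheme != "ftp":
        raise ValueError(f"not an FTP URL: {url}")
    return parts.hostname, parts.path


def list_nrtable_files(url=NRTABLE_URL, timeout=30):
    """Return the sorted file names in the directory (needs network access).

    Assumes the server accepts anonymous logins.
    """
    host, directory = split_ftp_url(url)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()          # anonymous login
        ftp.cwd(directory)
        return sorted(ftp.nlst())
```

Calling `list_nrtable_files()` on a machine with network access would return whatever file names the server reports at that time.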

 
 
 
 
Method for making the non-redundant set
 
  All chains available from PDB are compared with each other using the BLAST algorithm as implemented in the NCBI toolkit library. They are then clustered into groups of sequence-similar chains using a single-linkage clustering procedure. The chains within each group are automatically ranked according to the precision and completeness of their structural data. The following measures of structural quality are used, in this order of priority:

  1. Lower percentage of residues with unknown amino acid type,
  2. Lower percentage of residues with incomplete coordinate data,
  3. Lower percentage of residues whose coordinate data are missing,
  4. Lower percentage of residues with incomplete side-chain coordinate data,
  5. Higher resolution,
  6. Larger number of chains (subunits) contained in the PDB entry,
  7. Larger number of heterogens contained in the PDB entry,
  8. Larger number of different types of heterogens,
  9. Larger number of residues, and
  10. Alphanumerical order of their PDB codes.

The top-ranked chain is generally chosen as the representative of the group. In some cases, however, the authors may manually choose a lower-ranked chain. For example, if the top-ranked chain is a mutant protein and a native protein of reasonably comparable structural quality is available, the lower-ranked native protein might replace the mutant. The representatives from all the groups together form a non-redundant set.
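
The automatic part of this ranking amounts to a lexicographic sort over the ten measures. The following is a minimal sketch of that idea, not the actual NCBI implementation: the `ChainQuality` record and its field names are hypothetical, "higher resolution" is assumed to mean a numerically smaller value in ångströms, and the manual override by the authors is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChainQuality:
    """Hypothetical per-chain record; field names are illustrative only."""
    pdb_code: str
    chain_id: str
    pct_unknown_type: float           # measure 1, lower is better
    pct_incomplete_coords: float      # measure 2, lower is better
    pct_missing_coords: float         # measure 3, lower is better
    pct_incomplete_sidechains: float  # measure 4, lower is better
    resolution: float                 # measure 5, in angstroms (assumed: smaller = higher resolution)
    n_chains: int                     # measure 6, larger is better
    n_heterogens: int                 # measure 7, larger is better
    n_heterogen_types: int            # measure 8, larger is better
    n_residues: int                   # measure 9, larger is better


def rank_key(c):
    """Sort key: the smallest tuple is the top-ranked chain.

    "Larger is better" measures are negated so that one ascending
    comparison covers all ten criteria in priority order.
    """
    return (
        c.pct_unknown_type,
        c.pct_incomplete_coords,
        c.pct_missing_coords,
        c.pct_incomplete_sidechains,
        c.resolution,
        -c.n_chains,
        -c.n_heterogens,
        -c.n_heterogen_types,
        -c.n_residues,
        c.pdb_code,                   # measure 10, alphanumerical order
    )


def pick_representative(cluster):
    """Return the top-ranked chain of one cluster (automatic ranking only)."""
    return min(cluster, key=rank_key)
```

Because Python compares tuples element by element, a later measure is consulted only when all earlier ones tie, which matches the stated order of priority.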

In comparing sequences, the database-size parameter of the BLAST algorithm is fixed at 500,000. This allows constant p-value cutoffs to be used in clustering. Four different similarity cutoffs are used: BLAST p-values of 10e-7, 10e-40, and 10e-80, and 100% sequence identity. This results in a hierarchical clustering of PDB chains and four sets of representatives with different levels of non-redundancy.
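
To illustrate why successive cutoffs give a hierarchy, here is a minimal sketch of single-linkage clustering using a union-find structure. It is not the NCBI code: the function name, the `(chain_a, chain_b, p_value)` hit tuples, and the toy p-values are all hypothetical, and no BLAST comparison is actually run. Only the three p-value levels are shown; the 100% identity level would need identity data that this sketch does not model. The cutoffs are written as they appear on this page; note that Python reads `10e-7` as `1e-6`, and the sketch does not try to settle which value is intended.

```python
def single_linkage(chains, hits, cutoff):
    """Group chains so that any pair with p_value <= cutoff shares a cluster.

    chains: iterable of chain identifiers
    hits:   iterable of (chain_a, chain_b, p_value) tuples (hypothetical format)
    Returns a sorted list of sorted member lists.
    """
    parent = {c: c for c in chains}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, p in hits:
        if p <= cutoff:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a   # merge the two groups

    clusters = {}
    for c in chains:
        clusters.setdefault(find(c), []).append(c)
    return sorted(sorted(members) for members in clusters.values())


# Toy data: four chains linked in a row by progressively weaker hits.
chains = ["1AAA_A", "1BBB_A", "1CCC_A", "1DDD_A"]
hits = [
    ("1AAA_A", "1BBB_A", 1e-90),
    ("1BBB_A", "1CCC_A", 1e-50),
    ("1CCC_A", "1DDD_A", 1e-10),
]

# Cutoffs copied from the page's notation, loosest first.
levels = {cutoff: single_linkage(chains, hits, cutoff)
          for cutoff in (10e-7, 10e-40, 10e-80)}
```

With this toy data the loosest cutoff puts all four chains in one cluster, the middle cutoff splits off `1DDD_A`, and the strictest cutoff leaves only `1AAA_A` and `1BBB_A` together. Each stricter level subdivides the clusters of the looser one, which is the hierarchical behaviour described above.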

The non-redundant set does not include chains with fewer than 20 residues or chains whose coordinates come from a theoretical model. A chain with more than 5% "UNKNOWN" residues is included in the clustering but will not be selected as a representative.
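
These two rules can be read as a pair of predicates, one for entering the clustering and one for being eligible as a representative. The sketch below is only an interpretation of the paragraph above; the function and parameter names are hypothetical.

```python
MIN_RESIDUES = 20          # chains shorter than this are left out
MAX_UNKNOWN_PERCENT = 5    # more than this share of "UNKNOWN" residues bars selection


def in_clustering(n_residues, is_theoretical_model):
    """True if the chain takes part in the clustering at all."""
    return n_residues >= MIN_RESIDUES and not is_theoretical_model


def can_be_representative(n_residues, n_unknown, is_theoretical_model):
    """True if the chain may be selected as a cluster representative."""
    if not in_clustering(n_residues, is_theoretical_model):
        return False
    # Integer arithmetic avoids rounding issues right at the 5% boundary.
    return n_unknown * 100 <= MAX_UNKNOWN_PERCENT * n_residues
```

For example, a 100-residue experimental chain with 6 unknown residues would still be clustered, but `can_be_representative(100, 6, False)` is false.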

The non-redundant set is updated on a regular basis (about once a month), in synchronization with updates of MMDB and the VAST database of structure neighbors.
 
 
 
 
Revised 26 September 2016