Similarity Search

João Carlos Setubal; Ruediger Braeuning

NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Gruber A, Durham AM, Huynh C, et al., editors. Bioinformatics in Tropical Disease Research: A Practical and Case-Study Approach [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2008.

Cover of Bioinformatics in Tropical Disease Research

Bioinformatics in Tropical Disease Research: A Practical and Case-Study Approach [Internet].

Show details

Contents

< Prev Next >

Chapter A05Similarity Search

João Carlos Setubal, PhD and Ruediger Braeuning, PhD.

Author Information and Affiliations

Created: February 10, 2006; Last Update: March 9, 2007.

João Carlos Setubal has authored the theoretical part of this chapter and Ruediger Braeuning has authored the tutorial.

Why Sequence Comparison Is Important

Consider the following protein sequence:

It is the sequence of a small protein found in humans, called cytochrome c. Its function is to carry electrons in the cell. The full description from the SwissProt database (http://us.expasy.org, also mirrored at other sites) is:

Electron carrier protein. The oxidized form of the cytochrome c heme group can accept an electron from the heme group of the cytochrome c1 subunit of cytochrome reductase. Cytochrome c then transfers this electron to the cytochrome oxidase complex, the final protein carrier in the mitochondrial electron-transport chain.

This description, although irrelevant in what follows, serves the useful purpose to remind us that proteins are three-dimensional structures that perform specific biochemical roles in a cell. It is easy to forget this fundamental fact when dealing with the strings of letters that represent proteins (or DNA), as I do in this entire chapter. Now consider the following sequence:

It is the sequence of a protein found in mice, whose function is the same as that of the human protein, and therefore is also known as cytochrome c. A cursory look at these two sequences suggests that they are very similar. In fact, one could play the children's game "spot the differences" and, with some patience, visually determine all differences that exist between these two sequences. For example, the first difference is in the 11th residue: in the human sequence, we find an I, whereas in the mouse sequence there is a V.

One way to be more systematic about finding the differences is to align these two sequences. This means placing one on top of the other, so that it is easier to tell which positions are the same and which positions are different. Here is this alignment, with the human sequence on top and the mouse sequence on the bottom:

Notice that the alignment had to be broken into two sections, because the sequences are too long to fit in a single line. Notice also that there are asterisks below the second sequence to show positions where there is a match between a residue (or character) in the first sequence and a character in the second sequence. Thus, we can immediately see that there are exactly nine positions where these two sequences differ, of a total of 104 positions. This gives us our first measure of similarity between these two sequences: they are 91% (95/104) identical.

Intuition suggests that this is a high level of similarity, and indeed it is. It is no coincidence that two proteins, although from two different species, have the same function with highly similar sequences. Humans and mice are mammals, and they share an ancestor that probably existed about 80 million years ago. It is likely that this ancestor had a cytochrome c protein, and the sequences above are descendants of that primitive sequence, just as each species is as a whole (but please note the caveats in the next section). Many of the protein sequences in mice and humans are similar for this reason, and this gives us a powerful tool to find clues about protein function. In the case of cytochrome c, independent bench experiments determined the function of each of the two proteins. For many other newly discovered proteins, however, we can formulate hypotheses for their function based on similarity with known proteins. This is a primary motivation for comparing sequences.

If sequence similarities were restricted to closely related species, such as humans and mice, sequence comparison would be of limited use. Fortunately, this is not the case, as can be shown by comparing the human cytochrome c sequence to cytochrome c sequences from other organisms.

Let us start with Drosophila melanogaster, the fruit fly. Here is its sequence:

This time it is not so straightforward to align this sequence to the human sequence. The first problem is that this sequence has 107 characters, instead of 104. So it is not immediately clear how to align one on top of the other, as we did before. But again, some visual inspection suggests that the sequences are very similar. For example, we could notice that both sequences have the strings GDVEKGKK at or near the beginning.

On the basis of this observation, we could create the following alignment:

We have obtained an alignment where there are 80 identical positions. Again, this level of similarity is very high. If we did not know what the function of the D. melanogaster protein was, we would be able to hypothesize that it has the same function as the one from human, although humans and flies are very different organisms.

Now let us see what we can do with the cytochrome c sequence from Caenorhabditis elegans, a worm and well-studied model organism. Here is its sequence:

This sequence has 107 characters, so we have again the problem of differing lengths. Our method of visual inspection may suggest local similarities, but to obtain a good alignment, it will not suffice to simply shift one sequence with respect to the other as we did with the human and fly comparison. Here is what happens if we do this (we show only the first part of the alignment):

Notice that there are several matches in the beginning, but they then virtually disappear. However, visual inspection shows that there would be several matches in the second half of this alignment if only we could shift the C. elegans sequence by one position, in the middle, thus:

Notice the introduction of a "space" character (represented by the hyphen) to accomplish the desired shift, aligning it with an H residue in the human sequence. The rest of the alignment becomes:

We have obtained an alignment with 59 matches, which is still very good, again showing the high degree of conservation of this particular protein sequence, even between organisms so different as humans and worms. However, from an alignment-construction point of view, two basic questions arise from this last example. How can we be sure that the alignment shown is the best one? Maybe there is another way of introducing spaces that will give us more than 59 matches. The second question is: What method should be used to create[AG1] alignments? Thus far, we have done it manually. If we had to deal with longer sequences, then doing a manual alignment and using a trial-and-error method would clearly be very tedious and prone to error. These two questions suggest that we need a rigorous way for defining good alignments (and in particular, the best alignment); and we need an algorithm, that is, a mathematical procedure that can be implemented in a computer program, to find the best alignment. This motivates the sections that follow.

Caveats Regarding Similar Sequences, Evolutionary History, and Gene Function

In the previous section, we made the statement that highly similar protein sequences are probably derived from a common ancestor sequence, and that they also probably have the same molecular function. Before we plunge into technical details regarding sequence similarity, it is important to qualify these statements.

The keyword in this qualification is "probably". We are making inferences about past history, and therefore we can never be 100% sure. Indeed, there are known cases of similar sequences that do not share a common ancestor. This can happen due to chance (more on that in a later section) or due to a phenomenon called convergent evolution (whereby two sequences or characteristics evolve to become similar but follow different paths, as is the case of wings in bats and birds). But it is an observed fact that the more two sequences are similar, the higher the probability that they are derived from a common ancestor.

Note also that one should be very careful about generalizing the evolutionary history of particular genes to that of the species as a whole to which the genes belong. The cytochrome c example of the last section suggests that the evolutionary history of this protein reflects the evolutionary history of the different species mentioned. This is probably true for cytochrome c, but it is by no means always the case with other proteins. It is known that the genome of all organisms sometimes incorporates DNA from other organisms, a phenomenon called lateral (or horizontal) transfer, as opposed to the more traditional mode of vertical descent, from parent to offspring. When such an event happens, and the foreign DNA contains genes (as it usually does), the gene in question will have a very different evolutionary history compared with the genes that have been vertically transmitted.

Finally, note that although sequence similarity is usually a good indication of the same protein function, the converse is not true. There are plenty of examples of proteins with the same function that share no significant similarity, even at the amino acid level.

What Is Sequence Similarity?

We now address how to define a measure of similarity between sequences. In fact, we have already used one such measure without defining it: the percent identity similarity measure. In this measure, similarity between two sequences is defined as the maximum number of matches that can be found in an alignment between the sequences divided by the alignment length. The notion of maximum number of matches implies that we need the best alignment between the sequences; in other words, the best alignment is the one that yields the maximum number of matches.

When we compared the human and mouse sequences above, we obtained the value 91% for their similarity under this measure. How can we be sure that it is correct (i.e., the maximum number of matches)? That alignment was so simple to create that a manual inspection should be enough to convince us that it is the best possible alignment. For more complicated alignments, we will need an algorithm to find the best alignment, but we will defer the topic of algorithms for the next section. Assuming that the other alignments shown in the previous section do yield the maximum possible number of matches, we would obtain the following similarity values: human vs. fly: 80/108 = 74%; human vs. worm: 59/111 = 53%. Notice that the denominator changes in all three cases, because each alignment has its own length. Note also that this simple similarity measure can in principle be used to quantify the evolutionary distance that separates humans from mice, flies, and worms. Evolutionary analysis based on sequence comparison is a fascinating topic, but it is beyond the scope of this chapter.

Percent similarity is a reasonable measure of similarity, but there is another, more sophisticated one that is more widely used. It takes into account the fact that when aligning sequences, we very often to use spaces, as was the case with the human vs. worm comparison of the last section. The use of spaces introduces a third case of single-character alignment. The first two are the match and the mismatch. The third one is when a character is matched to a space. It should be intuitive that an alignment between a character and a space is an event different in kind from an alignment between two different characters (a mismatch). A mismatch implies a substitution, that is, one residue was replaced by another one over the course of evolution (or perhaps both were replaced). A space implies either a deletion in the sequence that needs it in the alignment or an insertion in the other sequence. Evolutionarily speaking, insertions/deletions are events different from substitutions and, therefore, should be considered in different ways when evaluating alignments.

In the percent identity similarity measure, we simply counted the number of matches. This is equivalent to saying that each match counts for +1 and each mismatch or space counts for 0. If we want to treat mismatches and insertions/deletions (also known as indels) differently, then we need different numbers for them. It is known that indels are rarer events than substitutions. Therefore we will assign −1 to mismatches and −2 to indels (we use negative values to stress the fact that these are penalties, which contribute to decrease the total similarity score).

What we have just created is a scoring system for alignments. Whenever two sequences are compared and their similarity is computed, there has to be an underlying scoring system. The scoring system just described is very simple, but it has been shown to be good enough for most nucleotide (or DNA) alignments. Let us see a simple example:

In this alignment, we are using vertical bars to denote matches. Notice that there are three indels and one mismatch. Using the scoring system defined above, we obtain the value 1 + 1 − 2 + 1 + 1 − 1 + 1 − 2 + 1 + 1 + 1 − 2 + 1 + 1 = 3. What is this value? If this is the maximum possible score for any alignment between these two sequences, we will say that it is their similarity under the new measure. Finding this kind of best alignment requires an algorithm, just as it does for percent identity.

The measure of similarity just described is so widely used that it does not have a special name, as percent identity does. It is simply referred to as the similarity in many textbooks. However, a cautionary remark has to be given here concerning the word homology. In many scientific papers and published literature in general, it is common to say that two sequences have such-and-such percent homology, or that one sequence "has homology" to another. In most cases, the authors in fact are making statements about the similarity of the sequences being compared, and not about their homology. Two sequences are homologous when they share a common ancestor, that is, they are phylogenetically related. Two highly similar sequences may indeed be homologous (as is the case of the human and mouse cytochrome c sequences seen before), and high similarity is a strong indication of homology. However, one should never lose sight of the fact that the numerical quantity associated with the comparison is similarity (under various guises), and that homology is a hypothesis about their shared evolutionary history. The term homology search, frequently used when doing sequence comparison in sequence databases, can, in light of this explanation, be understood as a search for similar sequences that, if similar enough, will likely be homologous.

Another important comment here is about "degree of similarity". We have said that the human and mouse sequences are highly similar. But what does "high" mean? What number has to be achieved to be considered high? In the case of percent identity, we have an intuition (90% seems high and 10% seems low), but in the case of the value 3 obtained in the nucleotide comparison shown above, we are more or less at a loss to state whether it is high or low. This is an extremely important question in the study of sequence comparison, and as we shall see, its answer lies in the statistical significance of the value obtained. But before we go into that, we have to address the similarity computation question.

Finding the Similarity Between Two Squences by Dynamic Programming

We will present an overview of the algorithm that finds the best alignment between two sequences under the +1/−1/−2 scoring scheme defined in the previous section. Recall that an algorithm is a mathematical procedure with well-defined steps that runs in[AG2] a finite time and is guaranteed to yield a certain result (the similarity and the corresponding alignment, in our case). An algorithm is different from a computer program; a program is the embodiment (or implementation) of an algorithm in a particular programming language and to be run on particular computers.

The algorithm we will describe relies on a design technique called dynamic programming (DP). This is a historical name that has essentially nothing to do with the way the technique works. DP is used in many bioinformatics algorithms, and references to it abound in the published literature. So it is worth having a general idea of the concept.

DP works basically by constructing a table to store intermediary results that are used to obtain the final result. Here is the general idea, explained through a toy problem. The problem is to go from point A to point E using the shortest path. It is the case that point C lies between A and E. If we know the best (shortest) way to get from A to C, then what remains to be solved is how to go from C to E. In DP, we would store in a table the A-C solution and then use it later to decide the best way to go from A to E. See Figure 1.

Figure 1

General idea of how dynamic programming works. The problem is to find the shortest path from A to E. Numbers on arrows denote the distance between the two points linked. We initially find the shortest way from A to C (which is through B₁ and is 7 units (more...)

In the case of sequence alignment, DP works by trying to align one sequence with parts of the other sequence. These "intermediary" alignments correspond to intermediary "nodes" in problems similar to those illustrated in Figure 1. Let us explain through an example. Suppose we want to align sequence s = GATC to sequence t = ATA. In terms of what happens with the last column of the alignment, there are exactly three possibilities: either we have C aligned with a space, or C aligned with A, or A aligned with a space. There can be no other possibility. Each of them will have its score (−2,−1,−2, respectively). Which one we should choose will depend on the best possible intermediary alignment that each implies. Let us see what this means.

If we align C with a space, the score of the final alignment would be −2 plus the score of the best alignment between GAT and ATA. If we align C with A, the score of the final alignment would be −1 plus the score of the best alignment between GAT and AT. Finally, if we align A with a space, the score of the final alignment would be −2 plus the score of the best alignment between GATC and AT. Because in DP we keep a table with all intermediary results, it is a simple matter to choose the best one: it is simply the maximum value among all three sums considered.

The overall situation is best visualized by a table:

		G	A	T	C
	0	−2	−4	−6	−8
A	−2	−1	−1	−3	−5
T	−4	−3	−2	0	−2
A	−6	−5	−2	−2	−1

The first row contains sequence s, and the first column contains sequence t. The second row contains the values obtained assuming that all characters of s are aligned with spaces. There is an analogous situation for the second column. The other entries are the result of systematic computation, as was described above for the last character. The exact same reasoning applies to the alignment decision to be done for the ith character in s and the jth character in t. The similarity will be the number obtained for the bottom right cell in the matrix (−1 in this case). It corresponds to the alignment:

The algorithm to obtain the similarity simply fills out this table in the systematic way described. The time it will take will be proportional to the size of the table (in terms of the number of cells), which is clearly the length of s plus 1 mulTiplied by the length of t plus 1. If s and t are of approximately the same length, then the algorithm can be said to run in quadratic time with respect to the lengths of the input sequences. This has important consequences, as we shall see later on.

There is a relationship between the shortest path toy problem that we presented above and the sequence comparison problem that goes deeper than the fact that both are problems that can be solved by DP. Each cell in the DP matrix can be thought of as connected to the neighboring cells by labeled "arrows", much in the same way that nodes in the shortest path problem are connected by labeled arrows. A generic matrix cell can be depicted as in Figure 2.

Figure 2

A generic cell of the dynamic programming matrix for sequence comparison.

The numeric label on each arrow gives the "distance" from (or to) the neighboring cell. The numbers are exactly those of the scoring scheme that we have described. The diagonal arrows have two labels because the "distance" will depend on whether the two characters are the same (score +1) or different (score −1). In this framework, the filling of the table is equivalent to finding the longest path (as opposed to shortest, in the toy example above) from the top left cell to the bottom right cell. It is longest because similarity is a maximum value.

Filling out the table only gives the similarity (a number). To retrieve the actual alignment, we need another procedure. After the table has been completed and starting from the bottom right cell, all we need to do is to determine which previous cell was chosen in determining that cell's value, exactly as we described above. In the example, the value −1 in the bottom right cell was the result of 0 − 1, where the 0 comes from the cell diagonally across. This means that we aligned C with A. To learn the next position of the alignment, we go to the cell containing 0 and ask the same question, and so on. This is called the traceback.

We will now give an overview of a more sophisticated algorithm, which is closer to the one actually used in practice when sequence comparison is done by dynamic programming. The difference with respect to the algorithm just described lies with the way a series of consecutive spaces (also called gaps) are treated. It has been observed that certain homologous proteins differ from each other not only by substitutions or by single indels but also by blocks of insertions and deletions. Consider the alignment:

This alignment has three consecutive spaces in the top sequence. It probably reflects a single deletion event in which all three corresponding residues were lost by the protein (alternatively, there could have been an insertion in the bottom sequence). If we were to charge for each space separately, this would mean treating each space as a separate evolutionary event. In general, we do not want to do that. We want to charge it as if it were a single event. One solution would be to charge just −2 for the whole gap. However, the longer the gap is, the less likely it is that it was indeed a single event. We want a formula that will enable us to charge more for longer gaps. One such formula that is widely used is:

where p(k) it the amount to be charged; k is the gap length (number of spaces); and h and g are constants, such as −2 and −1. Using these values, the formula means that for a gap with a single space, we will charge −3; for a gap with two spaces, we will charge −4, and so on. The constant h corresponds to the value we charge to open a gap; the value of g determines the cost based on the gap length. Notice how the formula yields a different result for a gap than what we would get if we were charging for each space separately. For k = 10, we would charge −20 if each space were separately charged; with the formula, the cost goes down to −12. Such a way of charging gaps is known as an affine penalty function.

We have to change the algorithm described above if we use an affine gap penalty function, and here is a brief overview about how it is done. Important details will be omitted, so do not expect a full understanding of the algorithm. As before, there is a table to store results; but instead of a single table, there are three separate tables. Each table corresponds to one of the three possible cases of alignment of a given column (character in s with space, or character in s with character in t, or character in t with space). The separation in three tables allows us to distinguish the case where we have a single space from the case where the space is simply the first in a gap. This distinction is crucial for the application of the affine formula.

Although the algorithm requires three tables instead of one, its running time remains quadratic in the length of the input sequences. In other words, there is no significant increase in running time because of the use of the affine gap penalty function as compared with the more simple way of charging spaces individually; this fact has accounts for the widespread use of DP with affine gap penalty functions.

The two algorithms described will perform a global alignment between two sequences. This means that all of sequence s has to be aligned with all of sequence t. In most cases, just a local alignment is desired. A local alignment is an alignment of a region of sequence s (usually called a substring or subsequence, although the two concepts are mathematically different) and a region of sequence t. Figure 3 illustrates schematically global and local alignments. In many cases, the best similarity score between two sequences is obtained by a local alignment instead of a global alignment; however, there are other cases in which the correct alignment to be obtained is global, even if its score may not be as good as the best local alignment between the same sequences. Yet another form of alignment is known as semi-global. In this case, we want to find the best alignment between a prefix of a sequence (the initial part) with the suffix (the final part) of another sequence. Such an alignment is needed in DNA fragment assembly and roughly corresponds to finding puzzle pieces (DNA fragments) that interlock. It is possible to easily modify the two algorithms described so that they will compute local or semi-global alignments; we do not describe such modifications here.

Figure 3

Schematic illustration of global and local alignments. The boxes represent sequences, and the shaded parts are aligned.

The dynamic programming algorithm with an affine gap penalty function was developed by Smith and Waterman (6) and is therefore known as the Smith-Waterman algorithm. Many efficient and open-source implementations of the Smith-Waterman algorithm can be obtained through the Internet.

Amino Acid Scoring Systems

When comparing protein sequences, simple scoring schemes, such as +1 for a match, −1 for a mismatch, and −2 for a space, are not appropriate. To understand why, let us once again refer to the cytochrome c alignment between the human and worm sequences we have presented in the first section, which we copy here for convenience.

Let us look at the mismatches that appear after the first string of matches. In the first mismatch, we have I (isoleucine) aligned with V (valine). The biochemical properties and sizes of I and V are similar (both are hydrophobic and aliphatic). The amino acids in the next mismatched pair, F (phenylalanine) and Y (tyrosine), are also similar in size and properties. For the following pair, I and K (lysine), this is no longer true. These two have different properties and sizes. It would be advantageous to have a scoring system that reflects these similarities and differences; the reason lies with the way the evolution of biological molecules proceeds. If a nucleotide mutation in a codon causes change in the corresponding amino acid, that mutation may or may not persist, depending on the relationship between the original amino acid and the new one. For instance, mutations that result in an amino acid of size similar to the original one, or with similar biochemical properties, are more likely to succeed (meaning that the mutations still yield a sequence that is successfully translated into a working protein). Therefore, when aligning protein sequences, we would like to give a higher score to an alignment between I and V than to one aligning I and K.

Intense and detailed studies of proteins known to be homologous have been carried out over the years. The outcome of these studies has been a number of substitution matrices that can be used to score individual amino acid alignments when comparing protein sequences. One of the most commonly used of such matrices is called BLOSUM62. It appears in Figure 4 (BLOSUM comes from BLOck Substitution Matrix, and these matrices were originally proposed by Henikoff and Henikoff (15)).

Figure 4

BLOSUM62 Substitution Matrix; see source, ftp://ftp.ncbi.nlm.nih.gov/blast/matrices

Let us understand this matrix. First of all, it is easy to use it to determine the score of a given pair of aligned amino acids. Just find one amino acid in a row and then the other amino acid in a column. The corresponding entry in the matrix gives the score for their alignment (the matrix is symmetric, so it does not matter which of the two amino acids you pick to find the row). So, for example, for I and V, we get a score of +1, whereas for I and K, the score is −3. Notice that the diagonal contains the highest values (the self scores). These values are used for matches. It may seem strange that these values are not all equal, but this reflects the fact that over the course of evolution, it may be the case that after a series of substitutions, an amino acid can revert back to what it was originally. The higher the probability of this happening, the lower the self score of this amino acid (in other words: amino acids that can be more easily substituted by other amino acids will have lower self scores than those that are less easily substituted). So tryptophan (W), the largest of all amino acids, has the highest self-score, and this means that if we see a W aligned with a W, it is more probable (with respect to the other amino acids) that no substitution at all happened at this position. Tryptophan's low substitution rate may reflect the unique role it plays in protein folding because of its relatively large size. [AG3]

If we use BLOSUM62 to score a human vs. worm cytochrome c sequence alignment, this is what we get (only part of the alignment is shown):

We are presenting here yet another format to display alignments. The original sequences appear in the topmost and bottommost rows. In the middle are shown the matches (by repeating the matched character) as well as the positive mismatches, or simply positives (denoted by the + sign), which are those single-character alignments that have positive scores according to the substitution matrix that was used. This shows that the similarity between the human and the worm sequence is even better than what we obtained when we considered only a percent identity scoring system. This increase in accuracy is highly desirable, especially when comparing distantly related organisms, as is the case of humans and worms.

Ignoring the unmatched residues at the start of the human sequence (refer back to the original alignment in the first section), the human/worm alignment has a score of 339 when using BLOSUM62. We will see later on that more important than this score (usually called the raw score) is the statistical significance of the score.

Why does BLOSUM62 have a number in its name? Because evolution takes place over many millions of years, the kinds of substitutions observed between two sequences will differ, depending on how far apart they are, evolutionarily speaking. Certain kinds of substitutions will be more probable for proteins that have diverged for a long time as compared with proteins that have diverged more recently. The number in the matrix reflects this fact (62 means 62% identity; and, roughly speaking, it is a measure of the minimum similarity among the proteins or protein segments used to derive the matrix). Higher BLOSUM matrix numbers (e.g., BLOSUM80) are derived from more recently diverged proteins, and lower numbers (e.g., BLOSUM45) are derived from more distant proteins. This implies that substitution penalties in higher-numbered BLOSUM matrices are lower than in lowered-numbered ones. Of course it may be the case that we do not know beforehand how far apart, evolutionarily, are the two sequences that we want to compare. In those cases, it is thought that BLOSUM62 is a good compromise. [AG4]

Another set of substitution matrices used in sequence comparison is called PAM (for Point (or Percent) Accepted Mutations, described in Dayhoff, Schwartz and Orcutt (14)). PAM matrices give an estimate of substitution rates among amino acids, assuming a certain percentage of amino acid changes. For example, the PAM1 matrix assumes one change for every 100 amino acids on average; the PAM250 matrix assumes 250 changes for every 100 amino acids on average. Thus, the numbers in the PAM matrix are directly related to evolutionary distance (which is the opposite of BLOSUM matrices, where the numbers are inversely related to evolutionary distance). PAM matrices other than PAM1 were not computed from real data, but extrapolated from PAM1. Historically PAM matrices were proposed before BLOSUM matrices, and for general sequence comparison they are considered less reliable than BLOSUM matrices. [AG5]

Explaining in detail how these matrices are built is beyond the scope of this chapter. Please see the references at the end.

Sequence Databases and Statistical Significance of Alignments

Because of technological advances in DNA sequencing, an enormous amount of sequences for various organisms has been generated in the past 20 years. Moreover, the pace of sequencing is still exponential, which means that the current number of available sequences, however large, will be small compared with what we can expect to have a few years from now.

This situation has led to the creation of the so-called public sequence databases. These are DNA and protein sequence repositories, of which there are many around the world. One of the most important is called GenBank and is maintained by the National Center for Biotechnology Information (http://ncbi.nlm.nih.gov). A separate chapter would be needed to give a full description of the sequence data kept in GenBank and the ways it can be accessed. More information can be found in the chapter about GenBank in the NCBI Handbook. For the purposes of this chapter, however, we will simply consider that GenBank provides a catalog of sequences containing millions of entries; we will call this catalog simply "the database", and the process of comparing one sequence against all sequences in the catalog as the database search problem.

This introduction sets the stage for the problem that we would like to consider in this section: what to do if we have a new sequence s and want to know whether it is similar to another sequence present in the database. This has become an extremely important problem, precisely because of the usefulness of sequence comparison as indicated at the beginning of this chapter. Given what we have already covered, there is a simple (but naïve, as it will turn out) answer to the problem: simply compare s with every sequence t present in the database using the dynamic algorithm program that we have learned. This approach, however, immediately raises two significant questions (1): how long will this comparison take? and (2) how can we tell which database sequences are significantly similar to our sequence? Let us address each of these problems in turn.

Let us do a back-of-the-envelope calculation to see what kinds of running times we would get, were we to apply the dynamic programming algorithm for pairwise sequence comparison to the database search problem. Let us assume that each pairwise comparison takes one-tenth of a second (a fair assumption in a comparison of two 1-kb-long DNA sequences, for a very good implementation of the Smith-Waterman algorithm, running on a top-of-the-line desktop), and that the database contains one million sequences (also fair). Then, to do just one query to the database (that is, compare one sequence with all others) would take us 100,000 seconds, or almost 28 hours. In today's world, this is unacceptably long. This calculation does not even take into account that the Smith-Waterman algorithm runs in a time proportional to the product of the sequences being compared. In other words, for sequences twice as long, the algorithm will take four times as long, roughly speaking. This state of affairs was recognized early on in the history of sequence databases and led to the development of more efficient algorithms for database sequence comparison. The best known of these is called BLAST and will be the subject of the next section. Before we describe BLAST, however, we will turn to the second question raised above.

We have mentioned two similarity concepts: the percent identity and the similarity based on a scoring system. Percent identity is reasonably intuitive, as already pointed out: if two sequences are sufficiently long and have 90% identity, then there are excellent grounds to believe that this is not by chance (in other words, that the genes that they represent are homologous; but remember the caveats from the second section). On the other hand, a number such as 20% identity seems too low. To see this, let us determine the expected percent identity between two totally random DNA sequences of exactly the same length. If we do not allow gaps, then there is just one possible global alignment. In each column of this alignment, there are exactly 16 possibilities (4 possible nucleotides above times 4 possible nucleotides below). Of those 16 possibilities, 4 produce identities. So we can expect that in each column, there is a 25% chance of a match, and this is the expected percent identity in an ungapped alignment between two random sequences of the same length. Although this reasoning is extremely simplified, it can be concluded that for DNA sequences, an optimal alignment with percent identity not much larger than 25% is indicative that the two sequences are not related.

We can be more precise about these statements, but it turns out that percent identity is less used and therefore less important than similarity based on a scoring system. However, in such a system, the similarity number is much less intuitive. The score between the two cytochrome c sequences obtained with the BLOSUM62 matrix is 339, as we saw in the previous section. What does this number mean? Is it good or bad? Just as we did for the percent identity, we need to compare this number to numbers that would have been obtained by chance alignments. In other words, we have to attach some statistical significance to alignments. A statistically significant alignment will be one that is unlikely to have arisen just by chance. The more significant the alignment is (and the significance must be quantified), the higher the probability is that it is not a chance alignment.

The theory of statistical significance of alignments is beyond the scope of this chapter. We will limit ourselves to describing, in the next section, how the BLAST algorithm for the database search problem deals with statistical significance of the alignments it finds.

BLAST

BLAST is an algorithm/program for sequence comparison first published in 1990 (7). The name is of course catchy; moreover, it is an acronym for basic local alignment search tool. As already mentioned, it was developed specifically for database searches, although it can be used to compare any two sequences. BLAST is much more efficient than the Smith-Waterman algorithm, but this efficiency is achieved at a small price in terms of quality of results.

In this section, we give an overview of BLAST. It should be enough to give readers a basic understanding of how BLAST works, but it comes nowhere near a full explanation of the BLAST algorithm and associated statistical theory. Interested readers should refer to the bibliography to learn more or to the chapter about BLAST in the NCBI Handbook.

Before we start with technical details about BLAST, it is worth mentioning a few facts about it. In 1997, a new version of BLAST was published (8). The two references cited are among the most widely cited references in the technical literature of science. This is an indication of the usefulness of BLAST (and more generally of sequence comparison) for modern research in molecular biology and related fields. Another fact is that the concepts of algorithm and program are intimately connected for BLAST. As explained above, the concepts of algorithm and program are different. Usually, there will be an algorithm with many different implementations of it in particular programming languages. In the case of BLAST, the two notions go together: for practical purposes, there is no distinction between the algorithm and the program. Finally, BLAST has become a family of programs, and in this chapter we will be primarily interested in its most basic versions. Please note also that the BLAST program can be run on the NCBI website (as well as from many other sites), as well as downloaded (free of charge) from the same site and run locally.

The first technically important fact about BLAST is that, as its name says, it is a local alignment algorithm. Another important fact is that BLAST is not guaranteed to find the optimal alignment (best score) for any given comparison. This is the price paid for its efficiency. Let us understand what this means. We have seen that the DP algorithm is guaranteed to find the best alignment between any two sequences given a scoring scheme. As we have also seen, the best alignment is not necessarily statistically significant. On the other hand, there may be more than one statistically significant alignment between two sequences. This can happen because we are dealing here with local alignments. One "small" local alignment may already be significant, and if it is enlarged, its similarity may increase and so probably does its significance. In the context of BLAST, this means that although for a particular pairwise comparison BLAST is not guaranteed to always find the best alignment, almost always it will find a statistically significant alignment between the two sequences if one exists. Practice has shown that even with this caveat, BLAST is extremely useful, and the price of its efficiency is certainly worth paying. (Technically speaking, the lack of a guarantee to always find the best alignment means that BLAST is a heuristic, as opposed to an exact algorithm, as is the case of dynamic programming.)

BLAST is based on the concept of seed alignment. Given an alignment between two highly related sequences (such as the cytochrome c sequence in human and fly), it becomes immediately apparent that they share regions of consecutive matches (in that example, there are seven regions with five or more matches). The idea behind BLAST is based on this observation: given two sequences, first try to find a segment of consecutive matches (or a segment of high score, when comparing protein sequences) between the two; and then try to extend this alignment (the seed alignment) to the left and to the right. The extension proceeds until the inclusion of additional columns to the alignment fails to improve the overall score. If the current score is above a certain threshold, the resulting alignment (a high scoring segment pair (HSP) in BLAST terminology) is output.

Much of the speed of BLAST comes from a highly efficient way of finding seed alignments. The description that follows applies to protein sequences. The key concept here is (again in BLAST terminology) the word. A word is a string of k letters. Given a sequence of length n, it will contain n − k + 1 words of length k. For example, the sequence QCHTVEKGGKHK contains the following words of size k = 3: QCH,CHT,HTV,TVE,VEK,EKG,KGG,GGK,GKH,KHK.

All k-words for a given query can be easily determined; and for each such k-word, all k-words that have an alignment score with it above a certain threshold (using an amino acid scoring matrix) can also be efficiently found and stored (let's call this the query word list). All k-words for all sequences in a database can be precomputed and also stored in a special table. Thereafter, it becomes a matter of finding which of the database words appear in the query word list. Using sophisticated computer science data structures and programming tricks, BLAST is able to efficiently find all k-words of the query that have an alignment score above a certain threshold. The time BLAST takes in this process grows linearly with the size of the query word list as well with the size of the database. In the case of a protein sequence search, the default value for k is 3. For nucleotide comparisons, k = 11 (although the method for finding high-scoring word pairs is different).

The running time complexity of BLAST depends not only on the time spent finding seed alignments but also on the time extending or combining these alignments. I am not aware of a rigorous running time complexity analysis of the complete BLAST algorithm. In practice, however, it runs very fast, perhaps 50 times faster than a good Smith-Waterman implementation for a typical protein-protein comparison.

We will now explain some additional key concepts that users of BLAST should know about. To help assess the quality and significance of an alignment, BLAST reports three types of scores: the raw score, the bit score, and the expect value or e-value. The raw score is simply the score of the alignment based on the scoring system and scoring matrix used.. The bit score is computed from the raw score using a formula that normalizes it with respect to the scoring system; therefore bit scores from searches using different scoring systems are comparable. Finally, the e-value gives the statistical significance of the alignment. Its meaning is: the number of alignments of the same size that could be expected to be obtained purely by chance from a database of a given size using the given scoring matrix. In other words, an e-value of 1 means that we could reasonably expect that the alignment we obtained is due to chance and, therefore, is not biologically meaningful. An e-value of 0.00001 (which is 10⁻⁵, denoted by e-5 in BLAST) suggests that the alignment obtained has a much lower probability of having arisen by chance. An e-value of e-100 suggests that the database hit that was found is almost certainly a homolog.

Note that the concept of e-value is not specific to BLAST. It is perfectly possible to compute e-values for alignments obtained by dynamic programming. It is beyond the scope of this chapter to explain how bit scores and e-values are derived, but let us see what scores and e-values we obtain for the cytochrome c alignments that we have been using as examples (which were obtained by BLAST). First, the human and mouse comparison (the format follows usual BLAST reports):

For this alignment, we obtained a raw score of 517, a bit score of 203, and an e-value of 3e-53, or 3 ×10⁻⁵³, as can be seen in the first two lines. Note that hits in the database are called "subject sequences" in BLAST terminology.

The next alignment is human and fly:

Finally, human and worm:

Notice how the e-values are increasingly larger with increased phylogenetic distance between the organisms (although the absolute value of the exponents -53, 45, 33 in the examples- decreases; this is the source of some confusion). But even at 5e-33, the alignment is still highly significant, which allows us to say with confidence that the two sequences should be homologs. e-values up to e-5 in general suggest homology. From e-5 to e-1, the alignments have to be considered with extreme caution. e-values greater than 0.1 in general suggest that the respective alignments are due to chance. However, please remember that there are a lot of parameters involved (database size, gap costs, scoring matrix, and so on), so what you get is highly dependent on the values used for all of these parameters. In particular, e-values resulting from BLAST searches using different databases are not directly comparable, because the e-value depends on the database size. There are formulas that allow the conversion of an e-value obtained in a query against a database of a certain size to a new value assuming a different database size; such formulas (not covered here) allow an e-value to be given for the BLAST comparison of one sequence against just one other sequence, assuming for example the entire GenBank as the database.

Another important concept for BLAST users is the notion of sequence complexity. Analysis of database sequences has shown that certain subsequences contain repetitive elements or appear with relatively high frequency as parts of other sequences. Because of this, it is not uncommon for the local alignment of such sequences or regions to be due entirely to chance. Such sequences are given the designation of low complexity sequences. An example of a low complexity DNA sequence is AAAAAA (which is more or less obvious); an example of a low complexity amino acid sequence is AGNLLGRNVVVVGAG (which is far from obvious). To avoid the (possibly) artificial increase in score in an alignment that would be caused by the presence of low complexity sequences, BLAST provides a low complexity filter, meaning that even if present, a sequence classified as low complexity does not contribute to the total score. When the low-complexity filter is used and a low-complexity sequence is present in the alignment, it appears masked in the result with the character N (if nucleotide) or X (if amino acid); or the low complexity region is highlighted in a different color and in lowercase. See an example below.

Note that sometimes the use of the low-complexity filter will unduly lower the score; this can happen when the entire alignment is not due to chance, and both query and subject sequences possess a matching region classified as low complexity. To avoid this problem, BLAST provides a soft masking alternative: low-complexity sequences are masked during the seed alignment phase but are present normally during alignment extension.

The reader should bear in mind that even with a stringent e-value threshold and use of the low-complexity filter, there can still be false matches (also called false positives). A partial example is given below:

Only the first two rows of the alignment are given. Notice that the e-value appears to be very good. But then note that there seems to be a high frequency of leucines in both the query and the subject. In fact, both query and subject are leucine-rich repeat proteins, although one of them is found in the bacterium Leptospira interrogans and the other is found in mouse (Mus musculus). It is unlikely that these two proteins are homologs. Turning on the low-complexity filter will not eliminate this false positive, as can be seen below (low-complexity sequences appear in lowercase; only the first two rows of the alignment are shown):

The score has decreased, but the e-value still suggests the alignment is significant.

As we have seen, BLAST has an extension phase, when seed alignments are extended with the goal of improving the score. During this phase, gaps may be introduced, and gap inclusion incurs penalties on the score given by an affine function, as described in the section Finding the Similarity. BLAST users can change the values of the constants for gap opening (G) and gap extension (E). The default values for proteins are G = 11 and E = 1. (BLAST convention gives these constants as positive values, but they are penalties exactly as described in the section Finding the Similarity.) However, only certain combinations of values for G and E are allowed, and they depend on the scoring matrix being used. For example, in the case of BLOSUM62, G can vary in the range [6,12], whereas E can be either 1 or 2. Such limitations are necessary to preserve the statistical properties of alignments.

We finish this section by mentioning BLAST types and BLAST variants. Biological sequences are of two kinds: nucleotides and proteins. BLASTN is the type of BLAST used to compare nucleotide sequences. BLASTP is used to compare protein sequences. In addition to these comparisons, it turns out that it is extremely useful to be able to compare nucleotide sequences to protein sequences. Today it is much easier and cheaper to sequence DNA than it is to sequence a protein directly. Therefore, a crude (but effective) method to determine whether a DNA sequence codes for a protein is to translate the DNA sequence in all of its six frames and compare each of those translations to known protein sequences. BLASTX is a type of BLAST that does exactly that: given a DNA sequence, it translates it in all six frames and compares each translation to the protein sequences in the database. TBLASTN is similar, but the query is assumed to be protein and the database is assumed to be in nucleotides. Finally, TBLASTX compares query nucleotide sequences against database nucleotide sequences but translates both query and database sequences in all 6 frames before comparing them. It is the most costly (in computer time) of the BLAST family of programs, as one would expect.

In addition to these types of BLAST, there are many BLAST variants. These are programs inspired by BLAST but that use different algorithms and/or perform specialized sequence comparisons. Two noteworthy ones are:

MegaBLAST: It is for nucleotide comparisons only, and it is aimed at comparing sequences that are already known to be similar. In such cases, MegaBLAST can be much faster than BLASTN and is able to handle much longer sequences.
PSI-BLAST: It stands for position-specific iterated BLAST and is meant for protein sequences. The full understanding of this variant requires the concept of position specific scoring matrix, which is beyond the scope of this chapter. However, the idea is as follows: given a set of related protein sequences (a family), it is possible to align all of them together (see below for mulTiple alignment) and capture the variations observed in each column (or position) in a matrix. With such a matrix, we can search for additional members of the family in a much more sensitive fashion than we would with each member of the family individually as a query. We can then use the members found to refine the matrix and repeat the process. This iterative search process will likely increase the sensitivity even further. This is the processing that PSI-BLAST offers.

A Note on Global Alignments

We have seen the difference between global and local alignments; we have also seen that BLAST finds local alignments. For most sequence comparison needs, a local alignment is all that is needed; and in many cases BLAST (and several other local alignment programs) returns alignments that are "nearly" global (in the sense that the alignment includes nearly the whole of query and subject). That was the case with the cytochrome c alignments that we have shown previously. There are, however, situations where a strict global alignment is required. One such example is the comparison of proteins from the same protein family. We need to know how different members of the family relate to each other in an exact quantitative fashion; if one member is shorter than another, there has to be a penalty that will account for the unmatched portion of the longer one. Only with a global alignment will we be able to achieve this.

Sequence Comparison for Very Large Sequences

As we have seen, the running time of the dynamic programming algorithm for sequence comparison is proportional to the product of the lengths of the sequences being compared. The main reason for this, as we explained, is the size of the DP matrix. This matrix needs to be stored in the main computer memory (RAM) and takes a lot of space for large sequences. Although there are certain tricks that can decrease the amount of memory required, when comparing very large sequences, the bottleneck in using the DP algorithm is usually lack of memory and not excessive time. This means that even when we can afford the time to execute the DP algorithm, our computer may not have enough memory to run it. This in turn means a limit on the sizes of sequences that we may be able to compare using DP.

The BLAST algorithm/program was not designed to compare large sequences, either. We have already mentioned the variant MegaBLAST, which is able to handle much larger sequences. However, what exactly do we mean by "large sequences"?

The typical protein has around 300 amino acids, but there are some that are 10 times longer. A prokaryotic protein-coding nucleotide sequence for a 3,000-long protein sequence will be 12 kb long. Both DP and BLAST can handle these kinds of sequences without any problems. Problems begin to appear when we want to compare full genomes, or entire chromosomes. These are sequences that have millions of nucleotides, and DP, BLAST, and even megaBLAST will "choke" if asked to compare sequences this long.

Given that full genome sequences have become common, and that comparing them is just as important as it is to compare individual genes, what are we to do? The answer lies in the clever use of special computer science techniques and data structures. One such structure is the suffix tree. It is beyond the scope of this chapter to explain what a suffix tree is. Suffice it to say that it allows memory- and time-efficient detection of exact substring matches between two sequences. It is an idea similar to the seed alignment of BLAST, but with a completely different implementation.

A well-known program that uses the suffix tree technique is called MUMmer (10). This program finds all Maximal Unique Matches (MUMs) with a user-defined minimum length between two DNA sequences. The important thing to remember is that a MUM is an exact match. But just as is the case of BLAST, experience has shown that long approximate matches (those that contain mismatches and indels) contain regions where exacts matches are found. So MUMmer creates the whole alignment by finding only the component exact matches (the program also tries to link neighboring exact matches by applying the Smith-Waterman algorithm). Thanks to a carefully crafted suffix tree implementation, MUMmer is able to align whole bacterial genomes or entire eukaryotic chromosomes in a matter of minutes using reasonable amounts of computer memory (4 Gigabytes seems to be enough for most applications), assuming the default length of a MUM (25 nucleotides).

MulTiple Sequence Comparison

This section will give a basic introduction to the vast subject of mulTiple sequence comparison (MSC). MSC is a natural extension to the concept of pairwise comparison that we have dealt with so far. Given k > 2 sequences, we want to know how similar they are to one another and obtain a mulTiple alignment, which is the obvious extension of the pairwise alignment that we are already familiar with. For example, Figure 5 shows a mulTiple alignment of several cytochrome c sequences.

Figure 5

MulTiple alignment of cytochrome c sequences. C, Caenorhabditis elegans (the worm); Y, Saccharomyces cerevisiae (yeast); A, Arabidopsis thaliana (a model plant); D, Drosophila melanogaster (the fly); H, Homo sapiens; M, Mus musculus (mouse). The format (more...)

There are two questions that we will address here. The first is: Why would anyone want to compare mulTiple sequences and create a mulTiple alignment? Can't we get by with pairwise alignments only? The second is (assuming there is a good answer to the first!): How can we create a mulTiple alignment, and preferably a "good" one?

For the first question, the answer depends on the kinds of sequences that one wants to compare. A simple case is illustrated by the mulTiple alignment in Figure 5. From the pairwise alignments we already know that the input sequences are related. The primary reason we want to align them all together is to see what they have in common. Such information is not easy to derive from the pairwise alignments. For example, a simple visual inspection of Figure 5 immediately tells us that all cytochrome c sequences have one section of 11 amino acids that is exactly the same in all of them (NPKKYIPGTKM), although the sequences belong to extremely different biological species. Such information can only be obtained by a mulTiple alignment and is extremely valuable to those studying the evolution of proteins. Incidentally, such studies usually lead researchers to build phylogenetic trees, and the basis of most phylogenetic tree reconstruction methods is a mulTiple alignment.

Another situation arises when we are not sure whether certain sequences are indeed related to one another. Each pairwise score or e-value may be weak. By comparing all sequences together, we may discover that all of them share certain regions, which will give us much more confidence that they are indeed related. This situation arises when we are comparing homologous sequences from species that are quite distant to each other (as is the case with the species used in the mulTiple alignment of Figure 5) but that are not so well conserved (as is not the case with cytochrome c, which is very conserved even among distant species, as the alignment shows us).

Because of the need for mulTiple sequence alignments, we will give a brief sketch of the methods used to perform such comparisons. The simple dynamic programming algorithm that we have described can be extended to compare k > 2 sequences. We will not explain how this can be done; please see the references. The resulting time and space complexity are proportional to at least 2^kn^k , where n is the length of each sequence (assuming all are of the same length). This means that DP is impractical in most cases. The alert reader will probably now ask: isn't there the equivalent of BLAST for mulTiple sequence comparison? The answer is, yes and no.

It is "no" because there is no such thing as a BLAST-based or BLAST-inspired program for mulTiple sequence comparison. The answer is "yes" in the sense that there are heuristic programs for MSC that are able to tackle the MSC problem for relatively large values of k (for example, 50) and relatively large values of n (for example, 5 kb). However, such programs do not have guarantees that they will find the optimal mulTiple alignment, more or less in the same sense that BLAST is not guaranteed to find the optimal pairwise alignment.

One additional fact about mulTiple alignments worth knowing (at this introductory level) is how they are scored. The most basic scoring system for mulTiple alignments is an extension of the pairwise scoring system that we have described and is called sum of pairs. For each column in the alignment, we add all pairwise scores of characters present in that column. So for example, if a column has two isoleucines and one valine, the score of that column will be score(I,I) + score(I,V) + score(I,V). The one exception is that if there are spaces in a column, we do not charge for the situation in which a space is paired with another space. The total score of the alignment will be the sum of the column scores.

There are several programs that perform mulTiple alignments. Three of the best known are: clustalW (11), T-coffee (12), and MUSCLE (13). The introduction given in this chapter should be enough for the interested reader to be able to understand the outputs of these programs. Please note, however, that these programs are all heuristics, and that humans are still able in some cases to improve mulTiple sequence alignment results given by these programs by editing the alignments directly. To do so, however, requires considerable expertise in the structure and evolution of proteins.

Practical Section

The following are solved exercises as well as Tips and tricks for the application of the theoretical knowledge covered in the theoretical portion of this chapter.

1. Why sequence comparison is important

Identity, similarity, percent identity

Q1.1:Consider the following alignment of the first part of two hemoglobin sequences, one human (H), one mouse (M):

Legend:

| identical

: very similar

. similar

As a first measure of similarity, what is the percent identity?

A1.1:

The percent identity is 70% (7 matches over 10 positions).

Q1.2:

You see 4 different alignments reported. Which alignment is the best?

100% (10/10)

91% (95/104)

74% (80/108)

53% (59/111)

A1.2:

It is important to not only consider the percent identity but also the length of the alignment. In our case the best alignment would be b). This is why alignment programs report scores (not shown here) which take both (matches/mismatches and length of the alignment) into consideration.

As a general rule of thumb we can consider the following alignments significant:

For DNA sequences:

more than 70 % identity over more than 100 bases.

For proteins:

more than 25 % identity over more than 100 amino acids.

Most importantly: have a look at the actual alignment and decide if it has any biological significance. Amino acid sequences after all make up three-dimensional proteins. A very good alignment of two coil regions tells you that these two proteins look similar whereas a good alignment of the catalytic residues tells you that the proteins probably have the same function.

2. What is sequence similarity

Score DNA sequences

Q2.1:

Calculate the score for the following alignment:

Each match scores 5 points, each mismatch costs a penalty of 4 points.

A2.1:

The score is 0.

12 matches (12 x 5 = 60) plus 15 mismatches (15 x −4 = −60).

Homology

Q2.2:

Is it correct to say that human and mouse cytochrome c are 91% homolog?

A2.2:

Homology is a hypothesis about a common ancestry of two sequences. Two sequences may have descended from a common ancestor or not. In the former case, they are homologues, and in the latter not. Therefore, there is no sense in stating homology in terms of percentage. It is either true or false. In addition, homology cannot be inferred using sequence similarity as the only criterion.

3. Finding the similarity between two sequences by dynamic programming

Gap penalties

Q3.1:

Now let's look at a complete alignment, but this time you will do the alignment of the following hemoglobin sequences:

Either copy them from this document or retrieve them from the NCBI (http://www.ncbi.nlm.nih.gov/). At the NCBI search `protein' for either the GI number (e.g. 4504347) or the accession number (e.g. NP_000549.1):

On the results page set the display to `FASTA':

Align the two sequences over their complete length (global alignment), using `needle` (http://www.ebi.ac.uk/emboss/align/), an implementation of the Needleman-Wunsch algorithm.

Perform the alignment a number of times, varying the parameters in the following way:

stick to the default parameters - Gap_penalty: 10.0, Extend_penalty: 0.5
be conservative (i.e. gaps are a very rare event) - Gap_penalty: 100.0, Extend_penalty: 10.0
be very flexible (i.e. gaps can occur frequently) - Gap_penalty: 1.0, Extend_penalty: 0.1

Have a look at the resulting alignments (percent identity, percent similarity, percent gaps, score).

Note that each result of `needle' tells you exactly, what you have done, thus making it easy to reproduce these results.

Don't forget that we are looking at proteins. Their function is based on their 3D structure. Introducing gaps in an alignment means that at that position an insertion happened. An insertion in a helix structure is much more disruptive than in a coil structure.

What do you observe?

A3.1:

At the top is the summarized alignment and gives you a general idea:

# Gap_penalty: 10.0

# Extend_penalty: 0.5

# Length: 142

# Identity: 122/142 (85.9%)

# Similarity: 131/142 (92.3%)

# Gaps: 0/142 ( 0.0%)

# Score: 648.0

Further down you can find the actual alignment. Please check it and don't rely on the summary alone.

The default parameters produce a good alignment.

# Gap_penalty: 100.0

# Extend_penalty: 10.0

# Length: 142

# Identity: 122/142 (85.9%)

# Similarity: 131/142 (92.3%)

# Gaps: 0/142 ( 0.0%)

# Score: 648.0

This is a conservative approach. As a) already didn't contain any gaps nothing changes here. This means our alignment is quite robust.

# Gap_penalty: 1.0

# Extend_penalty: 0.1

# Length: 146

# Identity: 124/146 (84.9%)

# Similarity: 133/146 (91.1%)

# Gaps: 8/146 ( 5.5%)

# Score: 656.0

This is a flexible approach. A number of gaps get introduced. The score of this alignment is higher than in a) or b) but does this make biological sense? In a) and b) we postulate that insertion/deletions are rare and differences in the sequences are due to mutations. In c) we say that insertions/deletions happen quite frequently. It's your biological knowledge that in this case has us preferring a)/b) over c).

Local alignment

Q3.2:

Use our mouse (GI 145301578) and human (GI 4504347) hemoglobin sequences. Find the best regions of similarity between the two sequences using the program `water' (http://www.ebi.ac.uk/emboss/align/), an implementation of the Smith-Waterman algorithm. Compare the results with the alignments generated with the program `needle'.

A3.2:

The two sequences can be nicely aligned over their entire length. The best region of similarity stretches from one end of the sequence to the other. In this case global and local alignments present the same result.

Q3.3:

Let's stick with mouse hemoglobin and align mRNA (GI:145301577)and genomic (GI:49899) sequence using `water' (http://www.ebi.ac.uk/emboss/align/). Vary the gap penalty in the following way:

stick to the default parameters
be conservative (i.e. gaps are a very rare event)
be very flexible (i.e. gaps can occur frequently)

What do you observe?

A3.3:

There are regions of aligned characters and regions of gaps. These correspond, respectively, to exons and introns.

Gap_penalty: 10.0

Extend_penalty: 0.5

3 exons, 2 introns, nicely separated

Gap_penalty: 100.0

Extend_penalty: 10.0

1 exon, 1 intron, matches, mismatches and gaps all mixed.

Open gap penalty 1.0

Gap extension penalty 0.1

3 exons, 2 introns, nicely separated

Tip

Sim4 (http://pbil.univ-lyon1.fr/sim4.php) is a specialized tool for aligning expressed DNA sequences with their genomic sequences. A slightly newer tool is Exonerate. It can be used at http://sss.hgc.jp/ as part of the Sequence Similarity Search service of the Human Genome Centre, Japan. The program runs faster than sim4 but with the same accuracy.

Make sure that you are using alignment tools appropriate for your organism of research. E.g. sim4 and exonerate are meant to be used for eukaryotes with their intron/exon structure and not for prokaryotes.

Global alignment

Q3.4:

Now use the mouse mRNA (GI:145301577) and genomic (GI:49899) sequences for hemoglobin and align them using `needle` (http://www.ebi.ac.uk/emboss/align/). Vary the gap penalty:

stick to the default parameters

be conservative (i.e. gaps are a very rare event)

be very flexible (i.e. gaps can occur frequently)

What do you observe?

A3.4:

There are regions of aligned characters and regions of gaps. These correspond to exons and introns.

# Gap_penalty: 10.0

# Extend_penalty: 0.5

Alignment covers the complete length of both sequences.

3 exons, 2 introns, nicely separated

# Gap_penalty: 100.0

# Extend_penalty: 10.0

To be more conservative we assign high gap penalties. Now the alignment doesn't cover the complete length of both sequences.

1 exon, matches, mismatches and gaps all mixed.

# Gap_penalty: 1.0

# Extend_penalty: 0.5

To be more flexible we reduce the gap penalties to very small values. Now, the alignment covers the complete length of both sequences.

3 exons, 2 introns, nicely separated

NOTE: The scores used in b) and c) are examples and should not be considered universal values for being conservative or flexible.

Dotplot

A great tool to give you an overview of all possible alignments of two sequences is Dotlet (http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html). It takes your sequences and writes one sequence along the x-axis and the other one along the y-axis. Next, it simply checks what residues both sequences have in common and puts a dot down for each match. A perfect alignment consists of an uninterrupted line of dots (matches). It allows you to consider neighbouring residues by specifying a word size.

To illustrate the underlying principle check this little example (each match is represented by the letter x):

The best local alignment is the longest diagonal (represented in red for clarity) which correspond to the alignment:

Dotplots of single bases tend to have too many positions filled up. We can use the same technique using a sliding window of bases, instead of comparing one base at a time. In the sliding window version, a sequence is scanned and converted into tuples and a comparison of tuples is compared instead of single bases. In the example above, using a window of two bases, we will have, for the first sequence the pairs

AA, AC, CT, TA, AG, GC

and for the second sequence the pairs

CG, GA, AT, TC, AA, TG, GG, GC

The effect of using a window is to lower the resolution of the dotplot. This is particularly useful if you are aligning two large sequences and are interested only in larger alignments. In our example, the dotplot with a window of two bases will be

Please note that an x is put in the diagram if there is a regular match or if the pair matches its reverse (NOT reverse complementary)

For our mRNA and genomic mouse hemoglobin sequences, we get the following result:

For more details, instructions and examples check the ` need help?' and `learn by example' links.

4. Amino acid scoring systems

BLOSUM number

Q4.1:

In Exercise 3.1, you aligned mouse (GI:145301578) and human (GI:4504347) hemoglobin sequences using the default scoring matrix BLOSUM62. Knowing that these hemoglobins diverged recently you want to fine-tune your alignment by using another scoring matrix. Would you go for a BLOSUM matrix with a higher or one with a lower number?

A4.1:

Recall what you've learned in the theoretical part of this chapter. Matrices help the computer to calculate the best alignment. A high BLOSUM matrix was created from a group of protein sequences with a high percent of similarity. In other words: the higher this percentage, the closer these sequences are (or the more recent they diverged). The better a chosen matrix fits your sequences, the better your result will be. While the choice of the appropriate scoring matrix increases the specificity and sensitivity of your alignments, it is still up to you to verify the biological significance of the corresponding result. If we would be looking for long and weak alignments we would choose a low BLOSUM matrix like BLOSUM40.

Matrices

Q4.2:

Use our mouse (GI:145301578) and human (GI:4504347) hemoglobin sequences. Align the two sequences over their complete length (global alignment), using `needle` (http://www.ebi.ac.uk/emboss/align/).

`Needle`, as implemented by EMBOSS, allows you to use three different BLOSUM matrices: 62, 50 and 40. Other matrices are available in other implementations. Perform the alignment a number of times, varying the scoring matrix every time:

stick to the default BLOSUM62 matrix
proteins have diverged some time ago(use a BLOSUM slightly lower than 62)
proteins have diverged a long time ago (use the lowest BLOSUM available)
For proteins that diverged very recently use a BLOSUM higher than 62.
Note: `Needle` does not offer matrices higher than BLOSUM62

Have a look at the resulting alignments (percent identity, percent similarity, percent gaps, score).

Note that each result of `needle` tells you exactly, what you have done, thus making it easy to reproduce these results.

Next, repeat the exercise with the human hemoglobin (GI:4504347) and a worm globin (GI:17541936).

A4.2:

Mouse/human (diverged 100 million years ago)
# Matrix: EBLOSUM62
# Identity: 123/142 (86.6%)
# Similarity: 131/142 (92.3%)
# Gaps: 0/142 ( 0.0%)
# Matrix: EBLOSUM50
# Identity: 123/142 (86.6%)
# Similarity: 131/142 (92.3%)
# Gaps: 0/142 ( 0.0%)
# Matrix: EBLOSUM40
# Identity: 123/142 (86.6%)
# Similarity: 135/142 (95.1%)
# Gaps: 0/142 ( 0.0%)

a), b), c) do not really differ because our sequences can be aligned very robustly. Note that BLOSUM62 considers an Alanine (A) - Glycine (G) pair similar `.' whereas BLOSUM40 considers this very similar `..'.

Human/worm (diverged 1000 million years ago)

# Matrix: EBLOSUM62
# Identity: 29/370 ( 7.8%)
# Similarity: 49/370 (13.2%)
# Gaps: 256/370 (69.2%)
# Matrix: EBLOSUM50
# Identity: 33/372 ( 8.9%)
# Similarity: 57/372 (15.3%)
# Gaps: 260/372 (69.9%)
# Matrix: EBLOSUM40
# Identity: 33/372 ( 8.9%)
# Similarity: 64/372 (17.2%)
# Gaps: 260/372 (69.9%)

Notice that the position of gaps in the different alignments changes. The lowest BLOSUM available (BLOSUM40) gives the best results as humans and worms diverged much longer ago than humans and mice.

5. Sequence databases and statistical significance of alignments

Tip

It is advisable to search databases of the highest quality available and to restrict the search, if possible, just to the data you are interested in.

Let's say, for example, that we want to find human proteins similar to a mouse hemoglobin. We would then restrict the search to H. sapiens and use a high quality, manually curated, protein database like SWISS-PROT.

For DNA sequences I would suggest to leave out low quality high throughput sequences like ESTs and first try with a database like NR, which is non-redundant (i.e. it doesn't contain duplicate sequences).

6. BLAST

At http://www.ncbi.nlm.nih.gov/BLAST/ you will find a variety of BLAST programs. We can distinguish between BLAST for assembled genomes, basic BLAST and specialized BLAST.

Q6.1:

Use mouse cytochrome c protein (GI: 6681095) and search SWISS-PROT for related sequences. Report sequence similarity between mouse and chicken cytochrome c.

A6.1:

On the main BLAST page select `protein blast' from the `Basic BLAST' section. The BLASTP page opens up. You can run BLAST with as little or as much customization as you want. Each option is explained (click on the little question marks). Enter the GI number, select the database (SWISS-PROT) and make sure blastp is the selected program:

Notice that parameters that differ from the default (in our case the database choice) are highlighted in yellow and will be remembered for your next database search.

Hit the `BLAST' button to start the database search. You will get a confirmation message, which gets refreshed in regular intervals till the job is completed. Once the search is finished you get presented with a graphical overview of the results. This graphic is followed by a table which contains the same information. The sequence names are linked to their database entries, the scores are linked to the respective alignment with the query sequence, and the little coloured boxes at the end of the lines are links to related entries in other databases. As cytochrome c is a well conserved protein we get lots of very good hits covering the complete length of our query sequence. A perfect alignment has a high score, a low E-value and covers the complete query sequence.

Find chicken cytochrome c:

The two sequences are 96% similar.

Other options to explore on the BLAST results page:

Taxonomy reports
Click on `Taxonomy reports' to see how the database hits are distributed across different lineages, organisms, or taxa.
Mouse and chicken have been manually highlighted yellow.
Distance tree of results
Click on `Distance tree of results' to see how similar the different database hits are to each other:
Choose a different format for the results
To see only differences between your query sequence and database sequences, click on `Reformat these Results'. On the new page choose `Pairwise with dots for identities' as the `Alignment view':

The result is:

Tip

When you look at the various alignments a BLAST search produces,

first
	look at the statistical significance (E-value)
then
	at the score
then
	at the percent identity
and finally
	at the distribution of identical residues.

blast2seq

Q6.2:

Use our hemoglobin sequences from above (GI:4504347, GI:145301578) and find the right BLAST flavour (http://www.ncbi.nlm.nih.gov/BLAST/) to do a pairwise alignment of them.

Does the resulting alignment differ from the alignments we got using `needle'/'water`?

A6.2:

The option to use is `Align two sequences using BLAST (bl2seq)' (check the `Specialized BLAST' section at the bottom of the main BLAST page). As a rule, blast2seq will return results very similar to `water', since they are both performing local alignments. In our particular case, the result will also be similar to that of needle, even though this last tool performs a global alignment. This is due to the fact that our sequences are very similar over their entire length, making the "global" and the "local" alignments alike.

Word size

Q6.3:

The `word size` parameter determines the sensitivity of your similarity search. BLAST has a quick look at your query sequence and each database sequence. They have to have at least a stretch of sequence of `word size' in common (100% identity) to be considered for an alignment.

Your research found the sequence TTCTGA to be very interesting. Let's see if `word size` influences if BLAST finds this sequence. Use pen and paper and write down all BLAST words of length 11 for the following query sequence: CACTTCTGATTCTGA

A6.3:

CACTTCTGATTCTGA is composed of the following 11 letter words:

Our sequence of interest (TTCTGA) is only 6 nucleotides long. BLAST will only find it if your query sequence and a database sequences have the above word TCTGATTCTGA in common. If the database sequence reads e.g. TCTGGTTCTGA this will not be the case despite the fact that the sequences have TTCTGA in common.

Q6.4:

Now decrease the word size to 6 and write down all the words.

A6.4:

CACTTCTGATTCTGA is composed of the following 6 letter words:

This time you can be guaranteed that if your query sequence and a database sequence have TTCTGA in common it will be found.

Note: The more sensitive (i.e. the smaller `word size`) a search is the longer it will take to finish. The BLAST variant megablast uses a `word size' of 28 to increase its speed. Megablast is used for highly similar sequences and can therefore compromise some sensitivity. As a rule of thumb: do not use a `word size` bigger than the smallest feature you are looking for.

Bit score, raw score, E-value

Q6.5:

You've been reported bit score, raw score, E-value for the following 5 alignments.

Score = 155 bits (393), Expect = 4e-37

Identities = 79/150 (52%), Positives = 101/150 (67%), Gaps = 1/150 (0%)

Score = 74.3 bits (181), Expect = 2e-12

Identities = 36/105 (34%), Positives = 57/105 (54%), Gaps = 0/105 (0%)

Score = 347 bits (889), Expect = 1e-94

Identities = 167/167 (100%), Positives = 167/167 (100%), Gaps = 0/167 (0%)

d)

Score = 189 bits (480), Expect = 3e-47

Identities = 88/151 (58%), Positives = 114/151 (75%), Gaps = 1/151 (0%)

Score = 89.7 bits (221), Expect = 4e-17

Identities = 52/127 (40%), Positives = 68/127 (53%), Gaps = 5/127 (3%)

Sort them by relevance as BLAST would do. Using the rule of thumb, which alignments are significant?

A6.5:

The BLAST order would be c), d), a), e), b).

All alignments above are significant and worthwhile assessing the degree of similarity and the possibility of homology.

Remember the rule of thumb:

E-value smaller than e-5 are usually significant.
We are unsure if the e-value is between e-5 and e-1.
E-values higher than 0.1 are usually not significant.

Filtering

Q6.6:

You are looking for all similar nucleic sequences in human. Choose the correct BLAST flavour (http://www.ncbi.nlm.nih.gov/BLAST/) and search the following sequence against `human genomic plus transcript':

Switch all filtering options on
Switch filtering options off

How do the results differ?

A6.6:

Choose BLAST against the assembled human genome. Then choose megablast for a quick overview or blastn for higher sensitivity.

Our query sequence contained some low complexity regions. We are normally not interested in these repeats but rather in sequences that code for specific functions. With filtering switched on we concentrate on functional similarity, whereas with filtering switched off we concentrate on finding other sequences with repeats.

a) BLAST reports some hits. Click on the `Genome View' button to see how the database hits are distributed across the human genome:

After some time you will get:

`ERROR: An error has occurred on the server, Too many HSPs to save all'.

Filtering allowed us to get a manageable number of database hits.

BLASTN

Q6.7:

Search (http://www.ncbi.nlm.nih.gov/BLAST/) the NR database for similar sequences to the mouse hemoglobin sequence. What do you get reported? At what e-value would you draw a line between significant and non-significant hits?

A6.7:

We hit some sequences that are described as hemoglobin. Most sequences are clones without functional description. At least the first 100 hits are relevant. Their E-value is below e-5 and they all cover more than 100 residues.

Tip

Be very careful when you get hits on mulTiple chromosomes of the same genome as this normally doesn't indicate functional relationship.

Q6.8:

In a book you read about the reconstruction of extinct species. The author gives a DNA sequence and claims that it is from a mammoth. Please check (http://www.ncbi.nlm.nih.gov/BLAST/) the following sequence:

A6.8:

The author must have made a mistake as only cloning vectors get reported as hits.

BLASTP

Q6.9:

Search (http://www.ncbi.nlm.nih.gov/BLAST/) SWISS-PROT for similar sequences to the mouse hemoglobin sequence. What do you get reported? At what e-value would you draw a line between significant and non-significant hits?

A6.9:

You get back lots of other hemoglobin sequences. At least the first 500 hits are relevant. Their E-value is below e-5 and they all cover more than 50 residues.

Tip

Look for matches which cover structurally relevant distances. As a rule of thumb domains are 50 residues in size.

BLASTX

Q6.10:

You obtained the following DNA sequence:

You are interested in similar sequences that have a high-quality annotation. You decide to give SWISS-PROT a try. Which is the appropriate flavour of BLAST (http://www.ncbi.nlm.nih.gov/BLAST/)? What function does our DNA sequence have?

A6.10:

From the `basic BLAST' section choose BLASTX, as it translates your sequence in all six reading frames, looking for similarities at the amino-acid level. The sequence is a chicken hemoglobin sequence.

TBLASTN

Q6.11:

You managed to sequence the mouse hemoglobin protein. Now you want to know on which chromosome the gene coding for this protein is located. Use TBLASTN from the `Basic BLAST' section at http://www.ncbi.nlm.nih.gov/BLAST/ to compare your protein (GI:145301578) against NR translated in all six reading frames. Where is the mouse hemoglobin gene located?

A6.11:

When you click on the small blue box containing a G letter, just after the E-value of the first hit, you get information about the genomic location, which in this case is chromosome 11.

Protein vs. DNA sequences

In all these searches keep in mind that DNA sequences are less conserved than amino acid sequences, which in turn are less conserved than the structures of the corresponding proteins.

If you want to predict the function of a sequence and you know the open reading frame, use the encoded protein sequence for similarity searches. If you are interested in synonymous (silent) and non-synonymous mutations use the DNA sequences.

MegaBLAST

Q6.12:

We know that besides the mouse hemoglobin nucleotide sequence (GI:6680174), there are more hemoglobins in the NR database. To quickly pull them out we use MegaBLAST (http://www.ncbi.nlm.nih.gov/BLAST/). Search NR for similar sequences to the mouse hemoglobin mRNA sequence. What do you get reported? At what e-value would you draw a line between significant and non-significant hits?

A6.12:

We get lots of hemoglobin sequences from different species reported. At least the first 100 hits are relevant. Their E-value is below e-5 and they all cover more than 100 residues.

PSIBLAST

Q6.13:

You wonder what other protein families are closely related to hemoglobins. To find out you decide to use PSI-BLAST with mouse haemoglobin as query sequence (select `protein blast' from the `Basic BLAST' section, then select PSI-BLAST on the new webpage). PSI-BLAST performs a normal BLAST to find sequences similar to your query sequence (mouse hemoglobin) and combines them into a broader, more flexible description (hemoglobin protein family). It then uses this broader description to search for related sequences. This process can be repeated as often as you like (or until no additional new related sequences can be found). Which families seem to be closely related to hemoglobins?

A6.13:

The first iteration of PSI-BLAST is a normal BLAST and reports exclusively hemoglobins. Click on `Run PSI-Blast iteration 2' to summaries the hits and use this summary (also called position-specific scoring matrix) for the next round of searches. The second iteration reports some new sequences (yellow `NEW' tag). When having a look at them you will see that there are now also myoglobins, cytoglobins, and globins included.

Note: Keep in mind that incorporation of unrelated sequences contaminates the results and will lead to more and more unrelated sequences being found.

BLAST for primers

Q6.14:

You found the optimal set of primers for a PCR experiment:

LEFT PRIMER AAGGAAAAGGCCAGTAAGGC

RIGHT PRIMER CTCCCCAAGCCTCCTGTAG

To make sure they only prime at the desired site you decide to check that.

Which human gene are we working on? Are you happy with the specificity of the primers?

A6.14:

Tip

BLAST searches both strands for the best regions of similarity. Therefore, we do not have to perform two searches with our primer pair, but can simply concatenate the forward and reverse primer and check them in a single search:

>primer

AAGGAAAAGGCCAGTAAGGCCTCCCCAAGCCTCCTGTAG

Go to the BLAST program selection guide. To do that click `Help' on the main BLAST page:

Then select `BLAST program selection guide':

Then, follow the "Program selection table" link:

The `NUCLEOTIDE' table lists `Find primer binding sites or map short contiguous motifs'. Click on `Search for short.nearly exact matches'.

A BLASTN window opens. To speed up the search you can restrict yourself to human sequences (`Human genomic + transcript'. Note that as adjustment to short query sequences the expectation value is increased and the word size is decreased. In the results window check for hits covering the complete query sequence. You can see that your complete primer sequence gets reported as forward and reverse primer together with flanking features (target gene): fibroblast growth factor. Our target gene is on chromosome 12 and only here do both primers bind sufficiently close together.

Tip

Check where your results are positioned on the human genome by clicking the `Genome View' button (just above the BLAST graphical results). To find the correct hits check for high scoring hits (use the color key or the table of results).

More BLAST

For more information and exercises check the following resources (in that order):

Blast Quick Start (click on P or H for the self scoring exercises) (http://www.ncbi.nlm.nih.gov/Class/minicourses/blastex2.html)

Blast information guide (http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html)

The Statistics of Sequence Similarity Scores (http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html)

More Tips

Tip

For more complicated cases you should always use a specialized alignment tool and not a database search tool like BLAST which is optimized for speed and not for accuracy.

Tip

When reporting a very important result show the alignment which you have refined using an alignment tool as well as manual curation. All other information such as scores and E-values are just a more or less crude summary.

Tip

If you want to keep updated on similar sequences for your query sequence you can sign up for the free service myNCBI (http://www.ncbi.nlm.nih.gov/entrez/cubby.fcgi?call=QueryExt.CubbyQuery..ShowAll.

Tip

You can restrict your results not just to an organism, but to any dataset you would like. Just use NCBI's search tool Entrez to specify what you are looking for, and then choose `Limit by entrez query' from the advanced BLAST options. Click on the link to get instructions as to how exactly this works.

7. Sequence comparison for very large sequences

To perform rapid whole genome alignments, one can use Mummer. Mummer is not available as a web service, but can be downloaded for free from http://sourceforge.net/projects/mummer. Note that it cannot be installed on Windows. Once installed on a UNIX-like operating system you simply type in: `mummer sequence1 sequence2' to get two sequences aligned.

The following is a sample screenshot from Mummer's webpage and shows the comparison of two bacterial genomes:

8. MulTiple sequence alignment

Where to find ClustalW, T-coffee, MUSCLE

ClustalW can be used online at http://www.ebi.ac.uk/clustalw/. You can also download the software from the same site.

T-coffee can be used online at http://www.ch.embnet.org/software/TCoffee.html :

MUSCLE can be used online at http://www.bioinformatics.nl/tools/muscle.html:

Tip

From the MUSCLE webpage: "Evaluated on the BaliBASE benchmark (which includes 142 reference alignments and more than 1,000 sequences), MUSCLE delivers a higher score than T-Coffee (until now considered the most accurate technique) while requiring only 20 seconds of CPU time on a typical desktop computer. By comparison, T-Coffee requires 1,500 seconds. The popular ClustalW algorithm takes 170 seconds, but it generates a lower score (reflecting less accurate alignments) than either MUSCLE or T-Coffee." BAlibase can be found at http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE/

How to edit alignments

Computers do not do biology. If you are the biological expert, you should use your expertise to correct mistakes. Often there are several possible alignments with the same score and you can make sure the correct one is chosen. A free of charge alignment editor is BioEdit (http://www.mbio.ncsu.edu/BioEdit/bioedit.html) which runs on Windows. A screenshot of the program interface is shown below:

BioEdit even allows you to toggle between DNA or aminoacid sequence view.

Another free alignment editor is Jalview (http://www.jalview.org/), which runs on all platforms. One of the possible views looks like this and gives a quality measuring for each position within the alignment:

Search with family descriptions (signatures, patterns, HMMS)

MulTiple sequences in an alignment can be regarded as family members. If database searches for other similar sequences are not satisfactory, one can summarize a family in a variety of ways. The simplest one is to create a consensus sequence by representing each column in the mulTiple sequence alignment by its most frequent residue. To allow more flexibility as well as flexible distances between key residues, signatures/patterns can be described. To include the environment of each residue, Hidden Markow Models (HMMs) are used. For all these approaches specialized tools and databases are available. To get started visit http://ca.expasy.org/ and check the `Databases' and `Tools and software packages' section.

General Tips

Remember the server, the database, and the program version you used (in case you or somebody else wants to reproduce the results).
Write down the ID (identification) and AC (accession) number of your sequences.
Write down program parameters. Often these can be found in the results of programs or in log files. If this is not the case, write them down or make a screenshot.
Check results by using different tools which rely on different methods, change the program parameters. Artefacts tend to be not robust.
Use up to date databases.
Be aware that programs make mistakes and databases contain mistakes, therefore be an educated user.

References

Bibliography

1.: Setubal J, Meidanis J. Introduction to Computational Molecular Biology. PWS; 1997.
2.: Gusfield D. Algorithms on Strings, Trees, and Sequences. Cambridge University Press; 1997.
3.: Durbin R, Eddy S, Krogh A, Mitchison G. Biological Sequence Analysis. Cambridge University Press; 1998. [b7]
4.: Jones N, Pevzner P. An Introduction to Bioinformatics Algorithms. MIT Press; 2004. [b8]
5.: Mount D. Bioinformatics. 2nd ed. Cold Spring Harbor Laboratory Press; 2005. [b9]
6.: Smith TF , Waterman MS . Identification of common molecular subsequences. J Mol Biol. 1981;147:195–197. PubMed. [PubMed: 7265238]
7.: Altschul SF , Gish W , Miller W , Myers EW , Lipman DJ . A basic local alignment search tool. J Mol Biol. 1990;215:403–410. PubMed. [PubMed: 2231712]
8.: Altschul SF , Madden TL , Shaffer AA , Zhang J , Zhang Z , Miller W , Lipman DJ . Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. PubMed. [PMC free article: PMC146917] [PubMed: 9254694]
9.: Korff, I, Yandell, M, Blendel. J. BLAST. O'Reilly; 2003. [b10]
10.: Kurtz S , Phillippy A , Delcher AL , Smoot M , Shumway M , Antonescu C , Salzberg SL . Versatile and open software for comparing large genomes. Genome Biol. 2004;5(2):R12. PubMed. [PMC free article: PMC395750] [PubMed: 14759262]
11.: Chenna R , Sugawara H , Koike T , Lopez R , Gibson TJ , Higgins DG , Thompson JD . MulTiple sequence alignment with the Clustal series of programs. Nucleic Acids Res. 2003;31(13):3497–3500. PubMed. [PMC free article: PMC168907] [PubMed: 12824352]
12.: Notredame C , Higgins DG , Heringa J . T-Coffee: A novel method for fast and accurate mulTiple sequence alignment. J Mol Biol. 2000;302(1):205–217. PubMed. [PubMed: 10964570]
13.: Edgar RC . MUSCLE: mulTiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32(5):1792–1797. PubMed. [PMC free article: PMC390337] [PubMed: 15034147]
14.: Dayhoff MO , Schwartz RM , Orcutt BC . A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure. 1978;5:345–352. [b11]
15.: Henikoff S , Henikoff JG . Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 1992;89:10915–10919. PubMed.[b12] [PMC free article: PMC50453] [PubMed: 1438297]

Bookshelf ID: NBK6831

Contents

< Prev Next >

PubReader
Print View
Cite this Page
Setubal JC, Braeuning R. Similarity Search. 2006 Feb 10 [Updated 2007 Mar 9]. In: Gruber A, Durham AM, Huynh C, et al., editors. Bioinformatics in Tropical Disease Research: A Practical and Case-Study Approach [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2008. Chapter A05.
PDF version of this page (1.9M)
PDF version of this title (8.7M)

In this Page

Why Sequence Comparison Is Important
Caveats Regarding Similar Sequences, Evolutionary History, and Gene Function
What Is Sequence Similarity?
Finding the Similarity Between Two Squences by Dynamic Programming
Amino Acid Scoring Systems
Sequence Databases and Statistical Significance of Alignments
BLAST
A Note on Global Alignments
Sequence Comparison for Very Large Sequences
MulTiple Sequence Comparison
Practical Section
References

Related information

PMC
PubMed Central citations
PubMed
Links to PubMed

Recent Activity

Clear Turn Off Turn On

Similarity Search - Bioinformatics in Tropical Disease Research
Similarity Search - Bioinformatics in Tropical Disease Research

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

Bookshelf

Bioinformatics in Tropical Disease Research: A Practical and Case-Study Approach [Internet].

Chapter A05Similarity Search

Authors

Affiliations

Why Sequence Comparison Is Important

Caveats Regarding Similar Sequences, Evolutionary History, and Gene Function

What Is Sequence Similarity?

Finding the Similarity Between Two Squences by Dynamic Programming

Figure 1

Figure 2

Figure 3

Amino Acid Scoring Systems

Figure 4

Sequence Databases and Statistical Significance of Alignments

BLAST

A Note on Global Alignments

Sequence Comparison for Very Large Sequences

MulTiple Sequence Comparison

Figure 5

Practical Section

1. Why sequence comparison is important

Identity, similarity, percent identity

2. What is sequence similarity

Score DNA sequences

Homology

3. Finding the similarity between two sequences by dynamic programming

Gap penalties

Local alignment

Tip

Global alignment

Dotplot

4. Amino acid scoring systems

BLOSUM number

Matrices

5. Sequence databases and statistical significance of alignments

Tip

6. BLAST

Tip

blast2seq

Word size

Bit score, raw score, E-value

Filtering

BLASTN

Tip

BLASTP

Tip

BLASTX

TBLASTN

Protein vs. DNA sequences

MegaBLAST

PSIBLAST

BLAST for primers

Tip

Tip

More BLAST

More Tips

Tip

Tip

Tip

Tip

7. Sequence comparison for very large sequences

8. MulTiple sequence alignment

Where to find ClustalW, T-coffee, MUSCLE

Tip

How to edit alignments

Search with family descriptions (signatures, patterns, HMMS)

General Tips

References

Further Readings

Bibliography

Views

In this Page

Related information

Recent Activity