- •SEQUENCE ALIGNMENT
- •Sequence alignment produced with the freely available program ClustalW between two zinc finger
- •WHY DO SIMILARITY SEARCH?
- •WARNING: SIMILARITY NOT
- •BLAST
- •EXAMPLE BLAST QUESTIONS
- •IDENTIFYING SIMILARITY
- •NEEDLEMAN–WUNSCH
- •NEEDLEMAN–WUNSCH DETAILS
- •NEEDLEMAN–WUNSCH
- •ALGORITHM
- •SMITH–WATERMAN ALGORITHM
- •SMITH–WATERMAN
- •GLOBAL VS. LOCAL
- •COMPLEXITY
- •OTHER OBSERVATIONS
- •BLAST
- •BLAST is one of the most widely used bioinformatics programs, because it addresses
- •EXAMPLES
- •PROGRAM
- •OUTPUT
- •HOW IT WORKS?
- •BLASTP ALGORITHM (A PROTEIN TO
- •List all of the HSPs in the database whose score is high enough
- •EXTENSIONS
- •EXTENSIONS
- •USES OF BLAST
- •VIDEO LINKS
- •VIDEO LINKS
- •TUTORIALS
ALGORITHM
For each cell, compute
Match score: sum of preceding diagonal cell and score of aligning the two letters (+1 if match, -1 if no match)
Horizontal gap score: sum of score to the left and gap score (-1)
Vertical gap score: sum of score above and gap score (-1)
Choose highest score and point arrow towards maximum cell
When you finish, trace arrows back from lower right to get alignment
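The fill and traceback steps above can be sketched in Python. This is a minimal illustration, not a reference implementation: it assumes a match score of +1, a mismatch score of -1, and a gap score of -1, and it initializes the first row and column with multiples of the gap score, which is the usual convention for global alignment. The function name needleman_wunsch and its parameters are chosen here for illustration. Instead of storing arrows, the traceback re-checks which neighbour produced each score.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment sketch; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Border cells: a prefix aligned against nothing costs one gap per letter
    # (assumed convention, not spelled out in the steps above).
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Fill phase: each cell takes the best of diagonal, vertical, horizontal.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            diag = score[i - 1][j - 1] + s   # align the two letters
            up = score[i - 1][j] + gap       # vertical gap
            left = score[i][j - 1] + gap     # horizontal gap
            score[i][j] = max(diag, up, left)

    # Traceback from the lower-right cell back to the upper-left cell.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))  # (2, 'ACGT', 'A-GT')
```

When several alignments share the best score, this sketch reports only one of them (it prefers the diagonal move, then the vertical one).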
SMITH–WATERMAN ALGORITHM
The Smith–Waterman algorithm is a well-known algorithm for performing local sequence alignment; that is, for determining similar regions between two nucleotide or protein sequences.
Instead of looking at the total sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure.
Smith–Waterman is a dynamic programming algorithm.
SMITH–WATERMAN
The algorithm was first proposed by Temple F. Smith and Michael S. Waterman in 1981.
Modification of Needleman–Wunsch
Edges of matrix initialized to 0
Cell score never less than 0 (a negative maximum is replaced by 0)
No pointer unless score greater than 0
Traceback starts at highest score (rather than lower right) and ends at 0
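The modifications listed above can be sketched as follows. This is a minimal illustration under the same assumed scoring as before (match +1, mismatch -1, gap -1); the function name smith_waterman is hypothetical. The three differences from the global version are marked in the comments: zero borders, scores clamped at 0, and a traceback that starts at the highest-scoring cell and stops at the first 0.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment sketch; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # Edges of the matrix stay at 0.
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            diag = score[i - 1][j - 1] + s
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            # A cell never drops below 0.
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Traceback starts at the highest score and ends when a 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while score[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTGGG", "CCACGTCC"))  # (4, 'ACGT', 'ACGT')
```

Only the shared region is reported; the unrelated flanking letters of both sequences are left out of the alignment.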
GLOBAL VS. LOCAL
Global – both sequences aligned along entire lengths
Local – best subsequence alignment found
Global alignment of two genomic sequences may not align exons
Local alignment would only pick out maximum scoring exon
COMPLEXITY
O(mn) time and memory, where m and n are the lengths of the two sequences
This is impractical for long sequences!
Observation: during fill phase of the algorithm, we only use two rows at a time
Instead of calculating whole matrix, calculate score of maximum scoring alignment, and restrict search along diagonal
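The two-row observation can be sketched as follows: since each cell depends only on the current row and the row above it, the score of the best global alignment can be computed while keeping O(n) memory. This sketch assumes the same scoring as before (match +1, mismatch -1, gap -1) and a hypothetical function name; it returns the score only, because the alignment itself cannot be traced back once earlier rows are discarded. The restriction to a band around the diagonal is not shown here.

```python
def global_score_two_rows(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment, keeping only two rows in memory."""
    # Row 0: aligning prefixes of b against nothing.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # First cell of the row: prefix of a aligned against nothing.
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + s,   # diagonal
                          prev[j] + gap,     # vertical gap
                          curr[j - 1] + gap) # horizontal gap
        prev = curr  # the older row is no longer needed
    return prev[-1]


print(global_score_two_rows("ACGT", "AGT"))  # 2
```

Time is still O(mn); only the memory requirement shrinks.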
OTHER OBSERVATIONS
Most boxes have a score of 0 – wasted computation
Idea: make alignments where positive scores most likely (approximation)
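One common way to act on this idea is to look first for short exact words shared by both sequences and to spend alignment effort only near those positions. The sketch below illustrates that general seeding idea only; it is not the BLAST procedure, and the function name shared_words and the word length k=3 are assumptions made for the example.

```python
from collections import defaultdict


def shared_words(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a length-k word matches exactly."""
    # Index every length-k word of the subject by its start position.
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)

    # Look up every length-k word of the query in that index.
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


print(shared_words("TTTACGTGGG", "CCACGTCC"))  # [(3, 2), (4, 3)]
```

Both hits lie on the same diagonal (query position minus subject position equals 1), which is where a full dynamic programming matrix would hold its positive scores for this pair.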
BLAST
In bioinformatics, the Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino acid sequences of different proteins or the nucleotides of DNA sequences.
A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold.
BLAST is one of the most widely used bioinformatics programs, because it addresses a fundamental problem and the heuristic algorithm it uses is much faster than calculating an optimal alignment.
This emphasis on speed is vital to making the algorithm practical on the huge genome databases currently available, although subsequent algorithms can be even faster.
Before fast algorithms such as BLAST and FASTA were developed, doing database searches for protein or nucleic acid sequences was very time consuming because a full alignment procedure (e.g., the Smith–Waterman algorithm) was used.
EXAMPLES
Examples of other questions that researchers use BLAST to answer are:
Which bacterial species have a protein that is related in lineage to a certain protein with a known amino acid sequence?
Where does a certain sequence of DNA originate?
What other genes encode proteins that exhibit structures or motifs such as ones that have just been determined?
BLAST is also often used as part of other algorithms that require approximate sequence matching.
PROGRAM
The BLAST algorithm and the computer program that implements it were developed by Stephen Altschul, Warren Gish, and David Lipman at the U.S. National Center for Biotechnology Information (NCBI), Webb Miller at the Pennsylvania State University, and Gene Myers at the University of Arizona.
It is available on the web at the NCBI website.