序列比對
维基百科,自由的百科全书
序列比對指將兩個或多個序列排列在一起,標明其相似之處。序列中可以插入間隔(通常用短橫線“-”表示)。對應的相同或相似的符號(在核酸中是A, T(或U), C, G,在蛋白質中是氨基酸殘基的單字母表示)排列在同一列上。
tcctctgcctctgccatcat---caaccccaaagt |||| ||| ||||| ||||| |||||||||||| tcctgtgcatctgcaatcatgggcaaccccaaagt
It is usually used to study the evolution of the sequences from a common ancestor, especially biological sequences such as protein sequences or DNA sequences. Mismatches in the alignment correspond to mutations, and gaps correspond to insertions or deletions. Sequence alignment can also be used to study things like the evolution of languages and the similarity between texts.
这一方法常用于研究由共同祖先进化而来的序列,特别是如蛋白质序列或DNA序列等生物序列。在比对中,错配与突变相应,而空位与插入或缺失对应。序列比对还可用于语言进化或文本间相似性之类的研究。
The term sequence alignment may also refer to the process of constructing such alignment or finding significant alignments in a database of potentially unrelated sequences.
术语“序列比对”也指构建上述比对或在潜在的不相关序列的数据库中寻找significant alignments。
目录 |
[编辑] 雙序列比對
Pairwise sequence alignment methods are concerned with finding the best-matching piecewise (local) or global alignments of protein (amino acid) or DNA (nucleic acid) sequences.
双序列比对方法涉及寻找(局部)最优匹配片断或蛋白质(氨基酸)或DNA(核酸)全局比对。
Typically, the purpose of this is to find homologues (relatives) of a gene or gene-product in a database of known examples. This information is useful for answering a variety of biological questions. The most important application of pairwise alignment is identification of sequences of unknown structure or function. Another important use is the study of molecular evolution.
[编辑] 全局比對
A global alignment between two sequences is an alignment in which all the characters in both sequences participate in the alignment. Global alignments are useful mostly for finding closely-related sequences. As these sequences are also easily identified by local alignment methods global alignment is now somewhat deprecated as a technique. Further, there are several complications to molecular evolution (such as domain shuffling - see below) which prevent these methods from being useful.
[编辑] 局部比對
Local alignment methods find related regions within sequences - in other words they can consist of a subset of the characters within each sequence. For example, positions 20-40 of sequence A might be aligned with positions 50-70 of sequence B.
This is a more flexible technique than global alignment and has the advantage that related regions which appear in a different order in the two proteins (which is known as domain shuffling) can be identified as being related. This is not possible with global alignment methods.
[编辑] 序列比對的重要性
The typical assumption in the use of alignment is the mechanism of molecular evolution. DNA carries over genetic material from generation to generation, by virtue of its semi-conservative duplication mechanism. Changes in the material are introduced by occasional errors and mutations in the duplication, and by viruses and other mechanisms which sometimes move sub-sequences within the chromosome and between individuals. Consequently, an alignment between sequences indicates that the sequences evolved from a common ancestor which contained the matching subsequences. In the case of genetic sequences, it implies that their carriers evolved from a common ancestor.
Using assumptions about the probabilities of these change events, we can estimate the time when sequences diverged from a common ancestor or the time required for changing one sequence into another. However, there is disagreement about the value and nature of these probabilities for biological evolution. One school of thought assumes a simple, constant rate of change (an application of Occam's Razor) while the other school assumes short evolutionary periods when changes were extremely high.
Many genetic changes are evolutionarily non-significant (that is, neutral, neither positive or negative) in that they do not affect the selective advantage (the fitness) of the carrier. This is the basis of Kimura's Neutral theory. This is partly supported by examples such as the Globin protein sequence which shows normal functionality in-spite of several changes at various bases but loses functionality only when changes occur at critical sequence locations.
The actual biological meaning of any alignment can never be absolutely guaranteed. However, statistical methods can be used to assess the likelihood of finding an alignment between two regions (or sequences) by chance, given the size of the database and its composition.
Two important related issues for sequence alignment are:
- How should we choose the best alignment between two sequences (or regions of sequences)?
- How do we rank the alignments between a query and the many sequences in the database, according to their significance in the domain (such as biological significance)?
In studying evolution, the first question can be addressed by developing a model of how likely are certain changes between characters in the sequence. There are many ways to do this, none of which is generally superior. The models are derived empirically using related sequences, and are expressed as substitution matrices. These matrices are used by the algorithms named below to evaluate alternative alignments between two sequences. The actual biological quality of the alignments then depends upon the evolutionary model used to generate the score.
The second question is purely statistical. Research has determined a few hard theoretical rules and many approximations. It is now generally accepted that the scores of alignments between random sequences follow the extreme value distribution. Pairwise alignment programs such as BLAST use simulation to estimate the parameters of this distribution given a particular query, database, substitution matrix and certain other parameters. Alignments can then be given a statistical significance value, allowing inference of possible relationships between sequences.
[编辑] 結構比對
One potential way to validate sequence alignments is to use structural alignments where possible. Because protein structure is more conserved through evolution than is sequence (as shown by Cyrus Chothia and Arthur Lesk in 1986), structural alignments can be more reliable over long evolutionary distances, when the sequences have diverged so much that simple sequence comparison cannot detect their similarity. However, although structural alignments are useful, they are completely artifactual.
[编辑] 多序列比對
Multiple alignment is an extension of pairwise alignment to incorporate more than two sequences into an alignment. Multiple alignment methods try to align all of the sequences in a specified set. Alignments help in the identification of common regions between the sequences. There are several approaches to creating multiple sequence alignments, one of the most popular being the progressive alignment strategy used by the Clustal family of programs. Clustal is used in cladistics to build phylogenetic trees, and to build sequence profiles which are used by PSI-BLAST and Hidden Markov model- (HMM-) methods to search sequence databases for more distant relatives.
Multiple sequence alignment is computationally difficult and is classified as an NP-Hard problem.
[编辑] 算法
[编辑] Needleman-Wunsch
Pairwise global alignment only.
[编辑] Smith-Waterman
Pairwise, local or global alignment. It works by generating all six possible translations of each nucleotide sequence, then uses those translations for protein search.
[编辑] Framesearch
This is an extension of Smith-Waterman, for pairwise alignment between a protein sequence and a nucleotide sequence.
Framesearch dynamically considers every possible single-nucleotide insertion or deletion (indel) to generate the translation that best matches the protein sequence. This is unlike six-frame-translated BLAST or Smith-Waterman, which generate all six possible translations of each nucleotide sequence and use them for protein search. This allows dramatically better alignments when the nucleotide sequence contains many indels, as is often the case with single-read expressed sequence tag sequences and very early versions of genomic sequences.
Unfortunately Framesearch is extremely CPU-intensive. Its practical use typically requires a large Computer cluster or specialized computer hardware optimized for running dynamic programming algorithms on a large scale. Several companies sell such systems, which use either custom integrated circuits or Field-programmable gate array technology.
[编辑] 輭件
[编辑] SSearch
Implements the standard Smith-Waterman algorithm. It is considerably slower than the more modern BLAST and FASTA methods. However, Smith-Waterman remains the gold standard for protein-protein or nucleotide-nucleotide pairwise alignment; its speed can be improved by using specialized hardware or a computer cluster.
[编辑] BLAST (Basic Local Alignment Search Tool)
Main article: BLAST
This method uses a pre-computed hash table to serve as an index for short sequences. Given a query sequence, the sub-sequences are looked up in the index to reduce the amount of time and searching involved. Several parameters need to be provided to make this method faster or more accurate. Once patterns that match the search sequence are found, more accurate and intensive algorithms may be applied.
BLAST uses a pairwise local search and uses a number of methods to increase the speed of the original Smith-Waterman algorithm.
这一方法利用一个预先计算的哈西表作为短序列的索引。给定一个被查询序列,将根据索引来查询子序列,从而减少查询次数和时间。提供一些参数将使该方法更快或更准确。检索到与检索序列匹配的模式后,需要进一步使用更加准确和深入的算法。
BLAST 利用成对的本地检索和许多其他方法来提高Smith-Waterman算法的速度。
[编辑] FASTA
Main article: FASTA
Pairwise local search. Superseded by BLAST.
[编辑] Clustal
Main article: Clustal
Progressive multiple alignment method. Comes in several varieties (ClustalW, ClustalX etc.)
[编辑] 外部鏈接
- Blast Server at the NCBI
- An excellent article at the NCBI web site on the methodology of the BLAST algorithm and the statistical significance of sequence alignments in general.
- JAligner is an open source Java implementation of the dynamic programming algorithm Smith-Waterman for biological pairwise local sequence alignment.