See also how to create an alignment.
Homologous (related) sequences are aligned in such a way that letters (nucleotides for DNA, amino acids for proteins) listed
in the same column are thought to originate from the same ancestral letter.
If all letters in a column are identical, the position is considered conserved. If the letters differ, it is assumed that a
change (mutation) has taken place at some point during evolution.
Because homologous sequences are rarely of the same length, it is necessary to introduce gaps into the alignment, so that the
following columns are correctly aligned.
Example:
Sequence1 MCQFHRYM
Sequence2 M-QFHRYM
Sequence3 M-QFHRDM
The alignment consists of 8 columns and, with the exception of columns #2 and #7, all columns are conserved.
Column #2 contains gaps because only Sequence1 contains a cysteine (C) at this position. From an evolutionary
perspective this could be explained either by the ancestral sequence having a cysteine at this position, which
Sequence2 and Sequence3 have lost, or by the ancestral sequence lacking the cysteine and Sequence1 having gained
it during the course of evolution.
Column #7 on the other hand contains no gaps, but the position is also not conserved. Sequence1 and Sequence2 show a
tyrosine (Y) at this position, whereas Sequence3 shows an aspartic acid (D). One can either hypothesize that the ancestral
sequence contained a tyrosine and Sequence3 suffered a mutation event or that the ancestral sequence contained an aspartic acid
and Sequence1 and Sequence2 are the mutated sequences.
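The column-by-column reading above can be sketched in a few lines of Python. This is only an illustration, not part of GeneWarrior: the function name classify_columns and the labels "conserved", "gapped" and "variable" are made up for this example.

```python
# Toy alignment copied from the example above.
alignment = {
    "Sequence1": "MCQFHRYM",
    "Sequence2": "M-QFHRYM",
    "Sequence3": "M-QFHRDM",
}

def classify_columns(alignment):
    """Label each column as 'conserved', 'gapped' or 'variable' (hypothetical labels)."""
    labels = []
    # zip(*rows) walks through the alignment one column at a time.
    for column in zip(*alignment.values()):
        letters = set(column)
        if "-" in letters:
            labels.append("gapped")      # at least one sequence has a gap here
        elif len(letters) == 1:
            labels.append("conserved")   # every sequence shows the same letter
        else:
            labels.append("variable")    # no gap, but the letters differ
    return labels

labels = classify_columns(alignment)
# Report 1-based column numbers, matching the #2 / #7 notation used in the text.
not_conserved = [i for i, label in enumerate(labels, start=1) if label != "conserved"]
print(not_conserved)  # [2, 7]
```

Column #2 comes out as "gapped" and column #7 as "variable", the two cases discussed above.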
If we have the following two sequences (they differ after the first T: LE in Sequence1, D in Sequence2):
Sequence1 TLEVEPS
Sequence2 TDVEPS
and we try to align them by hand, we would get two possible alignments:
Alignment 1:
Sequence1 TLEVEPS
Sequence2 TD-VEPS
Alignment 2:
Sequence1 TLEVEPS
Sequence2 T-DVEPS
How do we decide which of those two possibilities to use?
The answer lies in substitution matrices. A substitution matrix is a table that scores every possible
mutation (substitution). For example, a substitution of L by D has a score of -4 in the BLOSUM62 matrix, whereas a
substitution of E by D has a score of +2.
The substitution score reflects how likely such a mutation is to occur: the higher the score, the more likely
the mutation. In the example, the first column (T remains T) has a score of +5; as one would expect,
it is more likely that no mutation occurs (higher score) than that a mutation occurs (lower score).
Gaps are scored in a similar fashion (i.e. the likelihood that an amino acid is lost or inserted).
Scores are derived from alignments of many related sequences by counting how often each possible substitution is observed.
The BLOSUM62 matrix can be viewed on the NCBI website.
In the example above we would decide to use Alignment 2, since E to D has a higher score (has a higher probability) than L to D.
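As a rough sketch, the comparison of the two hand-made alignments can be written out in Python. The table below holds only the handful of BLOSUM62 entries this example needs, not the full matrix. The gap penalty of -4 is an assumption made for illustration, since the text gives no value for it, and score_alignment is a hypothetical helper name.

```python
# Excerpt of BLOSUM62 scores needed for this example (not the full matrix).
SCORES = {
    ("T", "T"): 5, ("V", "V"): 4, ("E", "E"): 5, ("P", "P"): 7, ("S", "S"): 4,
    ("L", "D"): -4, ("E", "D"): 2,
}
GAP = -4  # assumed gap penalty, chosen only for illustration

def score_alignment(top, bottom):
    """Sum the column scores of a pairwise alignment given as two equal-length rows."""
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += GAP
        elif (x, y) in SCORES:
            total += SCORES[(x, y)]
        else:
            total += SCORES[(y, x)]  # the matrix is symmetric
    return total

score1 = score_alignment("TLEVEPS", "TD-VEPS")  # Alignment 1: L paired with D
score2 = score_alignment("TLEVEPS", "T-DVEPS")  # Alignment 2: E paired with D
print(score1, score2)  # 17 23
```

Both alignments contain exactly one gap and share the five identical columns, so the totals differ only by the L/D (-4) versus E/D (+2) column. Alignment 2 therefore wins for any choice of gap penalty.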
There are two main types of alignments, global and local alignments.
Example:
Sequence1 GVTQLNRLAA
Sequence2 DTQLRRLCDA
Global Alignment
GVTQLNRLA-A
-DTQLRRLCDA
Local Alignment
TQLNRL
TQLRRL
Global alignments try to align the entire sequence, whereas local alignments only output the part of the sequence
which gives the best (highest scoring) alignment (i.e. the region which has the highest similarity).
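A local alignment of the two example sequences can be reproduced with a minimal Smith-Waterman-style sketch. The scoring used here (match +1, mismatch -1, gap -2) is an assumption chosen to keep the example small; it is not BLOSUM62 and not necessarily what GeneWarrior uses. The function name local_align is likewise hypothetical.

```python
def local_align(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    n, m = len(a), len(b)
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = (0, 0, 0)  # (score, i, j) of the highest-scoring cell
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets an alignment start anywhere: this is what makes it local.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best[0]:
                best = (H[i][j], i, j)
    score, i, j = best
    top, bottom = [], []
    # Trace back from the best cell until the score drops to 0.
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score, "".join(reversed(top)), "".join(reversed(bottom))

result = local_align("GVTQLNRLAA", "DTQLRRLCDA")
print(result)  # (4, 'TQLNRL', 'TQLRRL')
```

With these assumed scores the best-scoring region is the same TQLNRL / TQLRRL stretch shown in the example. A global alignment would use the same table without the 0 floor and would trace back from the bottom-right corner instead of the best cell.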
A special case is the cost-free end (CFE) alignment. This is essentially a global alignment, except that
gaps at the start and the end of the alignment are not penalized. This makes it possible to align sequences that only overlap at their ends.
Cost-free End (CFE) Alignment
LRMETRELNYGRL-------
--------NYGRLQNQLAKK
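The overlap above can be reproduced with a small sketch of a cost-free-end alignment. As in any such sketch, the scoring (match +1, mismatch -1, gap -2) is an assumption, and cfe_align is a hypothetical name, not GeneWarrior's implementation. The two differences from a plain global alignment are marked in the comments.

```python
def cfe_align(a, b, match=1, mismatch=-1, gap=-2):
    """Global-style alignment in which leading and trailing gaps cost nothing."""
    n, m = len(a), len(b)
    # Difference 1: first row and column stay 0, so leading gaps are free.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
    # Difference 2: the alignment may end anywhere in the last row or column,
    # so trailing gaps are free as well.
    last_row = [(H[n][j], n, j) for j in range(m + 1)]
    last_col = [(H[i][m], i, m) for i in range(n + 1)]
    score, i, j = max(last_row + last_col)
    end_i, end_j = i, j
    top, bottom = [], []
    # Trace back through the scored part until an edge of the table is reached.
    while i > 0 and j > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    # Add the unscored overhangs at both ends, padded with gaps.
    row_a = a[:i] + "-" * j + "".join(reversed(top)) + a[end_i:] + "-" * (m - end_j)
    row_b = "-" * i + b[:j] + "".join(reversed(bottom)) + "-" * (n - end_i) + b[end_j:]
    return score, row_a, row_b

score, row_a, row_b = cfe_align("LRMETRELNYGRL", "NYGRLQNQLAKK")
print(row_a)  # LRMETRELNYGRL-------
print(row_b)  # --------NYGRLQNQLAKK
```

Only the five overlapping columns (NYGRL) contribute to the score. The overhangs on either side are padded with gaps that cost nothing.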
GeneWarrior lets you choose between local alignments (named "Highest similarity region" in the toolbox) and CFE alignments (named
"Full sequences") for pairwise alignments (alignments of two sequences).
For multiple sequence alignments (MSA), the third-party software MUSCLE is used, which performs a mix of global and local
alignment.
See the Tutorial on how to create an alignment.