Alignment Gaps and Sites with Missing Information

Gaps are often inserted during the alignment of homologous regions of sequences and represent deletions or insertions (indels). They complicate distance estimation. Sites with missing information, which sometimes result from experimental difficulties, present the same alignment problems as gaps. In the following discussion, both situations are treated in the same way.

In MEGA, there are two ways to treat gaps. One is to delete every site that contains a gap in any sequence from the data analysis. This option, called Complete-Deletion, is generally desirable because different regions of DNA or amino acid sequences evolve under different evolutionary forces, and it ensures that every pairwise distance is computed from the same set of sites. The second method is relevant if the number of nucleotides involved in a gap is small and if the gaps are distributed more or less randomly. In that case, it may be possible to compute a distance for each pair of sequences, ignoring only those gaps that are involved in the comparison; this option is called Pairwise-Deletion. The table below illustrates the effect of these options on distance estimation, using the following three sequences:

      1        10        20
seq1  A-AC-GGAT-AGGA-ATAAA
seq2  AT-CC?GATAA?GAAAAC-A
seq3  ATTCC-GA?TACGATA-AGA

Total sites = 20.

Here, the alignment gaps are indicated with a hyphen (-) and the missing information sites are denoted by a question mark (?).
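As a rough illustration of which sites survive Complete-Deletion, the following Python sketch scans the three sequences column by column and keeps only the positions where no sequence has a hyphen or a question mark. The names `MISSING`, `alignment`, and `complete_sites` are hypothetical and chosen for this example; this is not MEGA's implementation.

```python
# Illustrative sketch only; names are hypothetical and this is not MEGA's code.
MISSING = {"-", "?"}  # alignment-gap and missing-information symbols

alignment = {
    "seq1": "A-AC-GGAT-AGGA-ATAAA",
    "seq2": "AT-CC?GATAA?GAAAAC-A",
    "seq3": "ATTCC-GA?TACGATA-AGA",
}


def complete_sites(seqs):
    """Return 1-based positions where no sequence has a gap or missing symbol."""
    return [
        pos
        for pos, column in enumerate(zip(*seqs), start=1)
        if not MISSING & set(column)
    ]


kept = complete_sites(alignment.values())
print(kept)       # [1, 4, 7, 8, 11, 13, 14, 16, 18, 20]
print(len(kept))  # 10
```

Under these assumptions, 10 of the 20 sites remain, because a single gap or missing symbol in any one sequence removes the whole column.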

Complete-Deletion and Pairwise-Deletion options

Option             Sequence data             Differences/Comparisons
                                             (1,2)   (1,3)   (2,3)
Complete-Deletion  1. A  C  GA  A GA A A A   1/10    0/10    1/10
                   2. A  C  GA  A GA A C A
                   3. A  C  GA  A GA A A A
Pairwise-Deletion  1. A-AC-GGAT-AGGA-ATAAA   2/12    3/13    3/14
                   2. AT-CC?GATAA?GAAAAC-A
                   3. ATTCC-GA?TACGATA-AGA

In the above table, the number of compared sites varies among pairwise comparisons under the Pairwise-Deletion option (12, 13, and 14 sites) but remains the same (10 sites) under the Complete-Deletion option. In this data set, the Pairwise-Deletion option uses more of the available information. In practice, however, different regions of nucleotide or amino acid sequences often evolve differently, in which case the Complete-Deletion option is preferable.
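The differences/comparisons counts for both options can be reproduced with a short Python sketch. It assumes that a site is usable for a pair only when neither sequence has a hyphen or question mark there, and that Complete-Deletion first restricts the analysis to sites that are usable in all sequences. The function names `count_pair` and `compare_all` are hypothetical and do not correspond to anything in MEGA.

```python
from itertools import combinations

# Illustrative sketch only; names are hypothetical and this is not MEGA's code.
MISSING = {"-", "?"}  # alignment-gap and missing-information symbols

seqs = [
    "A-AC-GGAT-AGGA-ATAAA",  # seq1
    "AT-CC?GATAA?GAAAAC-A",  # seq2
    "ATTCC-GA?TACGATA-AGA",  # seq3
]


def count_pair(a, b, sites):
    """Return (differences, comparisons) for sequences a and b over `sites`."""
    usable = [i for i in sites if a[i] not in MISSING and b[i] not in MISSING]
    diffs = sum(a[i] != b[i] for i in usable)
    return diffs, len(usable)


def compare_all(seqs, option):
    """Tabulate every sequence pair under 'complete' or 'pairwise' deletion."""
    all_sites = range(len(seqs[0]))
    if option == "complete":
        # Keep only sites with no gap or missing symbol in ANY sequence.
        sites = [i for i in all_sites if all(s[i] not in MISSING for s in seqs)]
    elif option == "pairwise":
        # Keep every site; each pair drops only its own gaps in count_pair.
        sites = list(all_sites)
    else:
        raise ValueError("option must be 'complete' or 'pairwise'")
    return {
        (i + 1, j + 1): count_pair(seqs[i], seqs[j], sites)
        for i, j in combinations(range(len(seqs)), 2)
    }


for option in ("complete", "pairwise"):
    for pair, (diffs, n) in compare_all(seqs, option).items():
        print(f"{option:9s} {pair}: {diffs}/{n}")
```

Dividing each difference count by its comparison count gives a simple proportion of differing sites for that pair; under Pairwise-Deletion the denominators differ between pairs, which is the trade-off described above.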