3. Basic Sequence Statistics


In the study of molecular evolution it is often necessary to know some basic statistical quantities such as nucleotide frequencies, codon frequencies, and transition/transversion ratios. The statistical quantities that can be computed by MEGA are discussed in this chapter.

3.1 Nucleotide and Amino Acid Compositions

The relative frequencies of the four nucleotides (nucleotide composition) or of the twenty amino acid residues (amino acid composition) can be computed for a specific sequence or for all the sequences used.

Example 3.1 Nucleotide composition of HLA sequences.

   ---------- Nucleotide composition ----------
   All values in per cent (%) except Totals 

                A     T     C     G      Total 

   HLA-A2    20.8  15.2  29.8  34.2        822 
   HLA-A3    20.4  14.7  30.2  34.7        822 
   HLA-A11   20.6  14.1  30.5  34.8        822 
   HLA-AW24  20.9  14.6  30.2  34.3        822 
   HLA-AW68  20.7  14.8  30.2  34.3        822 
   All       20.7  14.7  30.2  34.5       4110

For coding regions of DNA, three additional tables are presented for the nucleotide compositions at first, second, and third codon positions. From these tables the G + C content can easily be computed. The amino acid composition can also be presented in a similar tabular form.
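As an illustration of how such a composition table and the G + C content could be derived, here is a minimal Python sketch. It is not MEGA's implementation: the function names are hypothetical, and it assumes that characters other than A, T, C, and G (gaps, ambiguity codes) are skipped and that a coding sequence starts at the first position of a codon.

```python
from collections import Counter

BASES = "ATCG"  # column order used in Example 3.1


def nucleotide_composition(seq):
    """Return (percentages, total) for A, T, C, G in one sequence.

    Characters other than A, T, C, G are ignored; this is an assumption
    of the sketch, not documented MEGA behavior.
    """
    counts = Counter(base for base in seq.upper() if base in BASES)
    total = sum(counts.values())
    if total == 0:
        return {base: 0.0 for base in BASES}, 0
    percents = {base: 100.0 * counts[base] / total for base in BASES}
    return percents, total


def gc_content(percents):
    """G + C content in per cent, from a composition dictionary."""
    return percents["G"] + percents["C"]


def composition_by_codon_position(seq):
    """Compositions at the first, second, and third codon positions.

    Assumes the sequence is in frame, starting at codon position 1.
    """
    return [nucleotide_composition(seq[pos::3]) for pos in range(3)]
```

The same arithmetic can be applied directly to a row of Example 3.1: for HLA-A2, the G + C content is 29.8 + 34.2 = 64.0 per cent.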

3.2 Codon Usage

There are 64 (4^3) possible codons but only 20 amino acids (plus stop signals), so an amino acid may be encoded by several codons (e.g., serine is encoded by six codons in nuclear genes). It is therefore of interest to know the codon usage for each amino acid. In MEGA, the numbers of occurrences of the 64 codons in a gene can be computed either for a specific sequence or for all sequences examined. Four genetic codes are included: the "universal" code and the mammalian, Drosophila, and yeast mitochondrial genetic codes.

MEGA is also capable of computing Sharp et al.'s (1986) relative synonymous codon usage (RSCU). RSCU is the observed frequency of a codon divided by its expected frequency under the assumption of equal codon usage. That is,

   \mathrm{RSCU}_{ij} = \frac{X_{ij}}{\frac{1}{n_i} \sum_{j=1}^{n_i} X_{ij}}        (3.1)

Here, X_{ij} is the number of occurrences of the j-th codon for the i-th amino acid, and n_i is the number (from one to six) of alternative codons for the i-th amino acid. This index is useful for identifying codons that are used more often or less often than expected under the assumption of equal usage.

Example 3.2 Codon frequencies and RSCU values for HLA-A2.

   -------- codon Usage --------
   Codon Usage Table for HLA-A2 
   Frequency of codons and relative synonymous codon usage (RSCU) 

   TTT (F)   0 (0.00)       ...       TGT (C)   0 (0.00) 
   TTC (F)   8 (2.00)       ...       TGC (C)   4 (2.00) 
   TTA (L)   0 (0.00)       ...       TGA (*)   0 (0.00) 
   TTG (L)   2 (0.71)       ...       TGG (W)  10 (1.00) 
    .
    .
    .
   GTT (V)   0 (0.00)       ...       GGT (G)   3 (0.60) 
   GTC (V)   2 (0.50)       ...       GGC (G)   7 (1.40) 
   GTA (V)   0 (0.00)       ...       GGA (G)   2 (0.40) 
   GTG (V)  14 (3.50)       ...       GGG (G)   8 (1.60) 

   Total codons scored: 274 
   '*' indicates a stop codon. 
   RSCU is given in parentheses.
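
The RSCU definition in section 3.2 can be checked against the values in Example 3.2 with a short Python sketch. The function name `rscu` is hypothetical, and reporting 0.00 for every codon of an amino acid that does not occur is a choice made here, not a statement about how MEGA handles that case.

```python
def rscu(codon_counts):
    """RSCU values for the synonymous codons of one amino acid.

    codon_counts maps each synonymous codon to its observed count,
    so len(codon_counts) is the number of alternative codons.
    """
    n_codons = len(codon_counts)
    total = sum(codon_counts.values())
    if total == 0:
        # Amino acid not observed: return 0.0 throughout (sketch assumption).
        return {codon: 0.0 for codon in codon_counts}
    expected = total / n_codons  # count expected under equal codon usage
    return {codon: count / expected for codon, count in codon_counts.items()}


# Glycine codon counts for HLA-A2, taken from Example 3.2.
glycine = {"GGT": 3, "GGC": 7, "GGA": 2, "GGG": 8}
for codon, value in rscu(glycine).items():
    print(f"{codon} {value:.2f}")
```

With 20 glycine codons spread over four alternatives, the expected count is 5 per codon, so the sketch prints 0.60, 1.40, 0.40, and 1.60, matching the values shown in parentheses in the example.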

3.3 Nucleotide Pair Frequencies

When two nucleotide sequences are compared, the frequencies of 10 different types of nucleotide pairs can be computed. In MEGA these frequencies are tabulated in the following form.

Example 3.3 Nucleotide pair frequencies for alleles of the HLA-A locus.

   ------- Observed nucleotide pair frequencies -------
   n: total number of nucleotides compared 
   ns: number of transitional differences
   nv: number of transversional differences
   nd: ns+nv (total number of nucleotide differences)

                        Tran-      Trans-        Identical 
                       sition     version           pair 
                        AG TC   AT AC TG CG    AA  TT  CC  GG   ns/nv   nd    n 
   HLA-A2 vs. HLA-A3    11  5    2  2  5  8   162 117 239 271   0.94    33   822
   HLA-A2 vs. HLA-A11   11  8    3  4  4 10   161 113 237 271   0.90    40   822
   HLA-A2 vs. HLA-AW24  13 11    3  1  5 15   163 113 233 265   1.00    48   822
   HLA-A2 vs. HLA-AW68   3  2    2  2  5 11   167 119 239 272   0.25    25   822
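
The ten pair types and the summary columns (ns, nv, nd, n, ns/nv) can be sketched in Python as follows. The function names are hypothetical, the input sequences are assumed to be already aligned, and sites where either sequence has a gap or an ambiguous base are skipped, which is an assumption of this sketch rather than documented MEGA behavior. Pair labels are sorted alphabetically, so the table's TC and TG columns appear here under the keys "CT" and "GT".

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def pair_frequencies(seq1, seq2):
    """Count unordered nucleotide pairs between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = Counter()
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a in "ATCG" and b in "ATCG":  # skip gaps and ambiguity codes
            pairs["".join(sorted((a, b)))] += 1
    return pairs


def summarize(pairs):
    """Derive ns, nv, nd, n, and the ns/nv ratio from pair counts."""
    ns = nv = n = 0
    for (a, b), count in pairs.items():
        n += count
        if a == b:
            continue  # identical pair
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            ns += count  # transition: purine<->purine or pyrimidine<->pyrimidine
        else:
            nv += count  # transversion: purine<->pyrimidine
    ratio = ns / nv if nv else None  # undefined when there are no transversions
    return {"ns": ns, "nv": nv, "nd": ns + nv, "n": n, "ratio": ratio}
```

Feeding the HLA-A2 vs. HLA-AW68 counts from Example 3.3 into `summarize` gives ns = 5, nv = 20, nd = 25, n = 822, and a ratio of 0.25, in agreement with the last row of the table.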

3.4 Alignment Gap Frequencies

The observed numbers of alignment gaps of different lengths (sites) are useful for studying the distribution of insertions/deletions and for deciding whether all sites containing gaps should be deleted (see section 4.5). In MEGA, the numbers of gaps of length 1 to 10 can be computed either for each sequence or for all sequences. The numbers of gaps longer than 10 sites are pooled together with the number of gaps of length 10.

Example 3.4 Alignment gap frequencies for HLA sequences.

   -------- Alignment Gap Frequencies --------
   All entries in the table are the observed number of occurrences 
              1    2    3   ...   >10  Total 
   HLA-A2     0    0    0   ...   1    1
   HLA-A3     0    0    0   ...   1    1
   HLA-A11    0    0    0   ...   1    1
   HLA-AW24   0    0    0   ...   1    1
   HLA-AW68   0    0    0   ...   1    1
   All        0    0    0   ...   5    5
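
A gap-length tally of this kind can be sketched in Python as below. The function name and the use of "-" as the gap symbol are assumptions made for illustration. Following the description in the text, runs longer than the maximum length are pooled into the last class.

```python
import re
from collections import Counter


def gap_length_counts(seq, gap_char="-", max_len=10):
    """Count runs of consecutive gap characters in one aligned sequence.

    Returns a list whose entry k-1 is the number of gaps of length k,
    for k = 1..max_len; longer runs are pooled into the max_len class.
    """
    counts = Counter()
    for run in re.finditer(re.escape(gap_char) + "+", seq):
        counts[min(len(run.group()), max_len)] += 1
    return [counts[length] for length in range(1, max_len + 1)]


# Toy sequence with gaps of length 1, 3, and 12.
toy = "AC-GT---A" + "-" * 0 + "G" + "-" * 12 + "C"
print(gap_length_counts(toy))
```

Summing the per-sequence lists position by position gives the "All" row of the table.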

3.5 Variable Regions of Sequences

It is well known that some regions of DNA or amino acid sequences are more variable than others. For example, the control region of mammalian mitochondrial DNA has two hypervariable segments (Kocher and Wilson 1991). One way of detecting such variable regions is to examine the number of variable sites in different segments of the DNA. In MEGA, the numbers of variable sites in overlapping (sliding window) or nonoverlapping segments of equal size can be computed for any segment size (window size). The output lists these numbers for the specified segment size together with a histogram.

Example 3.5 Nonoverlapping windows for HLA-A sequence data.

   -------- Variability --------
   Total number of variable sites: 71 
   Numbers of variable sites in nonoverlapping segments of size 100 

   Location 

     1-100 |  6 | ******
   101-200 |  5 | *****
   201-300 | 19 | *******************
   301-400 | 10 | **********
   401-500 |  7 | *******
   501-600 | 13 | *************
   601-700 |  5 | *****
   701-800 |  5 | *****
   801-    |  1 | * 
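
The window counts above can be reproduced in outline with the following Python sketch. The function names are hypothetical. A site is treated as variable when the aligned sequences show more than one distinct character at that position, so a gap opposite a base also counts as a difference; that treatment is an assumption of the sketch and may differ from MEGA's.

```python
def variable_sites(alignment):
    """Return one boolean per site: True where the sequences differ."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    return [len(set(column)) > 1 for column in zip(*alignment)]


def nonoverlapping_counts(flags, size):
    """(first, last, count) for consecutive windows; the last may be short."""
    return [(start + 1, min(start + size, len(flags)),
             sum(flags[start:start + size]))
            for start in range(0, len(flags), size)]


def sliding_counts(flags, size):
    """(first, last, count) for every full-length window, moved one site at a time."""
    return [(start + 1, start + size, sum(flags[start:start + size]))
            for start in range(0, len(flags) - size + 1)]


def histogram(rows):
    """Format window counts as text with one asterisk per variable site."""
    return "\n".join(f"{first:>4}-{last:<4} | {count:>2} | {'*' * count}"
                     for first, last, count in rows)


toy_alignment = ["ACGTACGT", "ACGAACGA", "TCGTACGT"]
flags = variable_sites(toy_alignment)
print(histogram(nonoverlapping_counts(flags, 3)))
```

As in Example 3.5, the nonoverlapping counts sum to the total number of variable sites (6 + 5 + 19 + 10 + 7 + 13 + 5 + 5 + 1 = 71), whereas sliding-window counts do not, because each site falls in several windows.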

