Large Sample Tests of Selection

One way to test whether positive selection is operating on a gene is to compare the relative abundance of synonymous and nonsynonymous substitutions that have occurred in the gene sequences. For a pair of sequences, this is done by first estimating the number of synonymous substitutions per synonymous site (dS) and the number of nonsynonymous substitutions per nonsynonymous site (dN), together with their variances, Var(dS) and Var(dN). With this information, we can test the null hypothesis H0: dN = dS using a Z-test:

Z = (dN - dS) / SQRT(Var(dS) + Var(dN))

The level of significance at which the null hypothesis is rejected depends on the alternative hypothesis (HA).

H0: dN = dS

HA: (a) dN ≠ dS (test of neutrality).
(b) dN > dS (positive selection).
(c) dN < dS (purifying selection).

For alternative hypotheses (b) and (c), we use a one-tailed test, and for (a) we use a two-tailed test. These three tests can be conducted directly for pairs of sequences, for the overall average across all sequences, or within groups of sequences. When testing for selection in a pairwise manner, you can compute the variance of (dN - dS) by using either the analytical formulas or the bootstrap resampling method.
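The Z-test and the one-tailed versus two-tailed distinction described above can be sketched in a few lines of Python. This is only an illustration that assumes Z follows a standard normal distribution under H0; the function names `normal_sf` and `z_test_selection` and the `alternative` labels are hypothetical and are not part of any particular program.

```python
import math

def normal_sf(z):
    """Upper-tail probability P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def z_test_selection(dn, ds, var_dn, var_ds, alternative="neutrality"):
    """Return (Z, P) for H0: dN = dS.

    alternative (hypothetical labels for HA):
      "neutrality" -> (a) dN != dS, two-tailed
      "positive"   -> (b) dN >  dS, one-tailed
      "purifying"  -> (c) dN <  dS, one-tailed
    """
    # Z = (dN - dS) / SQRT(Var(dS) + Var(dN))
    z = (dn - ds) / math.sqrt(var_ds + var_dn)
    if alternative == "neutrality":
        p = 2.0 * normal_sf(abs(z))   # both tails
    elif alternative == "positive":
        p = normal_sf(z)              # upper tail only
    elif alternative == "purifying":
        p = normal_sf(-z)             # lower tail only
    else:
        raise ValueError("unknown alternative: " + alternative)
    return z, p

# Made-up illustrative values, not real estimates.
for alt in ("neutrality", "positive", "purifying"):
    z, p = z_test_selection(0.30, 0.10, 0.0025, 0.0075, alt)
    print(alt, round(z, 3), round(p, 4))
```

With these made-up inputs Z is close to 2, so the one-tailed test for positive selection gives a P value that is half of the two-tailed value, while the test for purifying selection gives a P value near 1.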

For data sets containing more than two sequences, you can compute the average number of synonymous substitutions and the average number of nonsynonymous substitutions to conduct a Z-test in a manner similar to the one described above. The variance of the difference between these two quantities is estimated by the bootstrap method (Nei and Kumar [2000], page 56).
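As a rough sketch of the bootstrap step, the Python code below resamples codon positions with replacement, re-estimates the average dN and dS for each replicate, and takes the variance of the replicate differences. It assumes that codons are the resampling unit and that the caller supplies an `estimate` function returning the pair (average dN, average dS); `bootstrap_var_difference`, `estimate`, and `toy_estimate` are hypothetical names, and the toy estimator is only a stand-in for a real distance method.

```python
import random
import statistics

def bootstrap_var_difference(codons, estimate, replicates=500, seed=0):
    """Bootstrap variance of (average dN - average dS).

    codons   : list with one entry per codon position (resampling unit)
    estimate : caller-supplied function, sample -> (avg_dN, avg_dS)
    """
    rng = random.Random(seed)
    n = len(codons)
    diffs = []
    for _ in range(replicates):
        # Draw n codon positions with replacement.
        sample = [codons[rng.randrange(n)] for _ in range(n)]
        dn, ds = estimate(sample)
        diffs.append(dn - ds)
    # Sample variance of the difference across replicates.
    return statistics.variance(diffs)

def toy_estimate(sample):
    """Toy stand-in: each codon carries made-up (dN, dS) contributions."""
    dn = sum(c[0] for c in sample) / len(sample)
    ds = sum(c[1] for c in sample) / len(sample)
    return dn, ds

# Made-up per-codon values, for illustration only.
toy_codons = [(0.10 * (i % 4), 0.05 * (i % 3)) for i in range(60)]
var_diff = bootstrap_var_difference(toy_codons, toy_estimate)
dn, ds = toy_estimate(toy_codons)
z = (dn - ds) / var_diff ** 0.5
```

Here the Z statistic divides the observed difference by the square root of the bootstrap variance of the difference, rather than by SQRT(Var(dS) + Var(dN)).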