I decided to devote this week's topic to the proportion test, commonly called the two-proportion z-test. It is quite different from the popular parametric (t-test) and nonparametric (ks.test or Wilcoxon test) tests, because it compares proportions rather than measured values. Nonetheless, it is increasingly important in biology, in particular in the analysis of copy number variation.

Suppose we have two sets of measurements, with n1 members in one set and n2 members in the other. Each measurement takes an integer value from 0 to M that indicates the state of the member. Let X1 be the number of members with the output value of interest in the n1-member set, and X2 the corresponding number in the other set.

We then want to know whether the two sets have the same proportion of the output of interest.

For example, the copy numbers of two sets of genes (one with 100 genes and the other with 200 genes) are measured. The copy number varies from 0 to 4, where 2 means normal (one copy from the father and one from the mother). Suppose 80 genes have copy number 3 in the 100-gene set and 100 genes have copy number 3 in the 200-gene set. Then we have n1=100, n2=200, X1=80, X2=100, and the output value of interest is a copy number of 3. The question is: do the two sets have the same or different proportions of genes with copy number 3?

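To answer it, we need to calculate a z-score:

$$z = \frac{p_1 - p_2}{\sqrt{p\,(1-p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

where p1 = X1/n1 and p2 = X2/n2 are the proportions in the two sets, and p = (X1+X2)/(n1+n2) is the pooled proportion.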

The **assumptions** are that

X1 >= 5 and X2 >= 5 and (n1-X1) >= 5 and (n2-X2) >= 5. [Some people increase 5 to 10.]
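The count check above is easy to script. Here is a minimal sketch in R using the counts from the gene example; the function name `counts_ok` and its `min_count` argument are made up for illustration and are not part of any package.

```r
# Hypothetical helper: TRUE when the counts of "interesting" and
# "other" members in both sets all reach the minimum count.
counts_ok <- function(x1, n1, x2, n2, min_count = 5) {
  all(c(x1, n1 - x1, x2, n2 - x2) >= min_count)
}

# Counts from the gene example: 80 of 100 genes and 100 of 200 genes
ok_default <- counts_ok(80, 100, 100, 200)       # threshold 5
ok_strict  <- counts_ok(80, 100, 100, 200, 10)   # stricter threshold 10
ok_small   <- counts_ok(3, 100, 100, 200)        # only 3 genes of interest in set 1

print(c(ok_default, ok_strict, ok_small))  # TRUE TRUE FALSE
```

Setting `min_count = 10` gives the stricter version of the rule that some people prefer.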

Under the null hypothesis that the two proportions are the same, the z-score approximately follows a standard normal distribution, that is, a normal distribution with mean equal to 0 and standard deviation equal to 1. Therefore, if |Z| >= 1.96, the two-sided p-value is less than 0.05 and the null hypothesis is rejected at the 5% level.
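To make the decision rule concrete, here is a short sketch that works through the gene example by hand (80 of 100 genes against 100 of 200 genes). It assumes the standard pooled form of the two-proportion z-score; the variable names are chosen only for this illustration.

```r
# Counts from the gene example
x1 <- 80;  n1 <- 100
x2 <- 100; n2 <- 200

p1 <- x1 / n1                  # 0.8
p2 <- x2 / n2                  # 0.5
p  <- (x1 + x2) / (n1 + n2)    # pooled proportion, 0.6

# Pooled two-proportion z-score and its two-sided p-value
z <- (p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
pvalue <- 2 * pnorm(-abs(z))

print(z)       # 5, well beyond the 1.96 threshold
print(pvalue)  # about 5.7e-07
```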
**A little example in R is here:**

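g1 <- read.table(file="http://baoqiang.org/rdata/g1.dat")
g2 <- read.table(file="http://baoqiang.org/rdata/g2.dat")
x1 <- length(which(g1[,2]==3))  # genes with copy number 3 in set 1
x2 <- length(which(g2[,2]==3))  # genes with copy number 3 in set 2
n1 <- nrow(g1)  # number of genes in set 1
n2 <- nrow(g2)  # number of genes in set 2
p1 <- x1/n1
p2 <- x2/n2
p <- (x1+x2)/(n1+n2)  # pooled proportion
min_count <- 10  # smallest count accepted without adjustment
if (x1 < min_count || (n1-x1) < min_count) {
p1 <- (x1+2)/(n1+4)  # add 2 to the count of interest and 4 to the total
}
if (x2 < min_count || (n2-x2) < min_count) {
p2 <- (x2+2)/(n2+4)
}
if ((x1+x2) < min_count || (n1+n2-x1-x2) < min_count) {
p <- (x1+x2+2)/(n1+n2+4)
}
z <- (p1-p2)/sqrt(p*(1-p)*(1/n1+1/n2))
print(z)

The printed value is the z-score. A negative value means that the first set has the smaller proportion of genes with copy number 3.

**Then calculate the p-value:**

pvalue <- 2*pnorm(-abs(z))  # two-sided p-value from the standard normal distribution
print(pvalue)

The printed value is the two-sided p-value.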
Conclusion: a p-value below 0.05 means that the two data sets have different proportions of genes with copy number 3; that is, the measured genes in the two sets differ significantly in terms of copy number.
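As a cross-check, base R's `prop.test` runs the same comparison as a chi-squared test. The sketch below uses the counts from the gene example rather than the data files, and assumes the small-count adjustment is not needed. With `correct = FALSE` the continuity correction is switched off, so the chi-squared statistic equals the square of the pooled z-score.

```r
x <- c(80, 100)    # genes with copy number 3 in each set
n <- c(100, 200)   # genes measured in each set

# Two-sample test of equal proportions without continuity correction
res <- prop.test(x, n, correct = FALSE)

z_abs <- sqrt(unname(res$statistic))  # magnitude of the z-score
print(z_abs)        # 5
print(res$p.value)  # about 5.7e-07
```

Note that `prop.test` reports only the squared statistic, so the sign of the z-score (which set has the larger proportion) has to be read from the estimated proportions in `res$estimate`.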