Archive for September, 2007

Personal genomics

Friday, September 28th, 2007

The genomics field has been exciting recently, in particular from the human genomics perspective, with the availability of the genomes of James Watson and Craig Venter. A question comes to my mind, one that I am sure was asked long before people started sequencing the two genomes: what can we do with them? (What can we make out of them?)

There has been a lot of discussion about the potential benefits on a variety of platforms: the web and blogs, media, conferences, and papers. Personal genomics will affect our future from many perspectives, such as ethical, social, and clinical. In particular, more celebrities are coming forward to have their genomes sequenced. The richness of individual genomes presents an unprecedented opportunity to study human genotype and phenotype and, more excitingly, the relationship between genotype and phenotype for individuals or populations. That relationship could have broad implications for areas such as disease pathogenesis and privacy protection.

Overall, personal genomics remains both an opportunity and a challenge for us, and probably will for quite a while. So the question comes again: what can we make out of them?

New confrontation over nucleosome occupancy in yeast

Friday, September 21st, 2007

Research on nucleosome occupancy in yeast is heating up again. About half a year ago, Segal et al. made headlines in the New York Times with their discovery of a genomic code for nucleosomes in yeast. Their results were published in Nature. The central message of the paper is that AA/TT, TA, and GC patterns (with a period of 10 bp) in DNA sequences can be used to predict nucleosome positions. They developed a Markov model based on hundreds of highly selected and experimentally verified nucleosome sequences. The method is freely available to the public.
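To make "a period of 10 bp" concrete, here is a toy sketch in R. It is not Segal's model, only a way to look at the spacing between AA/TT/TA dinucleotides; the sequence `toy` is made up for illustration and is not yeast DNA.

```r
# Toy sequence (made up for illustration, not yeast DNA):
# an "AA" placed once every 10 bp in a GC background.
toy <- paste(rep("AAGCGCGCGC", 10), collapse = "")

# All overlapping dinucleotides of the sequence
n <- nchar(toy)
dinucs <- substring(toy, 1:(n - 1), 2:n)

# Start positions of AA, TT, or TA
hits <- which(dinucs %in% c("AA", "TT", "TA"))

# Distances between every pair of hits
spacing <- as.vector(dist(hits))
print(table(spacing))
```

In this constructed sequence every spacing is a multiple of 10. In real nucleosomal DNA the signal would be much noisier and would show up only as an enrichment of such spacings.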

But the nucleosome story does not end there, because of the latest paper in Nature Genetics. Let’s see how its authors comment on Segal’s discovery:

The well-established periodic (AA/TT/TA)-dominated dinucleotide nucleosome positioning pattern seems to have much less correlation with global nucleosome occupancy than other features. Because this pattern is clearly relevant in vitro and the signal is present across the genome, it is curious that the 199 sequences used to train Segal et al.’s model have a nearly random distribution of occupancy ratios in our data and do not correspond to well-positioned nucleosomes (data not shown). These apparent discrepancies can be reconciled if nucleosome occupancy across the genome is directed more often by exclusion signals (which would include almost all of the parameters in Fig. 6c), whereas local ‘translational and rotational’ settings, in addition to strongly positioned nucleosomes, are specified by periodic signals. In support of this possibility, the periodic AA/TT/TA signal is apparent in purified nucleosomal DNA fragments from Caenorhabditis elegans, although overall these dinucleotides are depleted in nucleosomal DNA relative to adjacent DNA.

The nice thing is that the authors found a strong correlation between the nucleosome positions in their own experimental data and those in Yuan’s experimental data, which makes the paper solid.

It is amazing. The scientific facts are out there regardless of who is looking for them or how hard they work, and a discovery may need more than one try before we get it right.

Two-proportion z-test

Monday, September 10th, 2007

I have decided to devote this week’s topic to the proportion test, commonly known as the two-proportion z-test. It is quite different from the popular parametric (t-test) and nonparametric (ks.test or Wilcoxon test) tests. Nonetheless, it is increasingly important in biology, in particular in the analysis of copy number variation.

Suppose we have two sets of measurements, with n1 members in one set and n2 members in the other. The output of each measurement ranges from 0 to M (an integer) and indicates the state of the member. Let X1 be the number of members with the output value of interest in the n1-member set, and X2 the corresponding number in the other set.

We then want to know whether the two sets have the same proportion of the output of interest.

For example, the copy numbers of two sets of genes (one with 100 genes and the other with 200 genes) are measured, and they vary from 0 to 4, where 2 means normal (one copy from the father and one from the mother). Suppose 80 genes have copy number 3 in the 100-gene set and 100 in the 200-gene set. Then we have n1=100, n2=200, X1=80, X2=100, and the output value of interest is copy number 3. The question is: do the two sets have the same or different proportions of genes with copy number 3?

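To answer it, we need to calculate a z-score:

z = (p1 - p2) / sqrt( p * (1 - p) * (1/n1 + 1/n2) )

where p1 = X1/n1, p2 = X2/n2, and p = (X1 + X2)/(n1 + n2) is the pooled proportion.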

The assumptions are that
X1 >= 5 and X2 >= 5 and (n1-X1) >= 5 and (n2-X2) >= 5. [Some people increase 5 to 10.]
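As a quick check, the counts from the gene example above (80 of 100 and 100 of 200) can be tested against these conditions. This is a minimal sketch; `counts_ok` is a hypothetical helper name, not a function from any package.

```r
# Returns TRUE when all four counts reach the minimum
# (counts_ok is a hypothetical helper for illustration).
counts_ok <- function(x1, n1, x2, n2, min_count = 5) {
  all(c(x1, n1 - x1, x2, n2 - x2) >= min_count)
}

ok_5    <- counts_ok(80, 100, 100, 200)                  # threshold 5
ok_10   <- counts_ok(80, 100, 100, 200, min_count = 10)  # stricter threshold 10
too_few <- counts_ok(3, 100, 100, 200)                   # only 3 genes of interest in set 1
print(c(ok_5, ok_10, too_few))
```

The smallest of the four counts in the example is 20, so it passes with either threshold, while a set with only 3 genes of interest does not.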

Under the null hypothesis, the z-score follows a standard normal distribution, that is, a normal distribution with mean equal to 0 and standard deviation equal to 1. Therefore, if |Z| >= 1.96, the two-sided p value is less than 0.05 and we reject the null hypothesis that the two proportions are the same at the 5% level. A little example in R is here:

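    # Read the two gene sets; column 2 holds the copy number
    g1 <- read.table(file="")
    g2 <- read.table(file="")

    # Number of genes with copy number 3, and the size of each set
    x1 <- length(which(g1[,2]==3))
    x2 <- length(which(g2[,2]==3))
    n1 <- nrow(g1)
    n2 <- nrow(g2)

    # Sample proportions and pooled proportion
    p1 <- x1/n1
    p2 <- x2/n2
    p <- (x1+x2)/(n1+n2)

    # For small counts, add 2 successes and 2 failures to each estimate
    min_count <- 10
    if (x1 < min_count || (n1-x1) < min_count) { p1 <- (x1+2)/(n1+4) }
    if (x2 < min_count || (n2-x2) < min_count) { p2 <- (x2+2)/(n2+4) }
    if ((x1+x2) < min_count || (n1+n2-x1-x2) < min_count) { p <- (x1+x2+2)/(n1+n2+4) }

    # z-score: difference in proportions divided by its pooled standard error
    z <- (p1-p2)/sqrt(p*(1-p)*(1/n1+1/n2))
    print(z)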

Here is the z-score:
[1] -108.3927

Calculate the p-value then

    # Two-sided p-value from the standard normal distribution
    pvalue <- 2*pnorm(-abs(z))
    print(pvalue)

p-value is:
[1] 0

Conclusion: the two data sets have different proportions of genes with copy number 3. Therefore, the chromosomes in the two data sets are significantly different in terms of copy number.
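For the worked example with 100 and 200 genes (X1 = 80, X2 = 100), the calculation can be sketched without any input files. This assumes the standard pooled two-proportion z-test; `two_prop_z` is a hypothetical helper name, and the cross-check uses base R's `prop.test` with the continuity correction turned off.

```r
# Two-proportion z-test with a pooled standard error.
# two_prop_z is a hypothetical helper name, not from any package.
two_prop_z <- function(x1, n1, x2, n2) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  p  <- (x1 + x2) / (n1 + n2)                  # pooled proportion
  se <- sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # pooled standard error
  z  <- (p1 - p2) / se
  list(z = z, p.value = 2 * pnorm(-abs(z)))    # two-sided p-value
}

res <- two_prop_z(x1 = 80, n1 = 100, x2 = 100, n2 = 200)
print(res$z)        # 5
print(res$p.value)  # about 5.7e-07

# Cross-check with base R: without continuity correction,
# prop.test reports X-squared equal to z^2.
chk <- prop.test(x = c(80, 100), n = c(100, 200), correct = FALSE)
print(unname(chk$statistic))  # 25
```

Here p1 = 0.8, p2 = 0.5, and the pooled proportion is 0.6, so the standard error is 0.06 and z = 0.3/0.06 = 5, well beyond 1.96.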


Wednesday, September 5th, 2007

It is quite interesting to see a database of (almost) all biological pathways that is not limited to human but includes other organisms as well.
Check it out here:
The reference is “Reactome: a knowledgebase of biological pathways”.

Wikipedia explains it as follows:

It is an on-line encyclopedia of core human pathways – DNA replication, transcription, translation, the cell cycle, metabolism, and signaling cascades – and can be browsed to retrieve up-to-date information about a topic of interest, e.g., the molecular details of the signaling cascade set off when the hormone insulin binds to its cell-surface receptor, or used as an analytical tool for the interpretation of large data sets like those generated by DNA microarray analysis.