Archive for June, 2007

No post for the week of Jun 18 – Jul 6

Tuesday, June 19th, 2007

I’m attending the 18th International Arabidopsis Research Conference in Beijing.

Notes from “A genomic code for nucleosome positioning”

Wednesday, June 13th, 2007

I have been reading this paper for a while, and it is actually a good one. Here is the reference:

Nature 442, 772-778 (17 August 2006) | doi:10.1038/nature04979; Received 16 March 2006; Accepted 14 June 2006; Published online 19 July 2006
A genomic code for nucleosome positioning
Eran Segal, Yvonne Fondufe-Mittendorf, Lingyi Chen, AnnChristine Thåström, Yair Field, Irene K. Moore, Ji-Ping Z. Wang and Jonathan Widom

Here are some notes that I took.

Data for nucleosome positioned sequences
Nucleosome-bound DNA sequences:
Chicken: Satchwell, S.C., Drew, H.R. & Travers, A.A. Sequence periodicities in chicken nucleosome core DNA. J Mol Biol 191, 659-75 (1986)
Mouse: Widlund, H.R. et al. Identification and characterization of genomic nucleosome-positioning sequences. J Mol Biol 267, 807-17 (1997)

Genome-wide nucleosome maps:
• Yuan, G.C. et al. Genome-scale identification of nucleosome positions in S. cerevisiae. Science 309, 626-30 (2005)
• Lee, C.K., Shibata, Y., Rao, B., Strahl, B.D. & Lieb, J.D. Evidence for nucleosome depletion at active regulatory regions genome-wide. Nat Genet 36, 900-5 (2004)
• Bernstein, B.E., Liu, C.L., Humphrey, E.L., Perlstein, E.O. & Schreiber, S.L. Global nucleosome occupancy in yeast. Genome Biol 5, R62 (2004)
Genomic sequences, predicted transcription start sites, and gene and chromosome annotations:
Cherry, J.M. et al. SGD: Saccharomyces Genome Database. Nucleic Acids Res 26, 73-9 (1998)

• Functional DNA binding site motifs, defined as motifs that are both conserved and bound by their matching transcription factor
• Genes bound by transcription factors
• Canonical sites (as opposed to functional transcription factor binding sites), defined as those regions that do not overlap with the functional sites and whose score is at least 0.9 of the best possible score that can be achieved with the weight matrix.

Harbison, C.T. et al. Transcriptional regulatory code of a eukaryotic genome. Nature 431, 99-104 (2004)
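
The 0.9 cutoff for canonical sites can be sketched roughly as follows. This is only an illustration under my own assumptions: the weight matrix is a made-up 3-position toy with a positive best score, the names `best_score`, `site_score` and `canonical_hits` are hypothetical, and the step that removes hits overlapping functional sites is left out.

```python
# Toy 3-position weight matrix; the scores are invented for illustration.
PWM = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": 0.0},
    {"A": -1.0, "C": 1.0, "G": 0.0, "T": -1.0},
    {"A": 0.0, "C": -1.0, "G": 1.0, "T": -1.0},
]


def best_score(pwm):
    """Highest score any site can reach: take the best base in each column."""
    return sum(max(column.values()) for column in pwm)


def site_score(pwm, site):
    """Score of one candidate site, summed column by column."""
    return sum(column[base] for column, base in zip(pwm, site))


def canonical_hits(pwm, seq, fraction=0.9):
    """Return (position, score) for windows scoring >= fraction * best score."""
    width = len(pwm)
    cutoff = fraction * best_score(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        score = site_score(pwm, seq[start:start + width])
        if score >= cutoff:
            hits.append((start, score))
    return hits


hits = canonical_hits(PWM, "TTACGACGTT")
print(hits)
```

In this toy case only the two exact "ACG" windows clear the cutoff.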

Functional TATA boxes
Basehoar, A.D., Zanton, S.J. & Pugh, B.F. Identification and distinct regulation of yeast TATA box-containing genes. Cell 116, 699-709 (2004)

Functional annotation for genes can be downloaded from Gene Ontology.
Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25-9 (2000)

Genes bound by chromatin remodeling factors
Robert, F. et al. Global position and recruitment of HATs and HDACs in the yeast genome. Mol Cell 16, 199-209 (2004)

Genes bound by members of the nuclear pore complex
Casolari, J.M. et al. Genome-wide localization of the nuclear transport machinery couples transcriptional status and nuclear organization. Cell 117, 427-39 (2004)

Nucleosome positioned sequences
Two-fold symmetry
Richmond, T.J. & Davey, G.A. The structure of DNA in the nucleosome core. Nature 423, 145-50 (2003)

Simplest elements capture the sequence-dependence of DNA bending
Olson, W.K., Gorin, A.A., Lu, X.J., Hock, L.M. & Zhurkin, V.B. DNA sequence-dependent deformability deduced from protein-DNA crystal complexes. Proc. Natl. Acad. Sci. USA 95, 11163-8 (1998)

Models for nucleosome positioned sequences
Mixture model
Wang, J.P. & Widom, J. Improved alignment of nucleosome DNA sequences using a mixture model. Nucleic Acids Res 33, 6743-55 (2005)

Expectation Maximization algorithm
Dempster, A.P., Laird, N.M. & Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39, 1-38 (1977)

Statistical tests for nucleosome occupancy
• Nucleosome occupancy at functional transcription factor DNA binding sites vs canonical (but presumed non-functional) sites
• Non-functional binding sites vs intergenic regions
Purpose: Do the functional and conserved binding sites of transcription factors prefer higher/lower nucleosome occupancy than their canonical sites?
Student t-test
Kolmogorov-Smirnov test
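
To make the comparison concrete, here is a small sketch of the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the empirical cumulative distributions of the two groups. The occupancy numbers are invented and the function name `ks_statistic` is my own; for real p-values one would normally rely on a statistics package rather than this bare statistic.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    # The gap between two step functions can only change at a data point,
    # so checking every observed value is enough.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Made-up occupancy values for two groups of sites.
functional = [0.10, 0.15, 0.20, 0.25]
canonical = [0.30, 0.35, 0.40, 0.45]
print(ks_statistic(functional, canonical))
```

A statistic near 1 means the two groups barely overlap; a statistic near 0 means their distributions look alike.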

Evaluating statistical significance of nucleosome occupancy of a gene set
Purpose: Does this specific gene set prefer higher/lower nucleosome occupancy in its coding regions or promoter regions?
Kolmogorov-Smirnov test
Use multiple hypothesis testing with the false discovery rate (FDR) of 0.05
FDR: Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B 57 (1), 289-300 (1995).
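
The FDR control at 0.05 can be sketched with the Benjamini-Hochberg step-up procedure from the reference above. This is a minimal sketch, not necessarily how the paper implemented it; the function name `benjamini_hochberg` and the p-values are made up.

```python
def benjamini_hochberg(pvalues, fdr=0.05):
    """Return one True/False per p-value: True means the test is rejected."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * fdr.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * fdr:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# One invented p-value per gene set.
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
print(benjamini_hochberg(pvals))
```

Note that the rule looks for the largest passing rank, so a p-value can be rejected even when a smaller-ranked one missed its own threshold.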

Evaluating statistical significance of nucleosome scores at intergenic and coding regions
Purpose: Are the nucleosome model scores at intergenic and coding regions significantly different from random expectation?
Student t-test

A hybrid of Monte Carlo and Hidden Markov Models

Sunday, June 3rd, 2007

Hidden Markov models (HMMs) fit quite a lot of biological problems. However, in my humble opinion, one thing slows their application: the local minimum problem. HMM training gets stuck in local minima very easily. Among the approaches to finding the global minimum, one is Monte Carlo (MC) simulation. MC is no secret, at least to my fellow physicists. The advantage of using MC for the sampling is that we can run a massive number of searches and potentially hit the global minimum. I also had to go back to C/C++ for its efficiency. I kinda like C/C++ in some respects.

The basic idea for the hybrid is that, in labeled HMMs, we have the observed states for each sequence; in other words, the states are not “hidden” anymore. What we need for prediction are the two matrices, the emission matrix and the transition matrix. So the problem becomes finding the optimal matrices for the HMM, the ones that identify the observed states with the best accuracy (not necessarily 100%, though we hope so). The ultimate problem is then how to efficiently sample the probability space (= emission + transition space), which is usually high-dimensional (in my case, at least 21 dimensions). When it comes to sampling a high-dimensional space, one’s first reaction would be Monte Carlo.
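
Scoring one candidate pair of matrices against labeled sequences might look like the sketch below: decode each sequence with Viterbi and count how many positions match the known labels. Everything here is an assumption for illustration (two toy states "N" and "L", a two-letter alphabet, a uniform start distribution, all probabilities strictly positive, and the names `viterbi` and `accuracy`); it is written in Python for brevity, not the C/C++ I actually used.

```python
import math


def viterbi(obs, states, start, trans, emit):
    """Most likely state path for obs, computed in log space."""
    prev = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for symbol in obs[1:]:
        cur, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            cur[s] = (prev[best] + math.log(trans[best][s])
                      + math.log(emit[s][symbol]))
            ptr[s] = best
        back.append(ptr)
        prev = cur
    # Trace back from the best final state.
    path = [max(states, key=lambda s: prev[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


def accuracy(labeled, states, start, trans, emit):
    """Fraction of positions where the decoded state equals the known label."""
    correct = total = 0
    for obs, labels in labeled:
        predicted = viterbi(obs, states, start, trans, emit)
        correct += sum(p == t for p, t in zip(predicted, labels))
        total += len(labels)
    return correct / total


STATES = ("N", "L")
START = {"N": 0.5, "L": 0.5}
TRANS = {"N": {"N": 0.9, "L": 0.1}, "L": {"N": 0.1, "L": 0.9}}
EMIT = {"N": {"A": 0.9, "G": 0.1}, "L": {"A": 0.1, "G": 0.9}}
LABELED = [("AAAGGG", "NNNLLL"), ("GGAA", "LLNN")]

print(accuracy(LABELED, STATES, START, TRANS, EMIT))
```

The `accuracy` value is the single number that a search over the matrix entries would try to push up.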

Monte Carlo actually fits such an approach quite well, especially when local minima are everywhere. The core idea is that we still sometimes accept a worse move (compared to the current state), as measured by a judgment criterion (accuracy, in my approach here). By doing that, the system tries very diverse states and follows a path that hopefully reaches the global minimum. The price to be paid for such a nondeterministic approach is search time.
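
One common way to "accept worse moves" is the Metropolis criterion, sketched below. I am assuming a fixed temperature and a toy one-dimensional score with a small peak and a large peak standing in for accuracy over the matrix space; the names `metropolis_search`, `toy_score` and `propose` are hypothetical. The sketch maximizes a score, since accuracy is something we want to be high.

```python
import math
import random


def metropolis_search(score, start, propose, steps, temperature, rng):
    """Random walk that always accepts better moves and sometimes worse ones."""
    current, current_score = start, score(start)
    best, best_score = current, current_score
    for _ in range(steps):
        candidate = propose(current, rng)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Worse moves (delta < 0) are accepted with probability exp(delta / T).
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score


def toy_score(x):
    """Small peak near x = -1 (local) and a larger peak near x = 2 (global)."""
    return 0.5 * math.exp(-(x + 1) ** 2) + math.exp(-(x - 2) ** 2)


def propose(x, rng):
    """Take a small random step away from the current point."""
    return x + rng.gauss(0.0, 0.5)


rng = random.Random(0)  # fixed seed so that runs are repeatable
best_x, best_val = metropolis_search(toy_score, -1.0, propose, 5000, 0.1, rng)
print(best_x, best_val)
```

A higher temperature makes it easier to climb out of the small peak but also makes the walk noisier, which is the search-time trade-off mentioned above.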