Archive for May, 2007

Hidden Markov Models

Monday, May 28th, 2007

Hidden Markov Models (HMMs) are a kind of Markov model, and probably the most widely used Markov model in biology. The very essence of a Markov model is that it introduces a relation between the current step of a system and the previous one, which is a rather natural way of reasoning, or at least, according to conventional wisdom, it should be. The previous state of the system has a strong impact on the current state. One could argue that not only the immediately previous step/state but many earlier ones, if not all of them, should also influence the current step/state to some extent. It is true that we could consider more than one previous step/state, but from a computational point of view that is too expensive. Another reason, I think, is decay: the correlation between steps/states is dominated by the nearest neighbors. Unless we strongly believe that the contribution from next-nearest steps/states cannot be ignored, it is safe and economical to consider only the nearest contribution.

HMM problems can be assigned to two classes (again, from my biology experience): one where the states of a given sequence are not labeled, and one where the states are labeled. There are many packages available for the unlabeled case, such as the Statistics Toolbox in Matlab, UMDHMM, etc. One can easily format a problem to fit what these methods require and then use them. However, to the best of my knowledge, there are no packages available for the labeled case.

I’m working on developing a package for my problem, which is, unfortunately/fortunately, a labeled HMM problem.
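For the labeled case, parameter estimation is actually much simpler than for the unlabeled one: with observed state labels, no Baum-Welch iteration is needed, and the transition and emission probabilities are just normalized counts. A minimal sketch (the data format here is my own assumption, not from any existing package):

```python
# Supervised HMM estimation: probabilities from counts over labeled data.
from collections import defaultdict

def estimate_hmm(sequences):
    """sequences: list of [(symbol, state), ...] with observed state labels."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for sym, state in seq:
            emit[state][sym] += 1                 # emission counts
        for (_, s1), (_, s2) in zip(seq, seq[1:]):
            trans[s1][s2] += 1                    # transition counts
    def normalize(table):
        # turn raw counts into conditional probabilities per state
        return {k: {kk: v / sum(d.values()) for kk, v in d.items()}
                for k, d in table.items()}
    return normalize(trans), normalize(emit)
```

Decoding a new sequence would then be the standard Viterbi algorithm on top of these tables.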

DNA bendability

Sunday, May 20th, 2007

It had been a basic concept that short DNA segments, typically anything shorter than 500 bp, can be wrapped around histones only with assistance from other proteins. In other words, from a thermodynamic point of view, bending short DNA segments means climbing an energetic hill. However, this concept was seriously challenged in an experimental paper by Cloutier and Widom.

DNA twisting flexibility and the formation of sharply looped protein-DNA complexes
Cloutier, T.E. and Widom, J., PNAS 102, 3645-3650 (2005)

They observed that the inherent bendability of 89- to 105-bp DNA circles is as much as 400 times larger than expected (based on "old" concepts and theories). They also stated that the experiment supports two new theories regarding DNA bendability. The two theories are

Localized single-stranded bubble mechanism for cyclization of short double helix DNA
Yan, J. and Marko, J. F., PRL 93, 108108 (2004)

Exact theory of kinkable elastic polymers
Wiggins, P.A., Phillips, R., and Nelson, P. C., PRE 71, 021909 (2005)

Both theories introduce an energy function that depends on the length of the DNA and on the tangent vectors describing the segment orientation along the DNA loop. Starting from the energy function, they are very much physics, by which I mean they use different approaches to solve the "physics" problem: Yan and Marko build a transfer matrix and solve it numerically, while Wiggins et al. construct a partition function and solve that. In the end, both show that bendability is length dependent, and both explain the cyclization of short DNA sequences.
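For reference, the harmonic (worm-like chain) bending energy that both papers take as their starting point looks schematically like this (my paraphrase, not the papers' exact notation):

```latex
% Schematic bending energy of a DNA segment of contour length L, in units
% of k_B T; \hat{t}(s) is the tangent vector at arclength s and \ell_p is
% the bending persistence length (about 50 nm for DNA).
\[
  \frac{E}{k_B T} \;=\; \frac{\ell_p}{2} \int_0^L
    \left| \frac{d\hat{t}}{ds} \right|^{2} \, ds
\]
% Yan and Marko add a locally melted "bubble" state, and Wiggins et al. a
% kink term; both lower the energy of sharp bends, which is what makes
% short loops far easier to close than this harmonic form predicts.
```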

The theories are very good and relevant, but I have to say that much more theoretical work remains to be done, because the key feature of DNA bendability is that it is sequence dependent. Even without assistance from other proteins or molecules, bendability is still very sensitive to the sequence of the DNA segment. The sequence is the key beneficiary/purpose of the bending: it is either protected from an "unpleasant" environment or reserved to function in designed ways. Molecular dynamics could be one approach to this question, but it is too expensive and, in some sense, empirical. So until some new ground-breaking theories appear, we have to wait.

Classification

Thursday, May 10th, 2007

Classification has been applied intensively in biology and medicine. It can be used either to assign new biological or clinical data to known classes or to predict the output classes for the data. The number of classes can be any size, so classification approaches fall into two groups: binary classification, and multicategory classification, in which there are more than two classes. On the other hand, the methods used to solve classification problems can be divided into two groups by their use of support vector machines (SVMs): non-SVM-based methods and SVM-based methods. Popular non-SVM-based methods are K-means, K-nearest neighbors (KNN), artificial neural networks (NNs), and probabilistic neural networks (PNNs). The choices for SVM-based methods vary with the number of classes and hence with the learning strategies for the different classes. I'd like to discuss both non-SVM- and SVM-based classifiers a little more.

1. SVM based classifiers
Support vector machines have been implemented by several groups and used successfully and broadly in computational biology/medicine, economics, etc. The main attraction of SVMs, to me, is that they are simple to use and interpret. The main SVM solvers are SVM-Light and libSVM; I have good things to say about both of them.

1.1 Binary SVMs
1.2 Multicategory SVMs: one-versus-rest (OVR)
1.3 Multicategory SVMs: one-versus-one (OVO)
1.4 Multicategory SVMs: DAGSVM
1.5 Multicategory SVMs: Weston and Watkins (WW)
1.6 Multicategory SVMs: Crammer and Singer (CS)
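The OVR and OVO strategies above can be sketched quickly with scikit-learn's SVC (a modern stand-in for SVM-Light/libSVM; the three-class toy data is made up):

```python
# OVR trains one binary SVM per class (class k vs. the rest); OVO trains
# one per pair of classes, k(k-1)/2 machines, and takes a majority vote.
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X = [[0, 0], [0, 1], [4, 0], [4, 1], [2, 4], [2, 5]]  # three separated blobs
y = [0, 0, 1, 1, 2, 2]

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # 3 binary machines
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)   # 3 pairwise machines
```

WW and CS differ in that they solve a single joint optimization over all classes rather than combining independent binary machines.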

2. Non-SVM based classifiers

2.1 K-means
2.2 K-nearest neighbors (KNNs)
2.3 Artificial neural networks (NNs)
2.4 Probabilistic neural networks (PNNs)
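Of the non-SVM methods, KNN is the simplest to state: a from-scratch sketch on toy data (real work would use an optimized library, but the whole idea is just distance plus majority vote):

```python
# Minimal k-nearest-neighbors classifier.
from collections import Counter
import math

def knn_predict(points, labels, query, k=3):
    """Label of query = majority vote among the k nearest training points."""
    nearest = sorted(range(len(points)),
                     key=lambda i: math.dist(points[i], query))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]
```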

[to be continued]

A talk on plant evolution

Wednesday, May 2nd, 2007

There was a talk in the Beadle Center this afternoon. Dr. John Doebley, an established geneticist, presented his work on maize and its evolution. He has done excellent work in crop domestication, evolutionary genetics, etc. I learned a lot from his talk. He started with three questions, which I hope I got right.

1. How many genes control a new phenotype?
2. What kinds of genes are involved in a new phenotype?
3. What is the nature of the changes in these genes?

At the end of his talk, he gave his answers based on his research on teosinte and maize.
1. There are a couple of genes (not hundreds or thousands).
2. The majority are transcriptional regulators, plus a few cell-to-cell signaling genes.
3. a) Regulatory
b) Change protein function (loss/activation of function)

It reminds me, once again, that the transcriptional regulatory network and the signaling network are very important.

Prediction of Protein-Protein interactions

Tuesday, May 1st, 2007

There is a trend in computational biology: people expand their "mature" methods bit by bit to reach real biology. For example, protein structure prediction has been investigated intensively for decades, with subjects ranging from 1-D (sequence-level prediction) and 2-D (secondary structure prediction) to 3-D (tertiary structure prediction). The 3-D structure is the critical intermediate goal toward understanding protein function, because from structure we can derive the association rates of a protein with other molecules. The underlying reason is that, down at the atomic scale, for all the complexity of organismal phenotypes, everything is just interactions among the atoms the organisms are made of. It is a kind of bottom-up strategy. Unfortunately, reconstructing the function of organisms "ab initio", i.e., the protein folding problem, is not trivial at all. The usual barriers are the correctness of the potential parameters (assuming we don't want to use quantum mechanics) and the computational expense. In the meantime, people have used homologs (evolutionary information) to improve secondary structure prediction with great success; the accuracy can be near 80% for any given sequence. Then one day some genius said: why not mix them, using both evolutionary information and the bottom-up approach to improve protein folding prediction? After the great success of Rosetta, there are now many pretty good 3-D predictors. In general, proteins with a few hundred amino acids can be predicted with reasonably high accuracy.
Another big question for the folding people is that phenotypes are actually more likely to be understood (simulated) through the interactions between proteins, which are their functional units. So people said: OK, how about using mature structure prediction to identify protein-protein interactions? Once again, a small step away from protein structure prediction, but one step closer to the bigger biological question. And again, there are many examples of predictors using this approach.
However, one new approach caught my sleepy eye. Well, it is new to me.

Predicting protein-protein interactions based only on sequences information
Shen, J., Zhang, J., et al., PNAS 104, 4337-4341 (2007)

I'll briefly describe the implementation of the method here.
1. Categorize the 20 amino acids into 7 classes according to their physicochemical properties.
2. Break each protein sequence into triads (sliding windows of three residues) and count the frequency of each triad, so each protein is represented by a 7*7*7 = 343-dimensional vector whose elements are the triad frequencies.
3. Convert frequencies to a normalized form by standardizing the frequency vector: d_i = (f_i - min(f_1, f_2, …, f_343)) / max(f_1, f_2, …, f_343).
4. If proteins A and B interact, concatenate their vectors, D_AB = (d_A_1, d_A_2, …, d_A_343, d_B_1, d_B_2, …, d_B_343), and treat it as a positive example. If proteins I and J interact, the 686-dimensional vector D_IJ is another positive example. Negatives are any of D_AI, D_AJ, D_BI, and D_BJ, assuming, of course, that there are no interactions between A and I/J or between B and I/J.
5. Develop a customized SVM kernel to do the classification.
6. The accuracy is pretty high.
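The feature-construction steps above can be sketched in a few lines. The 7-class amino-acid grouping below is only illustrative; the exact classes should be taken from the paper itself:

```python
# Conjoint-triad-style features: sequence -> 343-dim vector -> 686-dim pair.
from itertools import product

GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]  # assumed grouping
CLASS_OF = {aa: c for c, group in enumerate(GROUPS) for aa in group}
TRIAD_INDEX = {t: i for i, t in enumerate(product(range(7), repeat=3))}  # 343

def triad_vector(seq):
    """d_i = (f_i - min f) / max f over the 343 class-triad frequencies."""
    f = [0.0] * 343
    classes = [CLASS_OF[aa] for aa in seq]
    for i in range(len(classes) - 2):              # slide a 3-residue window
        f[TRIAD_INDEX[tuple(classes[i:i + 3])]] += 1
    lo, hi = min(f), max(f)
    return [(x - lo) / hi for x in f]              # step-3 normalization

def pair_vector(seq_a, seq_b):
    """Step 4: concatenate two 343-vectors into one 686-dim SVM example."""
    return triad_vector(seq_a) + triad_vector(seq_b)
```

The resulting 686-dimensional vectors are what the customized SVM kernel in step 5 would then consume.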