## Archive for September, 2010

### Noise reduction on funnel-shaped energy landscapes

Monday, September 27th, 2010

This post is about an excellent paper by Andrew Stumpff-Kane and Michael Feig, published in Proteins: Structure, Function, and Bioinformatics 63:155-164 (2006). In the protein structure prediction field, almost everyone has his/her own scoring function to score/rank models for a target sequence; presumably, the model closest to the native structure should score highest/lowest. Since the closeness of a model to the native structure is usually measured by RMSD (or GDT_TS as in CASP, or the TM-score introduced by Zhang and Skolnick), a perfect scoring function would show a strong correlation between RMSD and score across all models. However, more often than not we see a very messy distribution of RMSD vs. score for CASP models; in other words, there is little or none of the hoped-for correlation. So the authors proposed a statistical solution, a correlation-based scoring function, to reduce the noise in the original scoring function. The noise Z added to the original score function W is assumed to be uncorrelated with the distance between a model and the native structure. The correlation coefficient is

$\rho_{r}(d(PP_{r}), W+Z)=\frac{Cov(d(PP_{r}), W+Z)}{\sqrt{Var(d(PP_{r}))(Var(W)+Var(Z))}}$

They found that the correlation of $\rho_{r}$ to $d(PP_{0})$ no longer depends on Z, that is

$\rho(d(PP_{0}), \rho_{r}(d(PP_{r}), W+Z))=\frac{Cov(d(PP_{0}), \rho_{r}(d(PP_{r}), W+Z))}{\sqrt{Var(d(PP_{0}))Var(\rho_{r}(d(PP_{r}), W+Z))}}$

So the proposed score of each model is calculated as:

$r_{i}=\frac{N\sum_{j \neq i}^{N}s_{j}d_{ij}-\sum_{j\neq i}^{N}s_{j}\sum_{j\neq i}^{N}d_{ij}}{\sqrt{N^{2} Var(s)Var(d)}}$, where

$d_{ij}$ is the distance between model i and model j, $s_{j}$ is the original score of model j, and N is the total number of models.
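Up to the N vs. N-1 convention in the sums, $r_{i}$ is just the Pearson correlation between the original scores $s_{j}$ and the distances $d_{ij}$ to model i. A minimal NumPy sketch (the function name and array layout are mine, not the paper's):

```python
import numpy as np

def correlation_scores(s, d):
    """Correlation-based score r_i for every model.

    s : (N,) array of original scores s_j
    d : (N, N) symmetric matrix of pairwise model distances d_ij

    Up to the N vs. N-1 convention, r_i is the Pearson correlation
    between the scores s_j and the distances d_ij over all j != i.
    """
    N = len(s)
    r = np.empty(N)
    for i in range(N):
        mask = np.arange(N) != i  # all models j != i
        # Pearson correlation of scores with distances to model i
        r[i] = np.corrcoef(s[mask], d[i, mask])[0, 1]
    return r
```

As a sanity check: for models on a line with the score equal to the distance from the native end (a perfect one-dimensional funnel), the model at the native end gets $r_{i}=1$ and the farthest model gets $r_{i}=-1$.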

It works well on the five data sets they chose. One reason it works is the assumption that all the models lie near or at the native structure and that their distribution is funnel-like, i.e., there is a single global minimum. The correlation score then assigns the largest weight to the model closest to that global minimum. In practice, they found it better to use a hybrid of the original score and this correlation-based score: use the correlation-based score to pre-select a limited number of models (for example, 10), and then use the original scoring function to rank the pre-selected models. This hybrid turns out to be better than either score alone.
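The hybrid procedure can be sketched as follows (the function name, and the assumption that a lower original score is better while a higher $r_{i}$ is better, are mine, not the paper's):

```python
import numpy as np

def hybrid_rank(s, r, k=10):
    """Hybrid ranking sketch: pre-select with the correlation-based
    score, then rank with the original score.

    s : (N,) original scores (assumed here: lower is better)
    r : (N,) correlation-based scores r_i (assumed: higher is better)
    k : number of models to pre-select

    Returns indices of the k pre-selected models, best first.
    """
    pre = np.argsort(r)[::-1][:k]   # top-k by correlation-based score
    return pre[np.argsort(s[pre])]  # re-rank those k by original score
```

For example, with k=3 the best model by the original score can still be dropped if its correlation-based score does not place it in the top three.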

Again, the assumption is that the models form a funnel-like distribution on the energy landscape.

### plot error bars in R

Friday, September 10th, 2010

Suppose we have a series of measurements that we would like to divide into consecutive subgroups, then plot the mean of each subgroup along with its associated uncertainty. Say we have two groups of measurements, d1 and d2, each with 30 measurements, and we want 3 points per group, i.e., we average every 10 measurements. Here is one way to do it:

library(psych)  # provides error.bars.by

# group labels: three consecutive subgroups of 10 measurements each
g=c(rep(1,10), rep(2,10), rep(3,10))

# mean and error bars for each subgroup of d1
error.bars.by(d1[1:length(g)], g, TRUE, xlab="Time", ylab="Pressure", main="W->L: 10ps", col=2, colors=2, pch=1)

# overlay d2 on the same plot
error.bars.by(d2[1:length(g)], g, TRUE, xlab="Time", ylab="Pressure", col=3, colors=3, pch=2, add=TRUE)

legend(x=1, y=max(c(d1, d2)), legend=c("Native", "Mutant"), col=2:3, pch=1:2)

### The book about Warren Buffett

Thursday, September 9th, 2010