This post was prompted by an excellent paper by Andrew Stumpff-Kane and Michael Feig, published in Proteins: Structure, Function, and Bioinformatics 63:155-164 (2006). In the protein structure prediction field, almost everyone has their own scoring function to score and rank models for a target sequence; presumably, the model closest to the native structure should get the best score (the highest or the lowest, depending on the function). Since the closeness of a model to the native structure is usually measured by RMSD (or by GDT_TS as in CASP, or by the TM-score introduced by Zhang and Skolnick), a perfect scoring function should show a strong correlation between RMSD and score across all models. More often than not, however, plots of RMSD versus score for CASP models show a very scattered distribution; in other words, there is little or none of the hoped-for correlation. So the authors proposed a statistical solution: a correlation-based scoring function that reduces the noise in the original scoring function. The noise, Z, of the original scoring function (W) is assumed to be uncorrelated with the distance between a model and the native structure. The correlation coefficient

They found that the correlation of to is not dependent on Z anymore, that is

So the proposed score of each model is calculated as:

where

is the distance between model i and model j, is the original score of model i, N is the total number of models.
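
The formula itself did not survive in this post, so here is only a rough sketch of the idea rather than the authors' exact definition. It assumes the score of model i is the Pearson correlation, taken over all other models j, between the distance from i to j and the original score of j. The function names, the exclusion of the j = i term, and the toy data are my own assumptions. With an energy-like original score (lower is better), a model sitting near the bottom of the funnel should get a correlation close to +1.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant series as uncorrelated
    return cov / math.sqrt(var_x * var_y)

def correlation_scores(dist, scores):
    """For each model i, correlate dist[i][j] with scores[j] over all j != i.

    dist   -- N x N matrix of pairwise model-model distances (e.g. RMSD)
    scores -- original score of each of the N models
    """
    n = len(scores)
    result = []
    for i in range(n):
        others = [j for j in range(n) if j != i]  # assumption: skip the self term
        result.append(pearson([dist[i][j] for j in others],
                              [scores[j] for j in others]))
    return result

# Toy data (hypothetical): five models on a line, native structure at position 0,
# and a noisy energy-like score that grows with distance from the native.
positions = [0, 1, 2, 3, 4]
dist = [[abs(p - q) for q in positions] for p in positions]
scores = [0.5, 0.9, 2.2, 2.8, 4.1]
corr = correlation_scores(dist, scores)
best = max(range(len(corr)), key=lambda i: corr[i])  # index of the top-ranked model
```

Note that no native structure is needed here: only the pairwise distances between models and their original scores go in.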

It works well on the 5 data sets they chose. One reason it works is that it assumes all the models lie at or near the native structure and that their distribution is funnel-like, that is, there is a global minimum. The correlation score then gives the highest score to the model closest to that global minimum. In practice, they found it is better to combine the original score with the correlation-based score: use the correlation-based score to select a limited number of models (for example 10), and then use the original scoring function to rank the preselected models. This hybrid turns out to be better than either score alone.
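
The two-step hybrid described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `hybrid_rank`, the `lower_is_better` flag, and the convention that a higher correlation-based score is better are all assumptions on my part.

```python
def hybrid_rank(corr_scores, orig_scores, k=10, lower_is_better=True):
    """Preselect k models by correlation score, then order them by original score.

    corr_scores     -- correlation-based score per model (assumed: higher is better)
    orig_scores     -- original scoring-function value per model
    k               -- number of models to preselect (10 in the example above)
    lower_is_better -- True for energy-like original scores (an assumption)
    Returns the indices of the preselected models, best first.
    """
    indices = range(len(orig_scores))
    # Step 1: keep the k models with the highest correlation-based score.
    preselected = sorted(indices, key=lambda i: corr_scores[i], reverse=True)[:k]
    # Step 2: rank only those survivors with the original scoring function.
    return sorted(preselected, key=lambda i: orig_scores[i],
                  reverse=not lower_is_better)

# Hypothetical numbers: model 2 has the best original score but a poor
# correlation score, so it never reaches the final ranking.
ranking = hybrid_rank([0.9, 0.8, 0.1, -0.5], [2.0, 1.0, 0.5, 3.0], k=2)
```

The preselection step filters out models that score well only because of noise, and the original score then separates the few candidates that remain.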

Again, the assumption is that the models form a funnel-like distribution on the energy landscape.