Archive for February, 2011

A video from the Creator of Rhipe

Monday, February 21st, 2011

It is good to hear the creator himself talk about Rhipe.

RHIPE: An Interface Between Hadoop and R
Presented by Saptarshi Guha

Video Link

Reservoir sampling in perl

Saturday, February 12th, 2011

I had a very large data file and needed to randomly select some lines from it. Besides the built-in sampling functions in R or Perl, there is a very elegant algorithm for this called reservoir sampling: keep the first N lines, then let each later line replace a random slot with a steadily shrinking probability, so that every line ends up in the sample with equal probability. Here is a piece of Perl script that I implemented to do it:
#!/usr/bin/perl
use strict;
use warnings;

# Reservoir-sample a large file to get N lines, each with equal probability
sub ResSampleFile {
    my ($file, $N) = @_;
    open(my $FH, "<", $file) || die "Couldn't open $file! $!\n";
    my $count = 0;
    my @ret;
    while (my $line = <$FH>) {
        $count++;
        if ($count <= $N) {
            # fill the reservoir with the first N lines
            push @ret, $line;
        } else {
            # keep this line with probability N/$count, replacing a random slot
            my $random = int(rand($count)) + 1;
            if ($random <= $N) {
                $ret[$random - 1] = $line;
            }
        }
    }
    close($FH);
    return @ret;
}

#Usage:
#my @sampled = ResSampleFile("abc.txt", 100);
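A quick note on why this gives every line an equal chance (my own sketch, not part of the original script): assume each of the first $t \ge N$ lines is currently in the reservoir with probability $N/t$. Line $t+1$ is inserted with probability $N/(t+1)$, and a line already in the reservoir survives that step with probability

$$1 - \frac{N}{t+1}\cdot\frac{1}{N} = \frac{t}{t+1}, \qquad \text{so} \qquad \frac{N}{t}\cdot\frac{t}{t+1} = \frac{N}{t+1}.$$

By induction, once the whole file has been read, every line is retained with probability $N/\text{count}$.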

Install Rhipe (R+Hadoop) on Ubuntu

Sunday, February 6th, 2011

Rhipe is an excellent package for integrating Hadoop and the R environment, developed by Saptarshi Guha. This is a very good trend for many bioinformatics people. More and more, we will have tremendous amounts of data piling up, on the order of terabytes daily. Any query against such growing data becomes a daunting job. MapReduce, which was developed by Google, seems a good solution for such tasks. It uses a distributed file system to store data across nodes and then uses the MapReduce algorithm to process the data very efficiently. Hadoop is an open-source implementation of Google's MapReduce. On the other hand, R is a star in the statistics community, in particular in academia. Now we have Rhipe, the integration of R and Hadoop. Here is how to install them.

Install Hadoop on Ubuntu
There is a very good post on installing Hadoop on a single-node Ubuntu machine; I followed it and was able to install Hadoop without trouble. One note: use sudo!
Make sure to change the environment settings (for example, for the bash shell, edit the .bashrc file in your home directory). Suppose "/home/xyz/hadoop" is the directory where Hadoop is installed; then

export HADOOP=/home/xyz/hadoop
export HADOOP_BIN=/home/xyz/hadoop/bin/hadoop

Install Google’s Protocol Buffers
Although the installation instructions are good, the following small modifications will certainly make life much easier.
After downloading the package and unzipping it, go into the unzipped directory and do the following:

sudo ./configure --prefix /usr
export LD_LIBRARY_PATH=/usr/local/lib
sudo make
sudo make install

Install Rhipe
Download the source code and then

sudo R CMD INSTALL Rhipe_0.65.2.tar.gz

Test Rhipe
Start Hadoop first (bin/start-all.sh), then start R and type:
library(Rhipe)
rhinit()

If you see

Rhipe initialization complete
Rhipe first run complete
[1] TRUE

you are officially running Rhipe; enjoy Hadoop and R. Otherwise, tell us what you got …

k-Nearest Neighbor and K-Means clustering

Saturday, February 5th, 2011

These are arguably two of the most commonly used classification and clustering methods. One reason is that they are easy to use and fairly straightforward. So how do they work?

k-Nearest Neighbor:
We are given N n-dimensional entries, each with a known class, where the number of classes is k; that is, $\{\vec{x}_i, y_i\}$, $\vec{x}_i \in \Re^{n}$, $y_i \in \{c_1, \ldots, c_k\}$, $i = 1, \ldots, N$. For a new entry $\vec{v}_j$, which class should it belong to? We use a distance measure to find the k closest entries to $\vec{v}_j$, and the final decision is a simple majority vote among those k nearest neighbors. The distance metric could be Euclidean or another similar one.
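To make this concrete, here is a small sketch in R (my own illustration, not from the original post); the function name knn_classify and the toy data are made up, and in practice one would usually reach for an existing implementation such as knn() in the class package.

# Classify one new point v by majority vote among its k nearest neighbors.
# x: N x n matrix of training points; y: vector of N class labels.
knn_classify <- function(x, y, v, k = 3) {
  # Euclidean distance from v to every training point
  d <- sqrt(rowSums((x - matrix(v, nrow(x), length(v), byrow = TRUE))^2))
  # Labels of the k closest training points
  nearest <- y[order(d)[1:k]]
  # Majority vote (ties broken by the first mode returned)
  names(which.max(table(nearest)))
}

# Toy usage with made-up data: two Gaussian clouds labeled "a" and "b"
set.seed(1)
x <- rbind(matrix(rnorm(20, 0), ncol = 2), matrix(rnorm(20, 3), ncol = 2))
y <- rep(c("a", "b"), each = 10)
knn_classify(x, y, v = c(2.8, 3.1), k = 5)   # for this toy data the vote should be "b"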

K-means:
Given N n-dimensional entries, we want to classify them into k clusters. At first, we randomly choose k entries and assign one to each of the k clusters; these are the seeds. Then we calculate the distance between each entry and each cluster centroid, and assign each entry to its closest cluster. After the assignment is complete, we recalculate the centroid of each cluster from its new members. After the centroid calculation, we go back to the distance calculation and a new round of assignment. We stop the iteration when it converges, i.e., the centroids and assignments no longer change.
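Again purely as an illustration (not from the original post), a bare-bones version of this loop in R could look like the sketch below; the function name kmeans_simple is made up, and in practice R's built-in kmeans() does all of this and handles edge cases such as empty clusters.

# Minimal K-means sketch: x is an N x n matrix, k the number of clusters.
# (Empty clusters are not handled here.)
kmeans_simple <- function(x, k, max_iter = 100) {
  # Seed: pick k random entries as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  cl <- rep(0, nrow(x))
  for (iter in 1:max_iter) {
    # Squared distance from every point to every centroid
    d <- sapply(1:k, function(j)
      rowSums((x - matrix(centroids[j, ], nrow(x), ncol(x), byrow = TRUE))^2))
    new_cl <- max.col(-d)             # index of the closest centroid per row
    if (all(new_cl == cl)) break      # converged: assignments no longer change
    cl <- new_cl
    # Recompute each centroid as the mean of its members
    for (j in 1:k)
      centroids[j, ] <- colMeans(x[cl == j, , drop = FALSE])
  }
  list(cluster = cl, centers = centroids)
}

# Toy usage with the same kind of made-up data as above
set.seed(2)
x <- rbind(matrix(rnorm(20, 0), ncol = 2), matrix(rnorm(20, 3), ncol = 2))
kmeans_simple(x, k = 2)$cluster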

Both methods require us to specify k before they can run: the number of neighbors to consult for k-NN and the number of clusters for K-means. Strictly speaking, k-NN is a supervised method, since it needs labeled training entries, while K-means is unsupervised.