Archive for the ‘R’ Category

A video from the Creator of Rhipe

Monday, February 21st, 2011

It is good to hear the creator talk about Rhipe.

RHIPE: An Interface Between Hadoop and R
Presented by Saptarshi Guha

Video Link

Install Rhipe (R+Hadoop) on Ubuntu

Sunday, February 6th, 2011

Rhipe is an excellent package, developed by Saptarshi Guha, that integrates the Hadoop and R environments. This is a very good trend for many bioinformatics people. More and more, we will have tremendous amounts of data piling up, at terabytes per day. Any query against such growing data will be a daunting job. MapReduce, which was developed by Google, seems a good solution for such tasks. It stores data across nodes in a distributed file system and then runs the MapReduce algorithm over those nodes in parallel, which makes such searches very efficient. Hadoop is an open source implementation of Google’s MapReduce. On the other end, R is a star in the statistics community, particularly in academia. Now we have the integration of both R and Hadoop: Rhipe. Here is how to install them.
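Before the installation steps, the map, group and reduce idea described above can be sketched in plain R. This is only a toy, single-machine illustration and does not use Hadoop or the Rhipe API; the sample records and the helper name map_fn are made up for the example.

```r
# Toy word count in the MapReduce style (single machine, base R only).
# The records and the helper map_fn are hypothetical, for illustration.
records <- c("a b a", "b c", "a")

# Map step: turn one record into (key, value) pairs, one pair per word.
map_fn <- function(rec) {
  words <- strsplit(rec, " ", fixed = TRUE)[[1]]
  data.frame(key = words, value = 1L, stringsAsFactors = FALSE)
}
pairs <- do.call(rbind, lapply(records, map_fn))

# Shuffle step: collect all values that share the same key.
grouped <- split(pairs$value, pairs$key)

# Reduce step: combine the values of each key into a single count.
counts <- vapply(grouped, sum, integer(1))
print(counts)
```

On a real cluster the map and reduce steps would run on many nodes at once, which is where the speed comes from.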

Install Hadoop on Ubuntu
There is a very good post on installing Hadoop on a single-node Ubuntu machine; I followed it and was able to install Hadoop. One note: use sudo!
Make sure to change the environment settings (for example, if you use bash, edit the .bashrc file in your home directory). Suppose “/home/xyz/hadoop” is the directory where Hadoop is installed; then add:

export HADOOP=/home/xyz/hadoop
export HADOOP_BIN=/home/xyz/hadoop/bin/hadoop
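
A quick way to confirm that R actually sees these two variables is to query them from an R session started in a fresh shell. The helper name check_env below is hypothetical; Sys.getenv is base R.

```r
# Report which of the expected environment variables are visible to R.
# check_env is a hypothetical helper name, not part of Rhipe or Hadoop.
check_env <- function(vars = c("HADOOP", "HADOOP_BIN")) {
  vals <- Sys.getenv(vars, unset = NA)
  # TRUE only when the variable is set and not an empty string
  setNames(!is.na(vals) & nzchar(vals), vars)
}

print(check_env())
```

If either entry is FALSE, the export lines were probably not picked up by the shell that launched R.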

Install Google’s Protocol Buffers
Although the installation instructions are good, the following “modification” will certainly make life much easier.
After downloading the package and unzipping it, change to the unzipped directory and run the following:

sudo ./configure --prefix /usr
export LD_LIBRARY_PATH=/usr/local/lib
sudo make
sudo make install

Install Rhipe
Download the source code and then run:

sudo R CMD INSTALL Rhipe_0.65.2.tar.gz

Test Rhipe
Start Hadoop first with bin/start-all.sh, then start R and type:
library(Rhipe)
rhinit()

If you see
“Rhipe initialization complete
Rhipe first run complete
[1] TRUE”

then you are officially set up with Rhipe and can enjoy Hadoop and R. Otherwise, tell us what you got …

Plot error bars in R

Friday, September 10th, 2010

For example, suppose there is a group of measurements that we would like to divide into consecutive subgroups, then plot the mean value of each subgroup along with its associated uncertainty. Suppose we have two sets of measurements, d1 and d2, each with 30 measurements. We want 3 points for each set, that is, the average of every 10 measurements. Here is one way to do it:

library(psych)

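# group labels: three consecutive subgroups of 10 measurements each
g <- c(rep(1,10), rep(2,10), rep(3,10))

# subgroup means with error bars for d1
error.bars.by(d1[1:length(g)], g, by.var=TRUE, xlab="Time", ylab="Pressure", main="W->L: 10ps", col=2, colors=2, pch=1)

# overlay d2 on the same plot, then label the two series
error.bars.by(d2[1:length(g)], g, by.var=TRUE, xlab="Time", ylab="Pressure", col=3, colors=3, pch=2, add=TRUE)
legend(x=1, y=max(c(d1,d2)), legend=c("Native", "Mutant"), col=2:3, pch=1:2)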
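
For readers who want to see what is being drawn, here is a minimal base R sketch of the same idea without the psych package. It assumes the uncertainty is one standard error of the mean, which may differ from what error.bars.by draws by default. The data are randomly generated, and the helper names group_summary and plot_error_bars are hypothetical.

```r
# Mean and standard error of each consecutive subgroup (base R only).
group_summary <- function(x, g) {
  m  <- tapply(x, g, mean)
  se <- tapply(x, g, function(v) sd(v) / sqrt(length(v)))
  data.frame(group = as.integer(names(m)),
             mean  = as.numeric(m),
             se    = as.numeric(se))
}

# Draw the means and add vertical bars of +/- one standard error.
plot_error_bars <- function(s, ...) {
  plot(s$group, s$mean, type = "b", xaxt = "n",
       ylim = range(s$mean - s$se, s$mean + s$se), ...)
  axis(1, at = s$group)
  arrows(s$group, s$mean - s$se, s$group, s$mean + s$se,
         angle = 90, code = 3, length = 0.05)
}

set.seed(1)
d1 <- rnorm(30, mean = 10)   # made-up measurements, for illustration only
g  <- rep(1:3, each = 10)    # three subgroups of 10
s1 <- group_summary(d1, g)
print(s1)
```

Calling plot_error_bars(s1, xlab="Time", ylab="Pressure") then draws the three means with their bars.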

Big improvement for R: handling larger data

Wednesday, August 11th, 2010

I remember that about two years ago R couldn’t handle more than 2 (or 4) GB of memory on Windows, although on Linux there was no such limit. About a year ago, it couldn’t cluster a 10,000*10,000 matrix. Now the news is out that R can tackle terabyte-class data. This is very good news.

Meanwhile, I have to say that, if I could, I would use Perl or C/C++ to handle large data in order to reduce calculation time. If there is a fair amount of calculation on large data, the better choice for me is certainly C/C++ (or Java for other people). It may take a few hours to write the calculation in R but a few days to implement the same calculation in C++. The running times can be very different: in one of my examples, the calculation could take years in R, in contrast to a couple of weeks in C++. So coding time and running time should both be considered before starting to code.

Passing arguments from command line in R

Thursday, July 1st, 2010

Suppose you need to pass three arguments to R from the command line. In the R script:

args <- commandArgs()

x1 <- args[3]

x2 <- args[4]

x3 <- args[5]

Then run from the command line: R --vanilla < script.R x1 x2 x3
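
An alternative sketch, assuming the script is launched with Rscript rather than through input redirection: commandArgs(trailingOnly = TRUE) returns only the user-supplied arguments, so the indices start at 1 and do not depend on how many options R itself received. The helper name parse_args is hypothetical.

```r
# Collect exactly three user-supplied command-line arguments.
# parse_args is a hypothetical helper; commandArgs() is base R.
parse_args <- function(args = commandArgs(trailingOnly = TRUE)) {
  if (length(args) != 3) {
    stop("expected exactly three arguments: x1 x2 x3")
  }
  # Arguments always arrive as character strings; convert with
  # as.numeric() or similar if numbers are needed.
  list(x1 = args[1], x2 = args[2], x3 = args[3])
}
```

Inside script.R one would call parse_args() once at the top and run the script as: Rscript script.R x1 x2 x3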

Machine learning in R

Tuesday, June 29th, 2010

Brilliant people have contributed many very cool functions to R, which is one of the big boosts to its growing popularity in both academia and industry. I have watched the expansion closely because I am a beneficiary myself. The current list of machine learning methods is:

k-means

k-nearest neighbor

neural networks (one hidden layer only, and kernel is limited)

Support Vector Machine (regression and classification; it is the R version of libsvm)

Random Forest for feature selection (not tried myself yet)

Bayesian inference (not tried myself yet)
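
As a small illustration of the first item, here is a sketch of k-means using the kmeans function that ships with base R. The built-in iris data set, the choice of three clusters, and the nstart value are my assumptions for the example, not recommendations.

```r
# k-means clustering with base R's kmeans() on the built-in iris data.
set.seed(42)                       # make the random starts reproducible
x   <- scale(iris[, 1:4])          # standardize the four numeric columns
fit <- kmeans(x, centers = 3, nstart = 10)

# Compare the cluster assignments with the known species labels.
print(table(cluster = fit$cluster, species = iris$Species))
```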

I’d like to see this trend go on and on. In particular, added functions that require the least amount of user intervention would be the best way to attract more users.

Keep open source programs alive.