Archive for August, 2010

Big improvement for R: handling larger data

Wednesday, August 11th, 2010

I remember about two years ago, R couldn’t handle more than 2 (at most 4) GB of memory on Windows, while on Linux it was effectively unlimited. About a year ago, it couldn’t cluster a 10,000 × 10,000 matrix. Now the news is out that R can tackle terabyte-class data. This is very good news.

Meanwhile, I have to say that if I could, I would use Perl or C/C++ to handle large data, for the sake of reducing computation time. If there is a fair amount of computation on large data, the better choice for me is certainly C/C++ (or Java, for other people). It may take a few hours to write a calculation in R but a few days to implement the same calculation in C++, yet the running times can differ enormously: in one of my own cases, a calculation that would have taken years in R finished in a couple of weeks in C++. So both coding time and running time should be weighed before starting to code.

a good book on data mining in the blogosphere

Sunday, August 8th, 2010

“Modeling and Data Mining in Blogosphere” is a tiny book, but the content is pretty heavy. The authors are Nitin Agarwal and Huan Liu, two professors from the University of Arkansas at Little Rock and Arizona State University, respectively.

Although a huge amount of data is out there, it is hard to find usable datasets. A tip from this book is that several datasets are available for research and public use:

  1. Social Computing Data Repository
  2. Spinn3r
  3. Nielsen BuzzMetrics dataset
  4. TREC Blog dataset

Maybe you can do something with them.

From a blogger’s point of view, there are two issues that I care about most: one is reaching a larger audience, the other is finding a community, which means talking to my own people. Otherwise a blog is just a personal journal with a few readers. Well, that is pretty much this blog. :(

awk trick 1: reverse words

Thursday, August 5th, 2010

Suppose we have the string “cp -r Documents stam” and want to reverse it to “stam Documents -r cp”. A very simple piece of awk code does the job:

echo "cp -r Documents stam" | awk '{ for (i=NF; i>0; i--) printf("%s ", $i) }'

stam Documents -r cp

NF and NR are awk built-in variables: NF is the number of fields in the current record, and $i is the i-th field, so looping i from NF down to 1 prints the fields in reverse order. NR is the number of records (lines) read so far from the input.
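
As a quick illustration of NR, this one-liner prefixes each input line with its line number:

printf "first\nsecond\n" | awk '{print NR": "$0}'

1: first
2: second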

Wow, I didn’t realize it could be this simple! I can’t claim that I have used Linux for 10 years any more! :(

Two more interesting papers

Tuesday, August 3rd, 2010

To read:

A three-dimensional model of the yeast genome (link)

Sequence space and the ongoing expansion of the protein universe (link)

Hash table and Tree

Tuesday, August 3rd, 2010

I just learned that a tree can serve the same function as a hash table. Since it is clear that I need to re-learn hash tables, I’d like to devote this entire post to the topic. Never too late; I live to learn. :)

First of all, let me summarize data structures.

According to Wikipedia, a data structure is a way to organize and store data in a computer so that it can be used efficiently. In other words, data is a collection of items, and the collection itself follows some kind of “simple” rules that can be used to reach those items. There are four types of data structures: arrays, lists, trees, and graphs. As far as data types (item types) are concerned, there are three: primitive types, composite types, and abstract data types. Again, these are from Wikipedia.
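
To make the tree-versus-hash-table point concrete, here is a minimal C++ sketch (the word list is just made-up illustration data): std::map is typically implemented as a balanced binary search tree, while std::unordered_map is a hash table, yet both serve the same key-to-value lookup function.

#include <iostream>
#include <map>
#include <string>
#include <unordered_map>

int main() {
    // std::map is typically a balanced binary search tree (a red-black tree):
    // keys are kept in sorted order, lookup/insert cost O(log n).
    std::map<std::string, int> tree_counts;

    // std::unordered_map is a hash table:
    // keys are unordered, lookup/insert cost O(1) on average.
    std::unordered_map<std::string, int> hash_counts;

    // Count some made-up words with both structures, through the same interface.
    const char* words[] = {"blog", "awk", "tree", "awk", "blog", "awk"};
    for (const char* w : words) {
        ++tree_counts[w];
        ++hash_counts[w];
    }

    std::cout << "tree-based map (keys come out sorted):\n";
    for (const auto& p : tree_counts)
        std::cout << "  " << p.first << " " << p.second << "\n";

    std::cout << "hash table (keys come out in arbitrary order):\n";
    for (const auto& p : hash_counts)
        std::cout << "  " << p.first << " " << p.second << "\n";
    return 0;
}

So the two structures answer the same question (“what value is stored under this key?”) with different trade-offs: the tree additionally keeps keys sorted and supports range queries, while the hash table is faster on average but gives up ordering.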

To be continued.