Vector Space Models in Text Analysis

January 18th, 2012

This is a very good paper on vector space models in text analysis. I’d like to summary a bit here. One of the major purpose of text analysis is to enable machine understanding human language.

There are currently three ways to understand text in machine world, term-document, word-context, and pair-pattern.

A post about memory in R

December 20th, 2011

Very good discussion about memory issues in R.

Hello Chicago!

August 27th, 2011

My next stop is Chicago. I never thought of staying in Chicago, but here I am. So it is true that life is full of surprises. And as always, lots of hopes and dreams…

Machine Learning Summer School 2009 Cambridge

August 4th, 2011

This is good.

A new machine learning challege on Kaggle

June 29th, 2011

This time it is a wikipedia’s participation challenge, which is to predict how many edits an editor will make in 5 months. It is very intrigue…

Someday I should set up a challenge myself, how often I update my blog! Given the fact that bloggers are tend to be older than facebkookers and twitters, and older people tend to do things slower, a hint for an excellent model is that it should also incorporate that!

Interview series II: sum of two elements

June 29th, 2011

I was asked such an interview question.

Given an unsorted array of positive integers, and also a target value K, find two elements that sum up to K.

So I suggested, sort the array first, and then start from the last (maximum) element that is smaller than K, x_max and then use binary search to find K-x_max. Well the complexity for sorting is O(n log n) and for finding is another O(n log n). Any improvement if the array is sorted? I stuck there, 5 seconds later, he tipped me with something like how about using two index pointers. WOW, great hint. Now the complexity for this step is O(n). Below is one solution.

#include <iostream>

using namespace std;

void quickSort(int *arr, int left, int right) {
	if(left>=right) {return;}
	int Start=left;
	int End = right;
	int pivot=(left+right)/2;
	int pivotV=*(arr+pivot);
	int tmp;

	while(left<right) {
		if(*(arr+left) <= pivotV) {
			left++;
		}
		if(*(arr+right)>=pivotV) {
			right--;
		}
		if(*(arr+left)>pivotV && *(arr+right)<pivotV) {
			tmp=*(arr+left);
			*(arr+left)=*(arr+right);
			*(arr+right)=tmp;
			left++;
			right--;
		}
	}
	*(arr+pivot)=*(arr+right);
	*(arr+right)=pivotV;
	quickSort(arr, Start, left-1);
	quickSort(arr, right+1, End);
}

void ksum(int *arr, int len, int k) {
	int left, right, ktmp;
	left=0;
	right=len-1;
	while(left<right) {
		ktmp=*(arr+left)+*(arr+right);
		if(ktmp==k) {
			cout << *(arr+left) << " + " << *(arr+right) << " = " << k << endl;
			return;
		} else if(ktmp>k) {right--;} else {left++;}
	}
	cout << "Couldn't find the two elements!\n";
	return;
}

int main() {
	int a[10]={10,3,4,9,8,7,2,1,5,6};
	quickSort(a, 0, 9);
	ksum(a, 10, 12);
	return 0;
}

install twitteR on Ubuntu

June 23rd, 2011

Well, I have heard that Google is M$ yesterday, Facebook is Google today, and Twitter is Facebook tomorrow. It is still a surprise to see there is a package in R for twitter already. It is “twitteR”. In order to install it, the following is what worked for me on my Ubuntu.

twitteR can be downloaded here.
As claimed in the document, it requires some libraries installed first.

1). Install libraries for RCurl:
On Ubuntu menu,
System-> Administration->Synaptic Package Management
quick search “curl” will result all programs with curl in their names or description, select to install “curl” only or with additional programs related to libcurl development as well.

2). Install RCurl
Download it and then go to where is saved,
R CMD INSTALL RCurl_1.6-6.tar.gz

3). Install twitteR
Start R console, then install the following:
install.packages(“bitops”)
install.packages(“RJSONIO”)
install.packages(“twitteR”)

install.packages(“ROAuth”)

4), Run twitteR:
library(bitops)
library(RJSONIO)
library(RCurl)
library(twitteR)

Enjoy.

Job interview series I: Steps to answer a technical question

June 21st, 2011

I am officially on the job market and found this is a great market, lots of opportunities and as well as competition. :) This series are nothing but notes that are job interview related. I have two reference books, “Crack the coding interview” and “Programming interview exposed”.

1. Clarify the questions with the interviewer. Knowing what is asked is the basic. This step also could be tricky, sometimes, interviewers don’t know what they are asking, so it could be a step of “Let’s define a question together.”
2. Design an algorithm. Think how to solve it and come out with a solution proposal.
3, Present the idea/algorithm/solution either with pseudo-code or just white board demonstration. If the proposed is wrong, there is no base to go further. So discuss with the interviewer.
4. Code it.
5. Test and bug fixing.

An interesting post about politics and software architecture

June 12th, 2011

It is more than just fun to read it.

maxsum code

May 27th, 2011

Problem: Given an array of integers, each element can be positive, zero or negative, find a sum of a segment of the elements in the array so that its value is the maximum.

For example, {3, -1, 4, -10,5}, the result should be:

maxsum=6 <= 3-1+4.

Solution:
There are two conditions on the unsorted integer array, one is the sum should be maximum and the other is the sum should be among all consecutive elements. Naturally, there are some segments of elements that their sum are larger than others, so we need be able to find where to stop and start the segment so that its sum is maximum. So we need two variables to store segment sum values for comparison, when we move along the array, one variable store the maximum for current segment, and the other is the maximum so far along the array. And when moving along the array, we also need a flag to tell us whether it is time to compare the current maximum against the maximum so far. Now comes to the key question, when to stop and start?

When to stop?
We start with the first element, if it is non-negative, we for sure keep it in the sum, but otherwise, we need make a decision to either “foresee” it or stop here. “Foresee” here means, when its next is so positive that, a[0]+a[1] is non-negative, so we will overall better off with such a negative element in “longer” term. If a[0]+a[1] is negative, we’ll have no choice but stop the current sum. We can easily derive that this stop rule is a general rule applying for starting at any position.

When to start new segment?
After the stop, we need a new start. We obviously want to start with a positive element because for sure it is contributing positively to the new sum. And after the new start, we follow the stop rule described above. When this new segment is stop, we compare the new sum against the previous sum on previous segment and keep the larger one. Later we just keep the largest sum and compare any new max against it, doing so we’ll get the max for the whole array.

The following is my implementation in C/C++:

#include <iostream>
using namespace std;
int sumMax(int *a, int n) {
	if(n==0){
		cout << "Error: the array is empty!\n";
	}
	int max1, max2;
	int flag=-1;
	int k=n;
	max1=max2=0;
	while(k>0) {
		if(*a >= 0) {
			if(flag==1) {
				max2+=(*a);
			} else {
				max1+=(*a);
			}
			a++;k--;
		} else {
			if(k-1>0) {
				if(*a+*(a+1)>=0){
					if(flag==1) {
						max2+=(*a+*(a+1));
					} else {
						max1+=(*a+*(a+1));
					}
					a+=2;k-=2;
				} else {
					flag=1;
					a++;k--;
					if(*a >=0) {
						max2=(*a);a++;k--;
					}
				}
			}
		}
		if(max2>=max1) {
			max1=max2;
			flag=-1;
		}
	}
	return max1;
}

int main(void) {
	int a[]={4, -1, 3, -10, 1, 25, -11, 30, -10, 50};
	int i;
	int max=sumMax(a,10);
	for(i=0; i<10; i++) {
		cout << a[i] << " ";
	}
	cout << endl;
	cout << "max=" << max << endl;
	return 1;
}