Archive for the ‘CompNotes’ Category

A video from the Creator of Rhipe

Monday, February 21st, 2011

It is good to hear the creator himself talk about RHIPE.

RHIPE: An Interface Between Hadoop and R
Presented by Saptarshi Guha

Video Link

Reservoir sampling in perl

Saturday, February 12th, 2011

I had a very large data file and wanted to randomly select some lines from it. Besides the built-in sampling functions in R or Perl, there is a very elegant algorithm called reservoir sampling: keep the first N lines, then replace a randomly chosen slot with decreasing probability as more lines are read, so that every line ends up in the sample with equal probability. Here is a Perl script I wrote to do it:
#!/usr/bin/perl
use strict;
use warnings;

# Reservoir sampling: draw N lines from a large file, each with equal probability.

sub ResSampleFile {
    my ($file, $N) = @_;
    open(my $FH, '<', $file) || die "Couldn't open $file! $!\n";
    my $count = 0;
    my @ret;
    while (my $line = <$FH>) {
        $count++;
        if ($count <= $N) {
            # fill the reservoir with the first N lines
            push @ret, $line;
        } else {
            # replace a random slot with probability N/$count
            my $random = int(rand($count)) + 1;
            if ($random <= $N) {
                $ret[$random - 1] = $line;
            }
        }
    }
    close($FH);
    return @ret;
}

#Usage:
#my @sample = ResSampleFile("abc.txt", 100);
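
Why does every line end up in the sample with the same probability? A quick sketch of the standard induction argument, in the same notation as the script above:

After t \geq N lines have been read, assume each of them is currently in the reservoir with probability \frac{N}{t}. Line t+1 is inserted with probability \frac{N}{t+1}, and when it is inserted it evicts any particular slot with probability \frac{1}{N}. So a line already in the reservoir survives the step with probability 1-\frac{N}{t+1}\cdot\frac{1}{N}=\frac{t}{t+1}, and its overall probability becomes \frac{N}{t}\cdot\frac{t}{t+1}=\frac{N}{t+1}. By induction, every line is kept with probability \frac{N}{\mathrm{total}}.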

k-Nearest Neighbor and K-Means clustering

Saturday, February 5th, 2011

These are arguably two of the most commonly used methods of their kind: k-nearest neighbor for classification and K-means for clustering. One reason is that they are easy to use and fairly straightforward. So how do they work?

k-Nearest Neighbor:
We are given N entries in n dimensions, each with a known class label, that is, \{\vec{x_{i}}, y_{i}\}, \vec{x_{i}}\in \Re^{n}, y_{i}\in\{c_{1}, .., c_{m}\}, i=1,..,N. For a new entry \vec{v_{j}}, which class should it belong to? We use a distance measure to find the k entries closest to \vec{v_{j}}, and the final decision is a simple majority vote among those k nearest neighbors. The distance metric can be Euclidean or any similar measure.
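
As a concrete illustration, here is a small Perl sketch of the idea (the function names and data layout are just for illustration): Euclidean distance plus a simple majority vote.

#!/usr/bin/perl
use strict;
use warnings;

# k-nearest neighbor sketch: each training entry is [\@features, $class].
# Assumes $k is no larger than the number of training entries.
sub knn_classify {
    my ($train, $query, $k) = @_;
    # distance from the query to every training entry, sorted ascending
    my @scored = sort { $a->[0] <=> $b->[0] }
                 map  { [ euclid($_->[0], $query), $_->[1] ] } @$train;
    # simple majority vote among the k closest entries
    my %votes;
    $votes{ $scored[$_][1] }++ for 0 .. $k - 1;
    my ($winner) = sort { $votes{$b} <=> $votes{$a} } keys %votes;
    return $winner;
}

sub euclid {
    my ($p, $q) = @_;
    my $sum = 0;
    $sum += ($p->[$_] - $q->[$_])**2 for 0 .. $#{$p};
    return sqrt($sum);
}

#Usage:
#my @train = ([[1,1],'A'], [[1,2],'A'], [[8,9],'B'], [[9,8],'B']);
#print knn_classify(\@train, [1.5, 1.2], 3), "\n";    # prints A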

K-means:
Given N entries in n dimensions, classify them into k clusters. First we randomly choose k entries and assign one to each cluster; these are the seeds. Then we calculate the distance between each entry and each cluster centroid, and assign each entry to its closest cluster. After the assignment is complete, we recompute the centroid of each cluster from its new members. We then go back to the distance calculation for a new round of assignment, and stop iterating when the process converges, i.e., the centroids and the assignments no longer change.
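
And a minimal Perl sketch of that loop (again the names and data layout are illustrative; it assumes numeric feature vectors of equal length and k no larger than the number of entries):

#!/usr/bin/perl
use strict;
use warnings;

# K-means sketch: $data is a ref to an array of feature-vector refs.
sub kmeans {
    my ($data, $k, $max_iter) = @_;
    $max_iter ||= 100;
    # seed the centroids with the first k entries (the post picks them at random)
    my @centroids = map { [ @$_ ] } @{$data}[0 .. $k - 1];
    my @assign = (-1) x @$data;
    for my $iter (1 .. $max_iter) {
        my $changed = 0;
        # assignment step: each entry goes to its closest centroid
        for my $i (0 .. $#{$data}) {
            my ($best, $best_d);
            for my $c (0 .. $k - 1) {
                my $d = sqdist($data->[$i], $centroids[$c]);
                ($best, $best_d) = ($c, $d) if !defined $best_d || $d < $best_d;
            }
            $changed++ if $assign[$i] != $best;
            $assign[$i] = $best;
        }
        last unless $changed;    # converged: no assignment changed
        # update step: recompute each centroid from its members
        my $dim = scalar @{ $data->[0] };
        for my $c (0 .. $k - 1) {
            my @members = grep { $assign[$_] == $c } 0 .. $#{$data};
            next unless @members;
            my @sum = (0) x $dim;
            for my $i (@members) {
                $sum[$_] += $data->[$i][$_] for 0 .. $dim - 1;
            }
            $centroids[$c] = [ map { $_ / scalar(@members) } @sum ];
        }
    }
    return \@assign;
}

sub sqdist {
    my ($p, $q) = @_;
    my $d = 0;
    $d += ($p->[$_] - $q->[$_])**2 for 0 .. $#{$p};
    return $d;
}

#Usage:
#my $labels = kmeans([[1,1],[1,2],[8,9],[9,8]], 2);
#print "@$labels\n";    # e.g. 0 0 1 1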

Both methods require us to provide the number k before we run them: the number of neighbors to vote in k-nearest neighbor, and the number of clusters in K-means. Note, though, that k-nearest neighbor is a supervised method (it needs labeled entries), while K-means is unsupervised.

Support vector machines soft margin primal and dual problems derivation

Saturday, January 22nd, 2011

Support vector machines (SVM) have been used extensively in machine learning. In particular, the soft-margin case, which allows misclassification at the cost of a penalty, has been applied to problems spanning a variety of fields. Here let's study how the primal problem is reformulated as a dual problem, which makes its solution easier to understand.

Suppose we have inputs x_{i} with corresponding measurements y_{i}, and the solution w should be such that we can separate the inputs according to their measurements. The standard SVM primal formulation is:

 min \frac{1}{2}w^{T}w +C \sum_{i}\omega_{i}

s.t. y_{i}(w^{T}x_{i}+b) \geq 1-\omega_{i}

\omega_{i} \geq 0

We introduce Lagrange multipliers \alpha_{i} and \gamma_{i}, so the problem becomes:

min_{(w,b,\omega)} max_{(\alpha, \gamma)} \{ \frac{1}{2}w^{T}w +C \sum_{i}\omega_{i}-\sum_{i}\alpha_{i}(y_{i}(w^{T}x_{i}+b)-1+\omega_{i})-\sum_{i}\gamma_{i}\omega_{i}\}

Why do we maximize over \alpha and \gamma? Because if any constraint were violated, the maximum over the multipliers would be +\infty; maximizing therefore enforces the constraints, while for feasible points the maximum recovers the original objective,

and \alpha_{i}\geq 0,\gamma_{i} \geq 0.

The saddle point conditions are:

\frac{\partial L}{\partial w}=0, \frac{\partial L}{\partial b}=0, \frac{\partial L}{\partial \omega}=0.

The solution is:

w=\sum_{i}\alpha_{i}y_{i}x_{i}, \sum_{i}\alpha_{i}y_{i}=0, and 0 \leq \alpha_{i} \leq C.
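
These follow directly from the three saddle point conditions; a quick sketch:

\frac{\partial L}{\partial w}=w-\sum_{i}\alpha_{i}y_{i}x_{i}=0 \Rightarrow w=\sum_{i}\alpha_{i}y_{i}x_{i}

\frac{\partial L}{\partial b}=-\sum_{i}\alpha_{i}y_{i}=0 \Rightarrow \sum_{i}\alpha_{i}y_{i}=0

\frac{\partial L}{\partial \omega_{i}}=C-\alpha_{i}-\gamma_{i}=0 \Rightarrow \gamma_{i}=C-\alpha_{i}\geq 0 \Rightarrow \alpha_{i}\leq C

Combined with \alpha_{i}\geq 0, this gives 0 \leq \alpha_{i} \leq C.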

Substituting these back into the Lagrangian, the problem becomes the dual: max_{\alpha} \sum_{i}\alpha_{i}-\frac{1}{2}\sum_{i}\sum_{j}\alpha_{i}\alpha_{j}y_{i}y_{j}x_{i}^{T}x_{j}

with \sum_{i}\alpha_{i}y_{i}=0, 0 \leq \alpha_{i} \leq C. Notice that C and \omega no longer appear explicitly in the objective. Is that magic?
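
Not really: substituting the saddle point conditions back into the Lagrangian makes those terms cancel. Grouping the \omega terms, the Lagrangian reads

L=\frac{1}{2}w^{T}w+\sum_{i}(C-\alpha_{i}-\gamma_{i})\omega_{i}-\sum_{i}\alpha_{i}y_{i}w^{T}x_{i}-b\sum_{i}\alpha_{i}y_{i}+\sum_{i}\alpha_{i}

The \omega term vanishes because C-\alpha_{i}-\gamma_{i}=0, the b term vanishes because \sum_{i}\alpha_{i}y_{i}=0, and plugging in w=\sum_{i}\alpha_{i}y_{i}x_{i} turns the remaining terms into \sum_{i}\alpha_{i}-\frac{1}{2}\sum_{i}\sum_{j}\alpha_{i}\alpha_{j}y_{i}y_{j}x_{i}^{T}x_{j}, which is exactly the dual objective above. C survives only through the box constraint 0 \leq \alpha_{i} \leq C.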

References:

1. Wikipedia

2. Xindong Wu (ed.), The Top Ten Algorithms in Data Mining