It was good to hear the creator talk about RHIPE.

**RHIPE: An Interface Between Hadoop and R**

Presented by Saptarshi Guha

I had a very large data file and needed to randomly select some lines from it. Besides the built-in sampling functions in R or Perl, there is a very elegant algorithm called reservoir sampling: it selects N lines from a stream with equal probability in a single pass, without knowing the total number of lines in advance. Here is a Perl script I wrote to do it:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reservoir sampling: draw N lines from a large file,
# each with equal probability, in a single pass.
sub ResSampleFile {
    my ($file, $N) = @_;
    open(my $FH, '<', $file) || die "Couldn't open $file! $!\n";
    my $count = 0;
    my @ret;
    while (my $line = <$FH>) {
        $count++;
        if ($count <= $N) {
            # Fill the reservoir with the first N lines.
            push @ret, $line;
        } else {
            # Replace a random reservoir slot with probability N/$count.
            my $random = int(rand($count)) + 1;
            if ($random <= $N) {
                $ret[$random - 1] = $line;
            }
        }
    }
    close($FH);
    return @ret;
}

# Usage:
# my @sample = ResSampleFile("abc.txt", 100);
```
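For comparison, the same algorithm can be sketched in Python; the function name `res_sample` and the example data are my own, for illustration:

```python
import random

def res_sample(iterable, n, seed=None):
    """Select n items from a stream with equal probability in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for count, item in enumerate(iterable, start=1):
        if count <= n:
            # Fill the reservoir with the first n items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability n/count.
            j = rng.randint(1, count)
            if j <= n:
                reservoir[j - 1] = item
    return reservoir

# Example: sample 3 lines from a stream of 10.
lines = [f"line {i}\n" for i in range(10)]
sample = res_sample(lines, 3, seed=42)
```

Note that if the stream has fewer than n items, the whole stream is returned, which is the natural degenerate case.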

k-nearest neighbor and k-means are arguably two of the most commonly used classification and clustering methods. One reason is that they are easy to use and fairly straightforward. So how do they work?

**k-Nearest Neighbor:**

Given N entries in n-dimensional space, each with a known class label, which class should a new entry belong to? Using a distance measure, we find the k entries closest to the new one; the final decision is a simple majority vote among these k nearest neighbors. The distance metric can be Euclidean or another similar one.
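The voting procedure above can be sketched in a few lines of Python (function and variable names are my own, for illustration):

```python
from collections import Counter
import math

def knn_classify(train, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.
    `train` is a list of n-dimensional points, `labels` their classes."""
    # Euclidean distance from the query to every training point.
    dists = [(math.dist(p, query), y) for p, y in zip(train, labels)]
    dists.sort(key=lambda t: t[0])
    # Majority vote among the k closest neighbors.
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# Two well-separated 2-D classes.
train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
```

A query near the origin should then vote "a", and one near (5, 5) should vote "b".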

**K-means:**

Given N entries in n-dimensional space, we want to partition them into k clusters. First, we randomly choose k entries as seeds, one per cluster. Then we compute the distance between each entry and each cluster centroid, and assign each entry to its closest cluster. After the assignment is complete, we recompute the centroid of each cluster from its new members. With the new centroids, we go back to the distance calculation for a new round of assignment. We stop iterating at convergence, i.e., when the centroids and assignments no longer change.
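The alternation described above (assign, then recompute centroids) is Lloyd's algorithm; a minimal Python sketch, with names of my own choosing:

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid updates."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # seed clusters with k random entries
    assign = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        new_assign = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_assign == assign:  # converged: assignments unchanged
            break
        assign = new_assign
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids, assign

# Two obvious 2-D clusters.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assign = kmeans(pts, 2)
```

On this toy data the two groups are so well separated that the iteration recovers them regardless of which seeds are drawn.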

Both methods require us to specify k in advance: the number of neighbors for k-nearest neighbor and the number of clusters for k-means. Note that despite the similar names, k-nearest neighbor is a supervised method (it needs labeled training entries), while k-means is unsupervised.

Support vector machines (SVM) have been used extensively in machine learning. In particular, the soft-margin case, which allows misclassification with an associated penalty, has been applied to many problems spanning a variety of fields. Here let's study how the primal problem is reformulated in its dual form, which makes the solution easier to understand.

Suppose we have inputs $x_i \in \mathbb{R}^n$ with corresponding labels $y_i \in \{-1, +1\}$, $i = 1, \dots, N$, and we seek a solution $w$ that separates the inputs according to their labels. The standard soft-margin SVM primal formulation is:

$$\min_{w,\,b,\,\xi} \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i \quad \text{subject to} \quad y_i(w \cdot x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0.$$

We introduce Lagrange multipliers $\alpha_i \ge 0$ and $\mu_i \ge 0$ for the two sets of constraints, so:

$$L(w, b, \xi, \alpha, \mu) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i\left[y_i(w \cdot x_i + b) - 1 + \xi_i\right] - \sum_{i=1}^{N}\mu_i\xi_i,$$

and we solve

$$\min_{w,\,b,\,\xi} \; \max_{\alpha \ge 0,\, \mu \ge 0} \; L.$$

Why maximize over the multipliers? Because otherwise there is a trivial solution: without the inner maximization nothing penalizes violating the constraints, whereas with it any violated constraint drives $L$ to $+\infty$, forcing the outer minimization to respect the constraints.

The saddle point conditions are:

$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{N}\alpha_i y_i x_i = 0, \qquad \frac{\partial L}{\partial b} = -\sum_{i=1}^{N}\alpha_i y_i = 0, \qquad \frac{\partial L}{\partial \xi_i} = C - \alpha_i - \mu_i = 0.$$

The solution is:

$$w = \sum_{i=1}^{N}\alpha_i y_i x_i, \qquad \sum_{i=1}^{N}\alpha_i y_i = 0, \qquad \alpha_i = C - \mu_i \le C.$$

Let's rewrite the Lagrange formula by substituting these conditions back in, which yields the dual problem:

$$\max_{\alpha} \; \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j\,(x_i \cdot x_j)$$

with $0 \le \alpha_i \le C$ and $\sum_{i=1}^{N}\alpha_i y_i = 0$. There are no $C$ and $\xi_i$ in the maximized function anymore ($C$ survives only as the box constraint $\alpha_i \le C$, since $\mu_i = C - \alpha_i \ge 0$); is that magic?!
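As a sanity check, here is a small Python sketch (a toy example of my own, not from the text) that maximizes this dual objective on two points $x_1 = (1, 0),\, y_1 = +1$ and $x_2 = (-1, 0),\, y_2 = -1$ by scanning the feasible $\alpha$ values, then recovers $w$ from the saddle-point condition:

```python
# Toy data: one positive and one negative point in the plane.
X = [(1.0, 0.0), (-1.0, 0.0)]
y = [1.0, -1.0]
C = 10.0  # soft-margin penalty; the data are separable, so it stays inactive

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dual_objective(alpha):
    """Dual: sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)."""
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
               for i in range(2) for j in range(2))
    return sum(alpha) - 0.5 * quad

# The constraint sum_i alpha_i y_i = 0 forces alpha_0 == alpha_1 here,
# so it suffices to scan one value a in [0, C].
grid = [i * C / 10000 for i in range(10001)]
best = max(grid, key=lambda a: dual_objective((a, a)))

# Recover w from the saddle-point condition w = sum_i alpha_i y_i x_i.
w = tuple(sum(best * y[i] * X[i][d] for i in range(2)) for d in range(2))
```

For this data the dual objective reduces to $2\alpha - 2\alpha^2$, which peaks at $\alpha = 1/2$ and gives $w = (1, 0)$: both points sit exactly on the margin, as expected.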

Reference:

1. Wikipedia

2. Wu, Xindong (ed.), *The Top Ten Algorithms in Data Mining*