Archive for the ‘scripts & tips’ Category

mvn build with dependencies

Wednesday, December 31st, 2014

mvn clean assembly:assembly -DdescriptorId=jar-with-dependencies

Install Spark on AWS EC2 Ubuntu instances

Friday, December 19th, 2014

It has been annoying to install spark a couple of times, finally, I decide to document the process.

After the new instance is spin-up, here is the list, you might want to skip some if already installed.

sudo apt-get update
sudo apt-get install git

#install java 8
#sudo apt-get install software-properties-common
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer

#install scala
wget http://www.scala-lang.org/files/archive/scala-2.11.4.deb
sudo dpkg -i scala-2.11.4.deb
sudo apt-get update
sudo apt-get install scala

#install sbt
wget http://dl.bintray.com/sbt/debian/sbt-0.13.6.deb
sudo dpkg -i sbt-0.13.6.deb
sudo apt-get update
sudo apt-get install sbt

#may need change to the latest and stable version if needed, it is the latest as of December 19, 2014
git clone https://github.com/apache/spark.git
cd spark
sbt/sbt clean assembly
#it took about 8 minutes on my laptop

#Is it ready?

./bin/spark-shell

A small OpenMP practice

Monday, October 21st, 2013

I had to parallelize a piece of code. The requirements were, single node with multiple cores, RAM might be limited based on data sets. OpenMP seems a good fit for my purpose, here is my solution:

	
	#pragma omp parallel num_threads(MAX_THREADS_NUMBER)
	{
		int total_threads = omp_get_num_threads();
		int ID = omp_get_thread_num();
		int block = (int) N / (double) total_threads;
		int start = ID * block;
		int end = (ID + 1) * block; 
		if (ID == (total_threads - 1)) {end = N;}

		VpTree* tree = new VpTree(X);

		for(int n = start; n< end; n++) {
		        search(n, tree);
		}
		delete treeC;
	}

At beginning, the start and end for the loop segment are manually defined. Well, this is not necessary in some cases since one could use

	#pragma omp parallel for
	for(int n = 0; n< N; n++) {
	        search(n, tree);
	}

But the problem was that, in “search()” there are a bit messy/complicated/un-neat reading of the pointer tree. So I always ended up “Segmentation fault”. That’s why I was duplicating the pointer “tree” for each thread. The max number of threads allowed is pre-defined in order to comply with RAM issues. There might be more elegant solution, but I’m easy to satisfy.

Last but not least, I benefited a lot from the presentation of Tim Mattson and Larry Meadows from Intel: A “Hands-on” Introduction to OpenMP.

Install zlib1g-dev on Amazon Linux AMI release 2012.09

Tuesday, March 19th, 2013

 

yum groupinstall "Development Tools" 
yum groupinstall "Development Libraries"

replace ctrl-A character

Thursday, September 6th, 2012
tr '\01' '\t' < file.in > file.out

maxsum code

Friday, May 27th, 2011

Problem: Given an array of integers, each element can be positive, zero or negative, find a sum of a segment of the elements in the array so that its value is the maximum.

For example, {3, -1, 4, -10,5}, the result should be:

maxsum=6 <= 3-1+4.

Solution:
There are two conditions on the unsorted integer array, one is the sum should be maximum and the other is the sum should be among all consecutive elements. Naturally, there are some segments of elements that their sum are larger than others, so we need be able to find where to stop and start the segment so that its sum is maximum. So we need two variables to store segment sum values for comparison, when we move along the array, one variable store the maximum for current segment, and the other is the maximum so far along the array. And when moving along the array, we also need a flag to tell us whether it is time to compare the current maximum against the maximum so far. Now comes to the key question, when to stop and start?

When to stop?
We start with the first element, if it is non-negative, we for sure keep it in the sum, but otherwise, we need make a decision to either “foresee” it or stop here. “Foresee” here means, when its next is so positive that, a[0]+a[1] is non-negative, so we will overall better off with such a negative element in “longer” term. If a[0]+a[1] is negative, we’ll have no choice but stop the current sum. We can easily derive that this stop rule is a general rule applying for starting at any position.

When to start new segment?
After the stop, we need a new start. We obviously want to start with a positive element because for sure it is contributing positively to the new sum. And after the new start, we follow the stop rule described above. When this new segment is stop, we compare the new sum against the previous sum on previous segment and keep the larger one. Later we just keep the largest sum and compare any new max against it, doing so we’ll get the max for the whole array.

The following is my implementation in C/C++:

#include <iostream>
using namespace std;
int sumMax(int *a, int n) {
	if(n==0){
		cout << "Error: the array is empty!\n";
	}
	int max1, max2;
	int flag=-1;
	int k=n;
	max1=max2=0;
	while(k>0) {
		if(*a >= 0) {
			if(flag==1) {
				max2+=(*a);
			} else {
				max1+=(*a);
			}
			a++;k--;
		} else {
			if(k-1>0) {
				if(*a+*(a+1)>=0){
					if(flag==1) {
						max2+=(*a+*(a+1));
					} else {
						max1+=(*a+*(a+1));
					}
					a+=2;k-=2;
				} else {
					flag=1;
					a++;k--;
					if(*a >=0) {
						max2=(*a);a++;k--;
					}
				}
			}
		}
		if(max2>=max1) {
			max1=max2;
			flag=-1;
		}
	}
	return max1;
}

int main(void) {
	int a[]={4, -1, 3, -10, 1, 25, -11, 30, -10, 50};
	int i;
	int max=sumMax(a,10);
	for(i=0; i<10; i++) {
		cout << a[i] << " ";
	}
	cout << endl;
	cout << "max=" << max << endl;
	return 1;
}

vim settings

Tuesday, January 11th, 2011

Before I forget, some trivial tips for vi editor, modify ~.vimrc file:

set ruler “to tell where the cursor is, line and column number”

set incsearch “where search, it automatically and simultaneously match the current typing in the text”

awk trick 1: reverse words

Thursday, August 5th, 2010

Suppose a string “cp -r Documents stam”, want to reverse it to “stam Documents -r cp”. A very simple awk code:

echo “cp -r Documents stam” | awk ‘{ for (i=NF; i>0; i–) printf(“%s “,$i)}’

stam Documents -r cp

NOTE: when copy the above command and run it at command line, quotation marks need to be edited.

Keywords NB and NF are built-in variables, means the beginning number and the total number of fields in current record, respectively. NR is the total number of records read in an input file.

WOW, I didn’t realize how simple it is! I can’t claim that I have used Linux for 10 years any more! :(

Hash table and Tree

Tuesday, August 3rd, 2010

I just learned that tree could have same function as hash table. Since it is clear that I need re-learn hash table, I’d like to write this entire post on this topic. Never too late, I live to learn. :)

First of all, let me summary data structure first.

According to wikipedia, data structure is a way to organize and store data in a computer so that it can be used efficiently. In other words, data is a collection of items and the collection itself is based on some kinda “simple” rules that can be used to reach those items. There are 4 types of data structures, arrays, lists, trees, and graphs. Well, as far as data types or item types are concerned, there are 3 types, primitive types, composite types, and abstract data types. Again, they are from wikipedia.

To be continued.

STL in C++

Thursday, July 29th, 2010

STL stands for Standard Template Library in C++. It is very convenient, especially for people who use diverse languages. For example, I would assume there are functions like this:

a = “007”

as.numeric(a)==7 => should be true

In many latest languages, it is very friendly(saving finger muscle movements) that you can find such similar function doing the conversion. But back to C++, it is not the case at all. All the sudden we are back to stone age. No need to panic though. In STL, there are some functions for this purpose–converting string to number. Although still not very convenient, it is better than nothing unless you write your own (perfectly fine).  Suppose you want to pass input values for different variable at command line, there are up to 3 variables, “start”, “end”, “seed”, the STL string is quite useful for this purpose because it has strcmp and atoi to do string comparison and converting string to integers, respectively. Here is how I would do:

#include <string>

using namespace std;

int main(int argc, char *argv[])
{
int StartI, EndI, Seed;

if (argc ==  1){
cout << “Error: no command line inputs!” << endl;
exit(0);
}

for (int ai=1; ai < argc; ai++){
if (strcmp(argv[ai], “-start”)==0)
StartI = atoi(argv[++ai]);
else if (strcmp(argv[ai], “-end”)==0)
EndI = atoi(argv[++ai]);
else if (strcmp(argv[ai], “-seed”)==0)
Seed = atoi(argv[++ai]);
else
cout << “Unknown command line inputs!” << endl;
}
}

There are surely other excellent STLs, for example, vector, list,… But I personally haven’t used them yet since if possible I use other languages (PERL and R) to do such things instead.