Port binding on DigitalOcean Ubuntu

February 23rd, 2014

I had trouble binding port 80 on a DigitalOcean Ubuntu droplet and was rescued by this page on Stack Overflow. (Ports below 1024 are privileged, so a non-root app can't bind port 80 directly; redirecting port 80 to port 3000 with iptables sidesteps that.) Following the page, what one needs is:

sudo iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 -j REDIRECT --to-port 3000

Then edit the file "/etc/rc.local" so the rule survives a reboot; notice that it is NOT the other file suggested on the webpage. Editing means adding the above command with one small modification (no sudo, since rc.local already runs as root):

iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 -j REDIRECT --to-port 3000
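
So "/etc/rc.local" would end up looking something like this (a sketch of the stock Ubuntu file; the "exit 0" line is its standard ending and the new rule must go before it):

	#!/bin/sh -e
	#
	# rc.local: executed at the end of each multiuser runlevel

	iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 -j REDIRECT --to-port 3000

	exit 0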

So far this has solved my problem and everything is working.
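
As a quick sanity check (standard iptables usage, not something from the original page), you can list the NAT rules to confirm the redirect is active, and delete it by rule number if you ever need to undo it:

	# show current PREROUTING rules in the NAT table, with rule numbers
	sudo iptables -t nat -L PREROUTING -n --line-numbers

	# remove the redirect again, e.g. if it shows up as rule number 1
	sudo iptables -t nat -D PREROUTING 1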


Limit RAM size used by Redis

February 18th, 2014

If you want to cap the amount of RAM allocated to Redis, you need to change the file named "redis.conf" under the path where Redis is installed. The trick is that when you run "./redis-server" with no arguments, it uses built-in defaults; unless you specifically pass that modified "redis.conf" when starting "redis-server", it won't use the modified file at all. What I did, for example, to allocate only 100MB of RAM to Redis:

Just add one line, "maxmemory 100mb", to "redis.conf", then start the server with "./src/redis-server redis.conf". It should work. But be aware of the possible consequences of such a restriction: once the limit is hit, "out of memory" errors may come uninvited…
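
For reference, a minimal sketch of the relevant "redis.conf" lines. The maxmemory-policy line is my addition here, not part of the original setup; it controls what Redis does at the limit, and the default, noeviction, is what produces the out-of-memory errors on writes:

	# cap Redis at 100MB of RAM
	maxmemory 100mb

	# optional: evict least-recently-used keys instead of failing writes
	# (the default, noeviction, returns OOM errors once the cap is hit)
	maxmemory-policy allkeys-lru

You can verify the setting took effect with "./src/redis-cli config get maxmemory", which prints the limit in bytes.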

 

Comparing PaaS: Heroku vs Elastic Beanstalk vs Linode vs DigitalOcean?

February 17th, 2014

For this part I really did my homework, after spending a lot of time on trial and error. Google's PaaS doesn't support Java, so, sorry, it's off the list.

The free tiers of Heroku and Elastic Beanstalk work great if your app is relatively small: the upload limit is 300MB for Heroku and 512MB for a t1.micro on EB. If your files are larger than those two sizes, you simply won't be able to submit. When I was blindly trying to fix my upload problem, there was no obvious statement about the two limits anywhere (at least not where I looked). On EB, the t1.micro allocates at most 650MB of RAM; if you need more, you'll have to upgrade to a paid tier.

MEAN infrastructure: both have excellent support for it, Heroku in particular; users generally don't even need to run npm install themselves. They also have many integrated libraries available through their web consoles, which you can simply click "add" to enable, and Redis is available on Heroku as well. However, I never figured out how to ssh in to change anything on the server. In contrast, EB doesn't have Redis at all, only PostgreSQL and DynamoDB for databases; the good thing is that it allows you to connect to the server via ssh, so hopefully you can do more exotic things. One trivial but definitely very time-consuming thing to figure out: you need to zip your entire app with package.json and app.js at the root of the archive. If your zip wraps everything in a subdirectory, you will never see your app come up (at least as of now). The deployed files show up at /var/local/current. Again, don't ever zip the directory itself; zip all the files and subdirectories inside it, as sketched below.
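
A minimal sketch of that packaging step (the directory name is just illustrative): run zip from inside the app directory so that package.json and app.js land at the root of the archive.

	cd myapp                  # the directory holding package.json and app.js
	zip -r ../myapp.zip .     # zip the contents; NOT "zip -r myapp.zip myapp" from the parent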

I didn't explore the price tags on Heroku, but EB/AWS is not cheap at all, particularly compared to Linode.com and digitalocean.com. I settled on digitalocean.com, because it offers 2GB RAM + 40GB SSD for $20/month. Their tech support is more than excellent: every question so far has been answered within a few hours, some within 30 minutes. The nice thing about DigitalOcean is that they provide MEAN as well. The catch is that some things you'll be on your own for; DNS binding, for example, you need to figure out yourself. They do provide excellent docs to help, though.


Lucene NGramTokenizer

February 12th, 2014

From what I found, NGramTokenizer() in Lucene is really a character-level tokenizer: its n-grams are sequences of characters, not words. If a gram should be a word, then you need ShingleFilter(), and the result of ShingleFilter() behaves as follows:

InputText = “This is the best, buy it.”

For bigrams, you will get "best buy" as one of the bigrams.

Here is a demo of how I tested it.

NGramTokenizer()

	// Character n-gram analyzer: NGramTokenizer emits character
	// n-grams straight from the raw input.
	private static Analyzer WV_ANALYZER = new LocaleAnalyzer(new Locale("en")) {
		protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
			Tokenizer source = new NGramTokenizer(reader, 1, 2); // minGram = 1, maxGram = 2
			TokenStream result = new StandardFilter(matchVersion, source);
			return new TokenStreamComponents(source, result);
		}
	};

That's the definition of the Analyzer using NGramTokenizer().

ShingleFilter()

	// Word n-gram analyzer: tokenize into words, lowercase them, then
	// ShingleFilter glues adjacent words together into bigrams.
	private static Analyzer ANALYZER_NGRAM = new LocaleAnalyzer(new Locale("en")) {
		protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
			Tokenizer source = new LDATokenizer(matchVersion, reader);
			TokenStream result = new StandardFilter(matchVersion, source);
			result = new LowerCaseFilter(matchVersion, result);
			ShingleFilter shingle = new ShingleFilter(result, 2); // NOTE: asking for bigrams
			shingle.setOutputUnigrams(true); // emit the single words too
			result = new StopFilter(matchVersion, shingle, getStopwords());
			return new TokenStreamComponents(source, result);
		}
	};

Now let’s call them:

		String text = "This is the best, buy it.";

		TokenStream stream = ANALYZER_NGRAM.tokenStream(null, new StringReader(text));
		stream.reset();
		System.out.println("ANALYZER_NGRAM:");
		while (stream.incrementToken()) {
			String token = stream.getAttribute(CharTermAttribute.class).toString();
			System.out.println(token);
		}

		stream = WV_ANALYZER.tokenStream(null, new StringReader(text));
		stream.reset();
		System.out.println("WV_ANALYZER:");
		while (stream.incrementToken()) {
			String token = stream.getAttribute(CharTermAttribute.class).toString();
			System.out.println(token);
		}

Results are:

ANALYZER_NGRAM:
this is
is the
the best
best buy
buy
buy it
WV_ANALYZER:
Th
hi
is
s 
 i
is
s 
 t
th
he
e 
 b
be
es
st
t,
, 
 b
bu
uy
y 
 i
it
t.

Is Legal Industry Ready for Big Data? (II)

November 20th, 2013

Thinking about this some more, maybe we (software guys) should think the way lawyers do. Here is my speculation from the perspective of a solo or small law practice.

What would keep me from forming my own law practice?

First of all, the costs associated with opening one's own firm: administrative staff, email, fax, phone, general office administration, and a very up-to-date log of ongoing cases. So far, we are talking about investment in both hardware and people.

Second: it is hard for solos to get clients. How do such lawyers market themselves? After all, it can be expensive to get the message out, especially since lawyers are not necessarily good at marketing; not that I blame them. I'd rather my lawyer be a terrific lawyer, not just a so-so lawyer with great marketing talent.

Third: law practice depends quite heavily on experience and knowledge, so how to provide better service is both a question and a challenge. After all, for a client it is about one thing: winning. If we knew there were only one person who could help win a case, every calculation would be different. A bigger firm presumably has a large talent pool to help on cases. Why should a client choose a smaller firm or a solo practitioner if the perception is that they are inferior?

Last but not least: what's the pricing position for a small firm or a solo? In general it should be lower than the big firms', but if the case is small, how much lower can you go? Presumably the cases they get are very likely small, so the profit margin at the end of the day might be thin.

This comes from my small, survey-style conversations with different people, and so far it poses a daunting task for software. But it is not impossible to crack the legal industry from other angles or for other target users. Even focusing on just these concerns, one could build software products that contribute. Anyway, it's worth taking these concerns into consideration in advance.

Is Legal Industry Ready for Big Data?

October 28th, 2013

Although the legal industry has stubbornly resisted IT, it may be more eager to join the big data world. In particular, the advanced analytics tools driven by big data could be applied to legal data and transform legal analysis. After all, two of the goals of big data analytics are discovering patterns in a more holistic view and making reliable, actionable predictions; given the availability of large amounts of data, a specifically suited algorithm can help reach both. On the other hand, some of the goals of legal analysis are very similar to those of big data analysis.

But as a biased machine learning person, the question I have is: is the legal industry ready for big data? Or, what are the reasons big data can't help the legal industry?

A small OpenMP practice

October 21st, 2013

I had to parallelize a piece of code. The requirements were: a single node with multiple cores, and RAM possibly limited depending on the data set. OpenMP seemed a good fit for my purpose; here is my solution:

	#pragma omp parallel num_threads(MAX_THREADS_NUMBER)
	{
		// Split the range [0, N) into one contiguous block per thread
		int total_threads = omp_get_num_threads();
		int ID = omp_get_thread_num();
		int block = (int) (N / total_threads);
		int start = ID * block;
		int end = (ID + 1) * block;
		if (ID == (total_threads - 1)) { end = N; } // last thread picks up the remainder

		// Each thread gets its own copy of the tree (see below for why)
		VpTree* tree = new VpTree(X);

		for (int n = start; n < end; n++) {
			search(n, tree);
		}
		delete tree; // free the per-thread copy
	}

At the beginning, the start and end of each thread's loop segment are defined manually. This is not necessary in many cases, since one could simply use

	#pragma omp parallel for
	for (int n = 0; n < N; n++) {
		search(n, tree);
	}

But the problem was that "search()" does some messy, complicated, un-neat reading through the tree pointer, so with a shared tree I always ended up with a segmentation fault. That's why I duplicated the pointer "tree" (in fact, the whole tree) for each thread. The maximum number of threads allowed is pre-defined in order to keep RAM usage in check, since every thread now holds its own copy. There might be a more elegant solution, but I'm easy to satisfy.
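
For what it's worth, the two approaches can be combined: keep a per-thread copy of the tree to avoid the segfault, but let OpenMP partition the loop. A minimal sketch, assuming the same VpTree, X, N, search() and MAX_THREADS_NUMBER as above:

	#pragma omp parallel num_threads(MAX_THREADS_NUMBER)
	{
		// per-thread copy of the tree, as before
		VpTree* tree = new VpTree(X);

		// omp for splits the iterations across the threads of the
		// enclosing parallel region; no manual start/end bookkeeping
		#pragma omp for
		for (int n = 0; n < N; n++) {
			search(n, tree);
		}

		delete tree;
	}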

Last but not least, I benefited a lot from the presentation by Tim Mattson and Larry Meadows of Intel: A "Hands-on" Introduction to OpenMP.

KDD 2013 Day 3

September 6th, 2013

The keynote today was Andrew Ng on his online learning website, Coursera. I really admire his mission, which is to provide education to everyone on this planet for free. In my opinion, education is one key differentiator between humans and other animals: we humans evolved to live together and to learn language and truths about the world. As Andrew said, it is shocking to hear that some people in the world still think education is only for elites. I, for one, am a beneficiary of education, so I'm probably biased. I strongly encourage everyone to take a look at the website.


KDD 2013 Day 2

September 3rd, 2013

The keynote speaker today was Raghu Ramakrishnan from Microsoft. The title of his presentation was "Scale-out Beyond Map-Reduce". He presented his work on a better-than-MapReduce solution for machine learning. He argued that MapReduce is not meant for iterative learning at all, which pretty much means it is not suited for machine learning. His system is called REEF, and it is built on top of YARN, a resource management system. For my part, I'm very glad to hear such a decisive comment on iterative learning on MapReduce: last year I indeed spent time running iterative algorithms on MapReduce and reached the same conclusion.

Kevin Bache presented interesting research on "Text-based measures of document diversity". He runs topic modeling first, and then uses the topics' word-distribution vectors to compute a document diversity score. As he said in front of his poster, one doesn't need to stick to a particular topic model, since his score function is quite generic.


KDD 2013 Day 1

August 21st, 2013

This year KDD is in Chicago. Day 1 is full of workshops.

One workshop was AdKDD, which is entirely about advertising. One speaker, Brian Burdick, presented "Advertising – Why Human Intuition Still Exceeds Our Best Technology". It was very impressive, though I totally disagree with his point about human intuition exceeding our best technology. But that is very much another debate, and it depends heavily on definitions. Regardless, one inspiring story he told was the history of diamond sales in the US. Following his presentation, I did a bit of digging; a brief version is here:

Before the 1950s, De Beers, the monopoly supplier of raw diamonds to the world, desperately wanted to boost diamond sales in the US. The US still felt the impact of the Great Depression at that time, and diamond sales were very poor. Back then, people quite often put jewels other than diamonds on rings; emerald was a pretty popular choice. Today a guy could get killed if he ever thought of buying an engagement ring with an emerald on it. Because of antitrust law, De Beers was prohibited from promoting itself or any particular jewelry in its marketing. So De Beers worked with N.W. Ayer & Son, an advertising agency, on using famous painters such as Picasso to create very beautiful, peaceful, harmonious paintings for the advertising. But they lacked an inspiring slogan to go with the paintings. At that time, Frances Gerety, a young woman, was looking for a job at N.W. Ayer. It happened that the copywriter had just left and Ayer needed to fill the position soon, so Gerety got the job right away, even though at that time women were rarely given such respected jobs. It was she who coined "A Diamond Is Forever" in 1947, the line that went into the advertising. The slogan immediately resonated in millions of women's hearts, and diamond sales in the US skyrocketed. In 2011, 75% of brides in the US wore a diamond ring, a $7 billion market. She owned the copyright to that sentence for 20 years.

Four words from a young woman have made the men in the US who would want to buy an emerald engagement ring all but extinct over the last 60 years.

Brian said his slides will be online; we'll see. Anyway, it's worth checking out if you can.

Query Clustering based on Bid Landscape for Sponsored Search Action Optimization” is a poster from Ye Chen at Microsoft. Based on a brief discussion with him at his poster, my understanding is that, the ad keywords biding system needs reduce the dimensionality of queries. So he clustered key words based on a Click-Through-Rate(CTR) probability profile. For example, key words with similar CTR will be grouped together and then in biding, queries of those key words will be given same Cost-Per-Click(CPC). Overall, the new design financially benefits Microsoft. The paper is very analytical and worth a reading.