Archive for the ‘general’ Category

Comparing PaaS, Heroku vs Elastic Beanstalk vs Linode vs DigitalOcean?

Monday, February 17th, 2014

For this part, I really did my homework, after a lot of time spent on trial and error. Google’s PaaS doesn’t support Java, so, sorry, it’s off the list.

First, the free tiers of Heroku and Elastic Beanstalk. They work great if your app is relatively small: the limit is 300MB for Heroku and 512MB for a t1.micro on EB. If your upload is larger than those limits, you simply won’t be able to submit it. When I was blindly trying to fix my upload problem, there was no obvious statement of these two limits anywhere (at least not where I looked). On EB, the t1.micro allocates at most about 650MB of RAM; if you need more, you have to upgrade to a paid tier.

MEAN infrastructure: both have excellent support for it, Heroku in particular, where users generally don’t need to run npm install at all. They also have many integrated add-ons available through their web consoles; you can simply click “add”. Redis is available on Heroku as well. However, I never figured out how to ssh in and change anything on the server. In contrast, EB doesn’t offer Redis at all, only PostgreSQL and DynamoDB for the database. The good thing is that it lets you connect to the server via ssh, so hopefully you can do more exotic things. One trivial but definitely very time-consuming thing to figure out: you need to zip your entire app with package.json and app.js at the root of the archive. If your zip puts everything under a subdirectory, you will never see your app come up (at least as of now). You will find the deployed files at /var/local/current. Again, don’t zip the directory itself; zip the files and subdirectories inside it (see the sketch below).
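To make the layout issue concrete, here is a minimal sketch of building the archive so that entries are named relative to the app directory; the directory name myapp is hypothetical, and any zip tool that produces the same layout works just as well:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipOutputStream;

    public class ZipApp {
        public static void main(String[] args) throws IOException {
            Path appDir = Paths.get("myapp"); // hypothetical: your app's root directory
            List<Path> files;
            try (Stream<Path> walk = Files.walk(appDir)) {
                files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
            }
            try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream("app.zip"))) {
                for (Path p : files) {
                    // Name each entry relative to appDir, so package.json and app.js
                    // sit at the root of the archive instead of under myapp/.
                    String entryName = appDir.relativize(p).toString().replace('\\', '/');
                    zos.putNextEntry(new ZipEntry(entryName));
                    Files.copy(p, zos);
                    zos.closeEntry();
                }
            }
        }
    }

The point is only the entry naming: no leading directory prefix, so EB finds package.json at the root of the upload.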

I didn’t explore Heroku’s pricing, but EB/AWS is not cheap at all, particularly compared to Linode.com and digitalocean.com. I settled on digitalocean.com because it offers 2GB RAM + 40GB SSD for $20/month. Their tech support is more than excellent: every question so far has been answered within a few hours, some within 30 minutes. Another nice thing about DigitalOcean is that they provide MEAN as well. The catch is that for some things you are on your own; for example, you need to figure out DNS binding yourself. They do provide excellent docs to help, though.


Lucene NGramTokenizer

Wednesday, February 12th, 2014

From what I found, NGramTokenizer() in Lucene is really a character-level tokenizer. If you want each gram to be a word, you need ShingleFilter(). The output of ShingleFilter() behaves as follows:

InputText = “This is the best, buy it.”

For bigrams, you will get “best buy” as one of the tokens.

Here is the demo of how I tested.

NGramTokenizer()

    private static Analyzer WV_ANALYZER = new LocaleAnalyzer(new Locale("en")) {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            // Character-level n-grams: emits every 1- and 2-character substring.
            Tokenizer source = new NGramTokenizer(reader, 1, 2);
            TokenStream result = new StandardFilter(matchVersion, source);
            return new TokenStreamComponents(source, result);
        }
    };

That’s the definition of the Analyzer using NGramTokenizer().

ShingleFilter()

    private static Analyzer ANALYZER_NGRAM = new LocaleAnalyzer(new Locale("en")) {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer source = new LDATokenizer(matchVersion, reader);
            TokenStream result = new StandardFilter(matchVersion, source);
            result = new LowerCaseFilter(matchVersion, result);
            ShingleFilter shingle = new ShingleFilter(result, 2); // NOTE: asking for bigrams
            shingle.setOutputUnigrams(true); // also emit the single words themselves
            result = new StopFilter(matchVersion, shingle, getStopwords());
            return new TokenStreamComponents(source, result);
        }
    };

Now let’s call them:

        String text = "This is the best, buy it.";

        TokenStream stream = ANALYZER_NGRAM.tokenStream(null, new StringReader(text));
        stream.reset();
        System.out.println("ANALYZER_NGRAM:");
        while (stream.incrementToken()) {
            String token = stream.getAttribute(CharTermAttribute.class).toString();
            System.out.println(token);
        }

        stream = WV_ANALYZER.tokenStream(null, new StringReader(text));
        stream.reset();
        System.out.println("WV_ANALYZER:");
        while (stream.incrementToken()) {
            String token = stream.getAttribute(CharTermAttribute.class).toString();
            System.out.println(token);
        }
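For completeness, the snippets above assume roughly the following imports (Lucene 4.x); LocaleAnalyzer, LDATokenizer, and getStopwords() are custom pieces from the surrounding project, not part of Lucene, so they are not listed:

    import java.io.Reader;
    import java.io.StringReader;
    import java.util.Locale;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Analyzer.TokenStreamComponents;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.core.StopFilter;
    import org.apache.lucene.analysis.ngram.NGramTokenizer;
    import org.apache.lucene.analysis.shingle.ShingleFilter;
    import org.apache.lucene.analysis.standard.StandardFilter;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;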

Results are:

ANALYZER_NGRAM:
this is
is the
the best
best buy
buy
buy it
WV_ANALYZER:
Th
hi
is
s 
 i
is
s 
 t
th
he
e 
 b
be
es
st
t,
, 
 b
bu
uy
y 
 i
it
t.

Is the Legal Industry Ready for Big Data? (II)

Wednesday, November 20th, 2013

Let’s think about it more, and maybe we (software folks) should think the way lawyers do. Here is my speculation from the perspective of a solo or small law practice.

What keeps me from forming my own law practice?

First of all, there is the cost associated with opening one’s own firm: administrative overhead such as email, fax, phone, and general office administration, plus keeping a very up-to-date log of ongoing cases. So far, we are talking about investment in both hardware and people.

Second, it is hard for solos to get clients. How do such lawyers market themselves? After all, it can be expensive to get the message out, especially since lawyers aren’t exactly marketing experts; not that I blame them, though. I’d rather have a terrific lawyer than a so-so lawyer with great marketing talent.

Third, law practice depends quite heavily on experience and knowledge. How to provide better service is both a question and a challenge, because for a client it ultimately comes down to one thing: winning. If we knew there were only one person who could help win a case, every calculation would be different. A bigger firm presumably has a large talent pool to draw on for its cases. Why should a client choose a smaller firm or a solo practitioner if the perception is that they are inferior?

Last but not least, what is the pricing position for a small firm or a solo? In general, it should be lower than the big firms’, but if the case is small, how much lower can you go? Presumably, the cases they get are very likely small ones, so the profit margin at the end of the day might be thin.

This is based on my small, survey-style conversations with different people, and so far it poses a daunting task for software. Still, it is not impossible to crack the legal industry from other angles or for other target users. Even just focusing on the concerns above, one could build software products that contribute. In any case, it’s worth taking those concerns into consideration in advance.

Is the Legal Industry Ready for Big Data?

Monday, October 28th, 2013

Although the legal industry has stubbornly resisted IT, it may be more eager to join the big data world. In particular, the advanced analytics tools driven by big data can be applied to legal data and transform legal analysis. After all, two of the goals of big data analytics are to discover patterns in a more holistic view and to make reliable, actionable predictions. Given the availability of large amounts of data, a well-suited algorithm could help reach these two goals. And some of the goals of legal analysis are very similar to those of big data analysis.

But as an admittedly biased machine learning person, the question I have is: is the legal industry ready for big data? Or, what are the reasons big data can’t help the legal industry?

KDD 2013 Day 3

Friday, September 6th, 2013

The keynote today was Andrew Ng on his open learning website, Coursera. I really admire his mission, which is to provide education to everyone on this planet for free. In my opinion, education is one key differentiator between humans and other animals. We humans evolved to live together and to learn language and truths about the world. As Andrew said, it is shocking to hear that some people in the world still think education is only for elites. I, for one, am a beneficiary of education, so I’m probably biased. I strongly encourage everyone to take a look at the website.


KDD 2013 Day 2

Tuesday, September 3rd, 2013

The keynote speaker today was Raghu Ramakrishnan from Microsoft. The title of his presentation was “Scale-out beyond Map-Reduce”. He presented his work on a better-than-MapReduce solution for machine learning. As he argued, MapReduce is not meant for iterative learning at all, which pretty much means it is not suited for machine learning. His work, called REEF, is built on top of YARN, a resource management system. For my part, I’m very glad to hear such a decisive comment on iterative learning with MapReduce; last year I indeed spent time running iterative algorithms on MapReduce and reached the same conclusion.

Kevin Bache presented interesting research on “Text-based measures of document diversity”. He runs topic modeling first and then uses the topics’ word distribution vectors to compute a document diversity score. As he said in front of his poster, one doesn’t need to stick to a particular topic model, since his score function is quite generic.


KDD 2013 Day 1

Wednesday, August 21st, 2013

This year KDD is in Chicago. Day 1 is full of workshops.

One workshop was AdKDD, which is entirely about advertising. One speaker, Brian Burdick, presented “Advertising – Why Human Intuition Still Exceeds Our Best Technology”. It was very impressive, though I totally disagree with his point; I think our best technology will exceed human intuition. But that is very much another debate, and it pretty much depends on the definitions. Regardless, one inspiring story he told is the history of diamond sales in the US. Following his presentation, I dug into it a bit. A brief summary is here:

Before the 1950s, De Beers, the world’s monopoly supplier of raw diamonds, desperately wanted to boost diamond sales in the US. The US still felt the impact of the Great Depression at that time, and diamond sales were very poor. Back then, diamonds were far from the default choice for engagement rings; emerald, for example, was a pretty popular alternative. Today a guy could get killed if he even thinks of buying an engagement ring with an emerald on it. Because of antitrust law, De Beers was prohibited from promoting itself or any particular jewelry in its marketing. So De Beers worked with N.W. Ayer & Son, an advertising agency, on using famous painters such as Picasso to create very beautiful, peaceful, harmonious paintings for the ads. But they lacked an inspiring slogan to go with the paintings. At that time, Frances Gerety, a young woman, was looking for a job at N.W. Ayer. It happened that the copywriter had just left and Ayer needed to fill the position quickly, so Gerety got the job right away, even though at that time women were rarely given such respected jobs. It was she who coined “A Diamond Is Forever” in 1947, which was then used in the advertising. The slogan immediately resonated in the hearts of millions of women, and diamond sales in the US skyrocketed. In 2011, 75% of brides in the US wore a diamond ring, and it is a $7 billion market. She held the copyright to that sentence for 20 years.

Four words from a young woman have, over the last 60 years, made the men in the US who would want to buy an emerald engagement ring all but extinct.

Brian said his slides would be posted online; we’ll see. Anyway, worth checking out if you can.

“Query Clustering Based on Bid Landscape for Sponsored Search Auction Optimization” is a poster from Ye Chen at Microsoft. Based on a brief discussion with him at his poster, my understanding is that the ad keyword bidding system needs to reduce the dimensionality of queries. So he clusters keywords based on their Click-Through-Rate (CTR) probability profiles: keywords with similar CTR profiles are grouped together, and then in bidding, queries for those keywords are given the same Cost-Per-Click (CPC). Overall, the new design financially benefits Microsoft. The paper is very analytical and worth a read.

Attending ICLR 2013 (Day 3)

Monday, May 6th, 2013

It is Saturday and, yes, the conference is still going on. It is a light day, though, with no posters.

Herded Gibbs Sampling by Luke Bornn from Harvard is very promising. The Gibbs sampling variant they propose can achieve a convergence rate of O(1/T). He showed results from their herded Gibbs sampling and from regular Gibbs sampling: in terms of accuracy they both reach a similar level, but the herded Gibbs sampling gets there much faster. Worth a try if applicable.

The talk Feature Learning in Deep Neural Networks – A Study on Speech Recognition Tasks by Dong Yu from Microsoft yesterday is also very impressive. He showed that deep networks indeed help quite dramatically in speech recognition.

The Manifold of Human Emotions by Seungyeon Kim from Georgia Tech is very interesting. He used review data to define 32 emotions and then with very intuitive assumptions, he was able to find the manifold of the 32 emotions. I appreciate his passionate and detailed explanation of his work.

Overall, my feeling is that, on one hand, deep learning indeed works exceedingly well if judged only by performance. On the other hand, we really don’t know why it works, or, rather, what knowledge can be gained from it. Anyway, better is better; nobody wants their product to impress users with more mistakes, and the product is a black box to users anyway.

Attending ICLR 2013 (Day 2)

Friday, May 3rd, 2013

The talks today were very close to my own interests overall. For example, A Nested HDP for Hierarchical Topic Models by John Paisley is a very interesting approach. It is a hierarchical LDA for topic discovery, combining the Chinese restaurant process with LDA. John demonstrated his nested HDP on more than 1 million NY Times articles. He said it took about 350 iterations, roughly 10 hours, to converge, which is very appealing given that, I assume, he hasn’t used any parallel computing yet.

Zero-Shot Learning Through Cross-Modal Transfer by Richard Socher from Stanford is another interesting one. This is a very appealing concept: Richard showed that he could classify photos of unseen classes without labeled data for them. Although the work uses both text and photo data, one could obviously try it out on text-only data. Definitely worth reading more.

Efficient Estimation of Word Representations in Vector Space by Tomas Mikolov from Google is very exciting. Unlike traditional tf or tf-idf vectors, their word representation lives in a continuous space in which semantically similar words stay close together. He mentioned that he will make the code and the representations available soon, once approved. I personally am very interested in the one trained on the 100B-word data.

Learning New Facts From Knowledge Bases With Neural Tensor Networks and Semantic Word Vectors by Danqi Chen from Stanford is a very ambitious project, especially given that this is only her first year of graduate school. WordNet and similar resources contain relationships between entities, i.e., an ontology. We know those ontologies are not complete, or anywhere near good enough. Her goal is to learn a model that can predict missing entities for a relation, as well as predict a missing relation between entities. The results are fairly good given the daunting difficulty of the goal.

Attending ICLR 2013 (Day 1)

Thursday, May 2nd, 2013

This is my first time attending the International Conference on Learning Representations (ICLR) 2013. It is revolutionarily cool that this conference invented (or adopted) an open-review system, so that all the accepted papers are already online. I, for one, very much support such a review system, at least for conference publications.

So far, the majority of the talks focus on deep learning for image processing. But as some of them showed, the very same techniques can be applied to documents as well. Take, for example, the invited talk from Dr. Ruslan Salakhutdinov. He presented Deep Boltzmann Machines and their applications. One application is joint image and text analysis: he built a model that learns from the tags of photos, and the trained model can then be used for either image or text retrieval, which definitely has a lot of business interest. The other application is indeed topic learning. He briefly said the results are not very good yet, but one interesting point he made is that it does have some advantages over LDA.

Another interesting work is from Judy Hoffman. The title of her talk is “Efficient Learning of Domain-invariant Image Representations”. One of the points she made is that there is variation between the training data and new data. Once a model is trained, one way to incorporate the new, shifted data is to apply a transformation to it so that it is still captured by the trained model. I find this cool, and it could be used on documents as well. But does such a global transformation exist, given how varied the data can be?

Complexity of Representation and Inference in Compositional Models with Part Sharing by Alan Yuille is quite interesting. Alan is famous not only for his PhD advisor, Stephen Hawking, but also for his own brilliant work. I’m very much impressed with his quote (maybe he was quoting someone else): “The world is compositional or God exists”. His idea is very useful for image processing, and for document analysis as well.