Archive for the ‘general’ Category

We Demand More Like Andrew Ng!

Wednesday, August 9th, 2017

I just learned that Andrew Ng will start a deep learning Coursera class. I'm very excited to hear it. From my admittedly shallow experience in industry and academia, we as a healthy society need, and should demand, more serious education efforts like this.

Thanks to the huge success of Fei-Fei Li's ImageNet, the deep learning that Geoffrey Hinton envisioned and started, and that Yoshua Bengio and Yann LeCun carried on, was made a household name by DeepMind's AlphaGo. Now people almost everywhere talk about machine learning, artificial intelligence, deep learning, and so on. In particular, industries across fields have been embracing the technology like never before. All of a sudden, small startups of a few people with a few know-hows get acquired, and the speed of those acquisitions is unprecedented. Money, vanity, and the rest fly high like all the old bubbles we have seen in past decades. Yet I don't think this is a bubble. Deep learning is for real and will revolutionize many industries and unleash human potential, just as electricity did for humanity. All in all, though, there are a few clouds that could upend this prosperity.

The first is the huge disparity between what industries need and what the talent pool can offer. As Thomas Friedman once said (loosely cited here), a breakthrough is what happens when the desperately needed meets the suddenly available. Human society needs a long-overdue efficiency upgrade, especially after the internet. In 2012, deep learning came out of seemingly nowhere. All of a sudden, many problems that seemingly only humans could solve were solved by this mysterious thing called deep learning. Talent became the blood and resource that everybody wants. Yet academia, which by and large wasn't there for the nurturing and birth of deep learning, of course cannot produce enough such talent. But demand just keeps piling up. So we end up with lots of expedited "talents": among the few real ones, there are many who are less well trained and were exposed to overly cozy environments and problems. Undoubtedly many of them will not be able to live up to industry demands in many ways. Sooner or later, disappointment will be everyone's focus.

The second is that academia really hasn't had time to give the whole field a strong and solid theoretical foundation. We still don't know why deep learning works; there are some educated guesses, but a clear explanation is absent. I haven't seen any great advancement sustain its ups and downs without a mathematical foundation.

I should stop here; too much nagging. What I want to say is that we as a society should demand more people like Andrew Ng, and even beyond. Things happen for a reason and exist for reasons too; our society needs to demand more in order to advance humanity.

nvidia_346_uvm error in Caffe on AWS g2 instance

Tuesday, July 26th, 2016

I got the following error after rebooting my g2 instance. The code is in Python, and prior to the reboot it worked fine.

modprobe: ERROR: ../libkmod/libkmod-module.c:809 kmod_module_insert_module() could not find module by name='nvidia_346_uvm'

modprobe: ERROR: could not insert 'nvidia_346_uvm': Function not implemented

I googled around and found no direct solution, so I took a stab at it myself by running:

sudo apt-get remove nvidia-346-uvm

and then reboot.

Surprise. It works!

Still don’t know why it works.
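
For what it's worth, a quick way to confirm the GPU is usable again from Python is to switch pycaffe into GPU mode. This is just a minimal sanity check I would use, assuming Caffe's Python bindings are installed; it is not part of the original fix:

import caffe

caffe.set_device(0)    # the first GPU on the g2 instance
caffe.set_mode_gpu()   # errors out if CUDA cannot initialize (e.g. the uvm module is still broken)
print("GPU mode enabled")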

Simple But Surprisingly Good Clustering “Algorithm”

Sunday, January 17th, 2016

This is one of those jaw-dropping papers: density clustering, astonishingly simple and yet with phenomenal performance. There is no need to put it into math, and you can hardly even call it an "algorithm" (though it truly is one). Here is how it works:

The input is a distance/similarity matrix (pairwise distances or similarities for all data points) and a cutoff distance/similarity.

  1. For every point, count how many other points are within the cutoff distance. The count is the density of the current point.
  2. For every point, find all other points having a higher density. Among those with higher density, find the smallest distance, and use it as the current point’s distance.
  3. Plot density vs distance.
  4. The outliers in the plot (points with both high density and high distance) are the cluster centers.
  5. Assign every remaining point to the cluster of its nearest neighbor of higher density.

No fancy math at all, and it seems to work. It has an R library too. Well done. A rough Python sketch of the steps is below.
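
Here is a minimal NumPy sketch of the five steps, my own illustration rather than the paper's code. The names dist, cutoff and n_centers are made up for this example, and picking the top n_centers values of density times distance stands in for eyeballing the plot:

import numpy as np

def density_peaks(dist, cutoff, n_centers):
    n = dist.shape[0]
    # Step 1: density = number of other points within the cutoff distance.
    rho = (dist < cutoff).sum(axis=1) - 1          # subtract 1 to exclude the point itself
    # Step 2: delta = distance to the nearest point of higher density.
    order = np.argsort(-rho)                       # densest point first
    delta = np.zeros(n)
    nearest_higher = np.full(n, -1)
    delta[order[0]] = dist[order[0]].max()         # densest point: use its largest distance
    for rank in range(1, n):
        i = order[rank]
        higher = order[:rank]                      # points processed earlier, i.e. at least as dense
        j = higher[np.argmin(dist[i, higher])]
        delta[i] = dist[i, j]
        nearest_higher[i] = j
    # Steps 3-4: centers are the outliers with both large rho and large delta;
    # here we take the largest rho * delta values instead of reading the plot by eye.
    centers = np.argsort(-(rho * delta))[:n_centers]
    # Step 5: walk from dense to sparse, assigning each point to the cluster of its
    # nearest neighbor of higher density.
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_centers)
    for i in order:
        if labels[i] == -1:
            j = nearest_higher[i]
            labels[i] = labels[j] if j >= 0 else 0  # densest point falls back to cluster 0
    return labels

Calling it as labels = density_peaks(dist, cutoff, n_centers=3) returns one cluster label per point.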

Spark Error: Too many open files

Monday, December 22nd, 2014

This is a typical Spark error that happens on Ubuntu (and probably other Linux distributions too). To resolve it, one can do the following:

Edit the file /etc/security/limits.conf to add:

* soft nofile 55000
* hard nofile 55000

55000 is just the number I use as an example; you can choose a larger or smaller value. It raises the limit on the number of open files to 55,000.

After saving the changes, you will need to REBOOT to make it effective.
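
To verify that the new limit actually took effect after the reboot, here is a quick check from Python using only the standard library; this is just a sanity check I would add, nothing Spark-specific:

import resource

# Current soft and hard limits on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)   # expect 55000 55000 after the change above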

One note: I recommend not going crazy with this number. For example, I once set it to 1,000,000, and Spark generated so many temporary files that my hard disk had a very hard time deleting them.

Python+Redis

Wednesday, June 18th, 2014

I was using a Python dictionary to manage my database, which turned out to be a disaster. Well, I shouldn't have tried it in the first place; it was painfully slow with more than 10,000 data entries. Redis, on the other hand, is an in-memory database that has been gaining fame quite dramatically and quickly among tech followers. So I gave it a try. Here is the time consumed on my Mac:
[Figure: Redis performance. The x-axis is the number of data entries; the y-axis is the time, in seconds, for redis+python to store those entries.]

For the record, I did manage to use a dictionary to store about 10,000 data entries, but after waiting for so long I decided to abandon it entirely. It is safe to say it is off this chart.
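
For reference, here is a rough sketch of the kind of timing loop behind the chart, using the redis-py client against a local server. The key and value formats are made up for illustration:

import time
import redis

r = redis.Redis(host="localhost", port=6379)   # assumes a local redis-server is running

n = 100000                                     # number of data entries to store
start = time.time()
for i in range(n):
    r.set("entry:%d" % i, "value %d" % i)      # one SET per entry
print("stored %d entries in %.2f seconds" % (n, time.time() - start))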

Process the wikipedia dump data

Tuesday, May 6th, 2014

The entire Wikipedia dump can be downloaded from here.

In order to get the articles, one way is to use the wikiprep code, which is written in Perl, my ex-favorite language. I ran into problems when I tried to run it after installation. For example, when running wikiprep, the output on screen was:

Can't locate Log/Handler.pm in @INC (@INC contains: /Library/Perl/5.16/darwin-thread-multi-2level /Library/Perl/5.16 /Network/Library/Perl/5.16/darwin-thread-multi-2level /Network/Library/Perl/5.16 /Library/Perl/Updates/5.16.2/darwin-thread-multi-2level /Library/Perl/Updates/5.16.2 /System/Library/Perl/5.16/darwin-thread-multi-2level /System/Library/Perl/5.16 /System/Library/Perl/Extras/5.16/darwin-thread-multi-2level /System/Library/Perl/Extras/5.16 .) at /usr/local/bin/wikiprep line 40.

BEGIN failed--compilation aborted at /usr/local/bin/wikiprep line 40.

To solve this problem, after several rounds of trial and error and Google searches, the solution is to install whichever module is missing, in this case "Log::Handler". So I ran:

sudo cpanm Log::Handler

A note: I had already installed cpanm. Installing the missing module with cpanm made the problem go away, and now I'm running wikiprep to get the actual articles out of the dump with the following command:

wikiprep -format composite -compress -f ../enwiki-20140402-pages-articles-multistream.xml.bz2 > out

Port binding in DigitalOcean Ubuntu

Sunday, February 23rd, 2014

I had trouble binding to port 80 on a DigitalOcean Ubuntu server and was rescued by this page on Stack Overflow. Following the page, what one needs is:

sudo iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 -j REDIRECT --to-port 3000

Then edit the file "/etc/rc.local" (notice that it is NOT the other file suggested on the webpage). Editing means adding the above command with a small modification:

iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 -j REDIRECT --to-port 3000

So far, it solved my problem and everything is working.

Limit RAM size used by redis

Tuesday, February 18th, 2014

If you want to limit the maximum amount of RAM allocated to redis, you need to change the file named "redis.conf" under the path where redis is installed. The trick is that when you run "./redis-server", you have to run it from the directory where the "redis.conf" file is; unless you specifically pass that modified "redis.conf" when starting "redis-server", it won't use the modified file at all. What I did, for example, to allocate only 100MB of RAM to redis:

Just add one line, "maxmemory 100M", to "redis.conf", then start "./src/redis-server". It should work. But be aware of the possible consequences of such a restriction: an "out of memory" error might arrive uninvited…
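
To confirm the setting was actually picked up, one can ask the running server, for example from Python with the redis-py client. This is a small check of my own, assuming the server started above is listening on the default port:

import redis

r = redis.Redis(host="localhost", port=6379)
# CONFIG GET reports the current maxmemory limit in bytes; 0 means "no limit",
# so a non-zero value here means redis.conf was actually read.
print(r.config_get("maxmemory"))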

Comparing PaaS, Heroku vs Elastic Beanstalk vs Linode vs DigitalOcean?

Monday, February 17th, 2014

For this part I really did my homework, after spending a lot of time on trial and error. Google's PaaS doesn't support Java, so sorry, it's off the list.

The free tiers of Heroku and Elastic Beanstalk work great if your app is relatively small, that is, up to 300MB for Heroku and 512MB for a t1.micro on EB. If your files are larger than those two limits, you simply won't be able to submit them. When I was blindly trying to fix my upload problem, there was no obvious statement of the two limits anywhere (at least not where I looked). On EB, the t1.micro allocates at most around 650MB of RAM; if you need more, you will have to upgrade to a paid tier.

MEAN infrastructure: both have excellent support for it, Heroku in particular, where users generally don't need to run npm install at all. They also have many integrated libraries available through their web consoles; you can simply click "add". Redis is available on Heroku as well. However, I never figured out how to ssh in to change anything on the server. In contrast, EB doesn't offer Redis at all, only PostgreSQL and DynamoDB for databases. The good thing is that it allows you to connect to the server via ssh, so you can hopefully do more exotic things. One trivial but definitely time-consuming thing to figure out: you need to zip your entire app with package.json and app.js in the root of the archive. If your zip creates an extra subdirectory, you will never see your app come up (at least not so far). You will be able to see your app at /var/local/current. Again, don't zip the directory itself; zip all the files and subdirectories inside it.

I didn't explore the price tags on Heroku, but EB/AWS is not cheap at all, particularly compared to Linode.com and digitalocean.com. I settled on digitalocean.com because it offers 2GB RAM + 40GB SSD for $20/month. Their tech support is more than excellent: every question so far has been answered within a few hours, some within 30 minutes. The nice thing about DigitalOcean is that they provide MEAN as well. The catch is that for some things you are on your own; for example, DNS binding you need to figure out yourself. They do provide excellent docs to help, though.

 

Lucene Ngramtokenizer

Wednesday, February 12th, 2014

From what I found, NGramTokenizer() in Lucene is really a character-level tokenizer. If a gram should be a word, then you need ShingleFilter(). The result of ShingleFilter() has the following behavior:

InputText = “This is the best, buy it.”

For bigrams, you will get “best buy” as a bigram (the comma is dropped by the tokenizer, so “best” and “buy” end up adjacent).

Here is a demo of how I tested it.

NgramTokenizer()

	private static Analyzer WV_ANALYZER = new LocaleAnalyzer(new Locale("en")) {
		protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
			Tokenizer source = new NGramTokenizer(reader, 1, 2); // character n-grams of length 1 to 2
			TokenStream result = new StandardFilter(matchVersion, source);
			return new TokenStreamComponents(source, result);
		}
	};

That’s the definition for the Analyzer using NgramTokenizer().

ShingleFilter()

	private static Analyzer ANALYZER_NGRAM = new LocaleAnalyzer(new Locale("en")) {
		protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
			Tokenizer source = new LDATokenizer(matchVersion, reader);
			TokenStream result = new StandardFilter(matchVersion, source);
			result = new LowerCaseFilter(matchVersion, result);			
		        ShingleFilter shingle = new ShingleFilter(result, 2); //NOTE: asking for bi-gram
		        shingle.setOutputUnigrams(true);
			result = new StopFilter(matchVersion, shingle, getStopwords());
			return new TokenStreamComponents(source, result);
		}
	};

Now let’s call them:

		String text = "This is the best, buy it.";

		TokenStream stream = ANALYZER_NGRAM.tokenStream(null, new StringReader(text));
		stream.reset();
		System.out.println("ANALYZER_NGRAM:");
		while (stream.incrementToken()) {
			String token = stream.getAttribute(CharTermAttribute.class).toString();
		    System.out.println(token);
		}

		stream = WV_ANALYZER.tokenStream(null, new StringReader(text));
		stream.reset();
		System.out.println("WV_ANALYZER:");
		while (stream.incrementToken()) {
			String token = stream.getAttribute(CharTermAttribute.class).toString();
		    System.out.println(token);
		}

Results are:

ANALYZER_NGRAM:
this is
is the
the best
best buy
buy
buy it
WV_ANALYZER:
Th
hi
is
s 
 i
is
s 
 t
th
he
e 
 b
be
es
st
t,
, 
 b
bu
uy
y 
 i
it
t.