Archive for May, 2014

Process the wikipedia dump data

Tuesday, May 6th, 2014

The entire wikipedia data can be downloaded from here.

In order to get the articles, one way is to use the wikiprep code, which is written in Perl, my ex-favorite language. I ran into problems when tried to run it after installation. For example, when ran wikiprep, the output on screen is:

Can’t locate Log/ in @INC (@INC contains: /Library/Perl/5.16/darwin-thread-multi-2level /Library/Perl/5.16 /Network/Library/Perl/5.16/darwin-thread-multi-2level /Network/Library/Perl/5.16 /Library/Perl/Updates/5.16.2/darwin-thread-multi-2level /Library/Perl/Updates/5.16.2 /System/Library/Perl/5.16/darwin-thread-multi-2level /System/Library/Perl/5.16 /System/Library/Perl/Extras/5.16/darwin-thread-multi-2level /System/Library/Perl/Extras/5.16 .) at /usr/local/bin/wikiprep line 40.

BEGIN failed–compilation aborted at /usr/local/bin/wikiprep line 40.

To solve this problem, after several tries and errors and Google searches, the solution is to install whatever missed module, here is “Log::Handler”. So I ran

sudo cpanm Log::Handler

Well, a note is that I installed cpanm already. And using cpanm installing the missed module made the problem goes away and now I’m running wikiprep to get the actual articles out of the dump with such a command:

wikiprep -format composite -compress -f ../enwiki-20140402-pages-articles-multistream.xml.bz2 > out