This week I set up a single-node Hadoop cluster in the lab, using Ubuntu 12.04 LTS running in a VMware 5.0 VM. The frustrating part of the installation and configuration was the lack of good documentation. I am not going to write an exhaustive recipe here, because it would soon be out of date (as is one of my sources…). Instead I will share the three online sources that I pieced together to finally get it done.
The Apache instructions, above, are incomplete. The authors appear to assume a good deal of prior knowledge: the document leaves out many details that, if you don’t already know them, will keep you from succeeding. I don’t see how a novice could possibly get a working cluster using these instructions alone.
Michael Noll’s tutorial, above, is excellent. He takes you step by step through the process, explaining things along the way. The problem is that Mr. Noll wrote it for an older version of Hadoop. A small but significant portion of the instructions is incorrect for the latest versions of Hadoop.
Using the first two sources I was able to get the Hadoop cluster running, but when I submitted my first map-reduce job using mrjob, I ran into errors. The Stack Overflow answer, link below, describes the error I saw *and* a fix for it. The fix updates some of the out-of-date configuration information that Michael Noll’s otherwise excellent tutorial gets wrong.
Now I am able to run map-reduce Python scripts based on the mrjob package. They are massively SLOW. Perhaps that will be the subject of another post.
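For readers who have not seen one, here is a rough sketch of the shape a map-reduce word count takes in Python. It is not one of the scripts from the lab and it does not use mrjob or Hadoop at all: `mapper`, `reducer`, and `run_local` are hypothetical names chosen for illustration, and the grouping step that Hadoop would perform across the cluster is simulated in memory with a dictionary.

```python
import re
from collections import defaultdict

# Matches runs of word characters and apostrophes.
WORD_RE = re.compile(r"[\w']+")


def mapper(_, line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reducer(word, counts):
    """Reduce step: sum all the counts collected for a single word."""
    yield word, sum(counts)


def run_local(lines):
    """Hypothetical in-memory driver standing in for the cluster.

    It runs the mapper over every line, groups values by key (the
    'shuffle' that Hadoop would normally do), then runs the reducer
    once per key.
    """
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(None, line):
            grouped[key].append(value)

    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "The end"]
    for word, count in run_local(sample).items():
        print(word, count)
```

In a real mrjob script the mapper and reducer would be methods on a job class and the framework, not a local loop, would move the data between them; the sketch only shows the division of work between the two steps.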