Nutch – Installation

Nutch is a flexible and powerful open source tool for web crawling, developed by the Apache Software Foundation and its community. It builds on Apache Solr and comes with an integration of the highly popular Apache Hadoop, which actually started out as a subproject of Nutch. Nowadays Nutch is widely used and probably the most popular tool in its niche. However, the software does not come with a graphical user interface, which makes it difficult for new users to get their feet wet. Hence, this post aims to ease the first steps by describing my experiences with the software. All the information is based on Windows 7 as the underlying operating system and on Apache Nutch 1.4.

Installation

I found it very useful to get Nutch up and running in Eclipse. Not only does it make debugging much easier and allow for faster development, but it also improves the user experience, since you don’t need to switch between the file system, Cygwin and several command-line interfaces. This brings us to the first part of this section:

  1. Start by strictly following the instructions on how to run Nutch in Eclipse.

Set aside your ego! I know, you never follow tutorials entirely since almost always there is a faster way of doing the same. This time, however, it is absolutely necessary to follow the instructions! And before digging through the tutorial, read these useful comments:

  • Downloading the newest Java version (JDK and JRE!) might help you avoid unexpected errors. You can find the software here.
  • For Subclipse to run, I had to get rid of a couple of version mismatches. In case you want to avoid that hassle, simply don’t install Subclipse. Instead you can just download the binary of Nutch and specify its location when creating a new Java project in Eclipse (uncheck “use default location” and point to the Nutch directory). Keep in mind, though, that some instructions on the wiki page above might not be 100% correct anymore (e.g. jars might already be added).
  • Do not worry about the “Do not build Nutch now.” instruction of the tutorial when you notice the building process that Eclipse starts automatically after creating a new Java project. As long as you don’t manually run build.xml, you’re fine.
  • After the instruction “Ensure that you’re in the Package Explorer > right click on Trunk Project folder.” you need to click on “build path” > “configure build path” > “folders”
  • In case you face an error during the installation process, take a look at section number 2 of this chapter.

Throughout the installation process, I faced a couple of errors for which I couldn’t find a solution online. Hence, the second part of the installation section deals with:

  2. If things do not work…
  • analyze the file hadoop.log for debugging (can be found at nutch/log or at nutch/runtime/local/log).
    It is more verbose than the console output.
  • “plugin.folders” value:
    I have no idea why the tutorial tells you to “ensure that you change the property “plugin.folders” to “./src/plugin” on $NUTCH_HOME/conf/nutch-site.xml”. In my experience leaving the default untouched (“plugins”) works just fine.
  • Missing or wrong dependencies within the project:
    Usually some jars are missing. In my case I had to download 3 additional jars and add them to the build path (right click on project > build path > configure build path > Libraries > Add JARs): nekohtml, tagsoup and springsource.com.sun.syndication
  • Error Messages when trying to build Nutch:
    – “Unable to find a javac compiler; com.sun.tools.javac.Main is not on the classpath. Perhaps JAVA_HOME does not point to the JDK.”:
    What solved it for me was to set the JAVA_HOME environment variable so that it points to the JDK directory (./java/jdk1.7.0_02 in my case) and to restart Windows afterwards.
    – The very first time you build Nutch, it needs to download files from the Internet. In case you use a proxy and didn’t tell Eclipse about it, you will get an error. The solution is either to configure the proxy in Eclipse (which is somewhat challenging), or to just build Nutch once while you’re not behind a proxy (which is the easier way). All subsequent builds won’t need to download files.
  • “Input path does not exist”:
    open the Run Configuration for Nutch, go to Arguments, select Working Directory: “Other” and enter ${workspace_loc:Nutch/runtime/local}
  • The first crawl fails with java.io.IOException: Failed to set permissions of path: \tmp\hadoop-… :
    Download and install Cygwin and add it to your PATH environment variable (e.g. C:\cygwin\bin). If that does not do the trick, then try downgrading Hadoop to a previous version like 0.20.0. You can find instructions on how to do that here.
  • for all other errors take a look at the RunNutchInEclipse-Tutorial and ErrorMessages-in-Nutch, search online or make use of the mailing list for users (this should be the very last option though).
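
Two of the checks above (the JAVA_HOME error and reading hadoop.log) can be sketched in a few lines of Python. This is only a minimal sketch, not part of Nutch or the tutorial: the helper names `find_javac` and `problem_lines` are made up for illustration, and the log filter assumes that hadoop.log uses log4j-style level names such as WARN or ERROR.

```python
import os
import re

# Assumption: hadoop.log contains log4j-style level names as standalone words.
LEVEL_PATTERN = re.compile(r"\b(WARN|ERROR|FATAL)\b")


def find_javac(java_home):
    """Return the path of the javac binary below java_home, or None if there is none."""
    if not java_home:
        return None
    # A JDK ships bin/javac (bin\javac.exe on Windows); a plain JRE does not.
    for name in ("javac.exe", "javac"):
        candidate = os.path.join(java_home, "bin", name)
        if os.path.isfile(candidate):
            return candidate
    return None


def problem_lines(log_lines):
    """Keep only the log lines that mention a WARN, ERROR or FATAL level."""
    return [line.rstrip("\n") for line in log_lines if LEVEL_PATTERN.search(line)]


if __name__ == "__main__":
    javac = find_javac(os.environ.get("JAVA_HOME"))
    print(javac or "JAVA_HOME does not seem to point to a JDK")
```

To filter the log, open hadoop.log from whichever of the two locations mentioned above exists on your machine and pass the file object to `problem_lines`.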

Overall, installing Nutch in Eclipse sometimes turns out to be more of an art than one would expect. However, once it works and you can play around and crawl some web pages, I’m sure you will have fun with the software! Feel free to take a look at my next article about Nutch, which provides a short overview of how Nutch works.


7 Comments

  1. This post helped me a lot in installing and running apache nutch. It’s crisp and clear.

    I am facing a problem in crawling website.

    my regex_urlfilter.txt had

    +^http://([a-z0-9]*\.)*abc.org/science/physics/

    later I changed to

    +^http://([a-z0-9]*\.)*abc.org/([a-zA-Z0-9\-\_/])*/v/

    Now, when I run crawling it is giving the following message.

    Generator: 0 records selected for fetching, exiting …
    Stopping at depth=0 – no more URLs to fetch.
    No URLs to fetch – check your seed list and URL filters.
    crawl finished: crawl-20121017164503

    In other words, nutch is not allowing me to crawl again when I am using the same or part of URL.

    Please help me in solving the issue.

    Thank you in advance.

  2. Hi Swamy, what does your seed list look like? If no URL in there matches your second regex, then no URLs will be selected for fetching and you will see exactly the output you described. If that is not the reason, then check all your URL filter files as well as all the URL normalizer files in detail (maybe Nutch thinks that the URL has already been crawled because it is normalized to one that you already crawled). Also, start a new “database” folder for every new run in case you don’t want to add new URLs to the already existing database. Best, Florian

  3. Hi Florian,
    I would need a helpful tool like this for an experiment for my master’s thesis, but I have no training in computer science at all. Is there something analogous to Nutch that runs on a Mac?
    Hopeful regards,
    Letitia

  4. Thanks for pointing that out, Tejas…
    Please note that I don’t work with Nutch anymore and therefore am not updating the information. Although the link “how to run Nutch in Eclipse” still points to the correct web page, my instructions might be outdated.
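
As a follow-up to the URL filter question in the first two comments: before starting a crawl, it can help to check which seed URLs a filter rule lets through. The Python sketch below is only an approximation under stated assumptions. It assumes that the first rule whose pattern is found in a URL decides, and that a URL matching no rule is rejected. Nutch itself evaluates the rules with Java regular expressions, which differ from Python’s in some details. The seed URLs and the helper names `parse_rules` and `accepts` are made up for illustration.

```python
import re


def parse_rules(text):
    """Parse '+regex' and '-regex' lines; blank lines and '#' comments are skipped."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line[0] not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule found in the URL decides; no match means reject."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    # The second rule from the comment above; the seed URLs are hypothetical.
    rules = parse_rules(r"+^http://([a-z0-9]*\.)*abc.org/([a-zA-Z0-9\-\_/])*/v/")
    for seed in ("http://www.abc.org/science/physics/",
                 "http://www.abc.org/science/physics/v/intro"):
        print(seed, "accepted" if accepts(rules, seed) else "rejected")
```

Under these assumptions, a seed that lacks the /v/ path segment is rejected by the second rule, which would leave the generator with nothing to fetch.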
