Nutch – How It Works

After installing Nutch as described in my previous post, you can either follow this tutorial without having to think much, or first get a sense of how Nutch actually works. I recommend doing both in parallel. Since you won't find the latter on the Apache Nutch website, let me help you out in this matter. The following illustration depicts the major parts as well as the workflow of a crawl:

Nutch Overview

  1. The injector takes all URLs from the nutch.txt file and adds them to the crawldb. As a central part of Nutch, the crawldb maintains information on all known URLs (fetch schedule, fetch status, metadata, …).
  2. Based on the data in the crawldb, the generator creates a fetchlist and places it in a newly created segment directory.
  3. Next, the fetcher gets the content of the URLs on the fetchlist and writes it back to the segment directory. This step is usually the most time-consuming one.
  4. Now the parser processes the content of each web page and, for example, strips out all HTML tags. If the crawl functions as an update or an extension of an already existing one (e.g. a depth of 3), the updater adds the new data to the crawldb as a next step.
  5. Before indexing, all the links need to be inverted, reflecting the fact that it is not the number of outgoing links of a web page that is of interest, but rather the number of inbound links. This is quite similar to how Google PageRank works and is important for the scoring function. The inverted links are saved in the linkdb.
  6. and 7. Using data from all possible sources (crawldb, linkdb and segments), the indexer creates an index and saves it in the Solr directory. For indexing, the popular Lucene library is used. The user can then search for information regarding the crawled web pages via Solr.
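The steps above map directly onto the individual Nutch command-line tools. As a minimal sketch of one crawl cycle (directory names and the Solr URL are assumptions for illustration, and the exact arguments of some commands vary between Nutch 1.x versions):

```shell
# Step 1: inject the seed URLs (a directory containing a URL list) into the crawldb
bin/nutch inject crawl/crawldb urls

# Step 2: generate a fetchlist inside a newly created segment directory
bin/nutch generate crawl/crawldb crawl/segments
SEGMENT=crawl/segments/$(ls -t crawl/segments | head -1)

# Step 3: fetch the content of the URLs on the fetchlist (usually the slow part)
bin/nutch fetch "$SEGMENT"

# Step 4: parse the fetched content, then update the crawldb with the new data
bin/nutch parse "$SEGMENT"
bin/nutch updatedb crawl/crawldb "$SEGMENT"

# Step 5: invert the links and store them in the linkdb
bin/nutch invertlinks crawl/linkdb -dir crawl/segments

# Steps 6 and 7: index crawldb, linkdb and segments into Solr
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb "$SEGMENT"
```

Running this loop repeatedly (generate → fetch → parse → updatedb) is what a crawl of depth n amounts to.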

In addition, filters, normalizers and plugins allow Nutch to be highly modular, flexible and very customizable throughout the whole process, as the picture above also points out. I hope this post helped you gain a basic understanding of the inner workings of Nutch.

9 comments on Nutch – How It Works

  1. Good, but could you give some details about the output from Nutch? What does the output look like?

  2. Hi Sanjay, I don’t really know what you mean by “output”. There are some logs and, as explained, some databases that store the parsed content (= output?), and all of that data is accessible via the file system. If you mean the final output you get from the Solr interface, then have a look at the very last “code box” in my other article “Nutch – Plugin Tutorial”. The structure of the output is defined in ./conf/schema.xml.
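    To give an idea of what that means: ./conf/schema.xml declares the fields Solr stores and returns per crawled page. A shortened sketch of such a declaration (field names and types as found in a typical Nutch schema; your version may differ):

    ```
    <!-- Excerpt of a typical Nutch ./conf/schema.xml (illustrative, not complete) -->
    <fields>
      <field name="url"     type="string" stored="true" indexed="true"/>
      <field name="title"   type="text"   stored="true" indexed="true"/>
      <field name="content" type="text"   stored="true" indexed="true"/>
      <field name="tstamp"  type="date"   stored="true" indexed="false"/>
    </fields>
    ```

    Whatever fields are declared (and stored) here are what a Solr query against the crawl will hand back.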

  3. Hi Florian,
    thanks for sharing!
    I’m learning to use Nutch and displaying the results using Ajax and JSP;
    now I have a little search engine for my local system LOL …

  4. Hi Florian,

    is it possible to run nutch without storing the website content and without using an indexer?
    I’d like to use nutch as a crawler (with all advantages like pagerank, updated crawls etc.) and send the content (and some information like the url etc.) as json to kafka. In kafka I want to check the content and if appropriate save it to mongo in my own format. mongo uses ElasticSearch (via River) to index the content.
    I neither need nutch to save the content of the pages nor to index the content.
    Is there a way to do this, or is Nutch the wrong tool for that?

