Nutch – How It Works

After installing Nutch as described in my previous post, you can either follow this tutorial without thinking much about what is going on, or get a sense of how Nutch actually works beforehand. I recommend doing both in parallel. And since you won’t find the latter on the Apache Nutch website, let me help you out in this matter. The following illustration depicts the major parts as well as the workflow of a crawl:

Nutch Overview

  1. The injector takes all the URLs from the nutch.txt file and adds them to the crawldb. As a central part of Nutch, the crawldb maintains information on all known URLs (fetch schedule, fetch status, metadata, …).
  2. Based on the data of crawldb, the generator creates a fetchlist and places it in a newly created segment directory.
  3. Next, the fetcher gets the content of the URLs on the fetchlist and writes it back to the segment directory. This step is usually the most time-consuming one.
  4. Now the parser processes the content of each web page and, for example, strips out all HTML tags. If the crawl functions as an update of or an extension to an already existing one (e.g. a crawl with a depth of 3), the updater adds the new data to the crawldb as a next step.
  5. Before indexing, all the links need to be inverted. This takes into account that it is not the number of outgoing links of a web page that is of interest, but rather the number of inbound links, which is quite similar to how Google PageRank works and is important for the scoring function. The inverted links are saved in the linkdb.
  6. and 7. Using data from all possible sources (crawldb, linkdb and segments), the indexer creates an index and saves it within the Solr directory. For indexing, the popular Lucene library is used. Now the user can search for information regarding the crawled web pages via Solr. (The corresponding command-line calls are sketched right below.)
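
To make the workflow more concrete, here is a minimal sketch of one crawl cycle on the Nutch 1.x command line. The crawl/ directory layout, the -topN value and the Solr URL are assumptions for illustration only, so adjust them to your own installation and Nutch version:

```bash
# A minimal sketch of one crawl cycle with the Nutch 1.x command line.
# Assumptions: seed URLs live in urls/nutch.txt, all crawl data goes into
# a "crawl/" directory, and Solr is reachable at http://localhost:8983/solr/.

# 1. Inject the seed URLs into the crawldb
bin/nutch inject crawl/crawldb urls

# 2. Generate a fetchlist in a fresh segment (here: at most 1000 URLs)
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
SEGMENT=crawl/segments/$(ls crawl/segments | tail -1)

# 3. Fetch the content of the URLs on the fetchlist
bin/nutch fetch "$SEGMENT"

# 4. Parse the fetched pages, then update the crawldb with the new data
bin/nutch parse "$SEGMENT"
bin/nutch updatedb crawl/crawldb "$SEGMENT"

# 5. Invert the links and store them in the linkdb
bin/nutch invertlinks crawl/linkdb -dir crawl/segments

# 6./7. Index data from crawldb, linkdb and the segment into Solr
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb \
    -linkdb crawl/linkdb "$SEGMENT"
```

Repeating steps 2 to 4 before inverting the links and indexing is essentially what crawling with a greater depth boils down to.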

In addition, filters, normalizers and plugins allow Nutch to be highly modular, flexible and very customizable throughout the whole process. This aspect is also pointed out in the picture above. I hope this post helped you gain a basic understanding of the inner workings of Nutch.
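
As an example of this customizability, which URLs make it into the fetchlist at all is decided by URL filters such as the urlfilter-regex plugin, and the set of active plugins is controlled by the plugin.includes property. The snippet below is only a sketch; example.org and the conf/ paths are placeholders for your own setup:

```bash
# Only allow URLs below example.org (placeholder domain): the urlfilter-regex
# plugin reads its +/- regex rules from conf/regex-urlfilter.txt.
echo '+^https?://([a-z0-9]*\.)*example\.org/' >> conf/regex-urlfilter.txt

# Inspect which plugins (protocols, parsers, URL filters, indexing filters, …)
# are enabled by default via the plugin.includes property; your own overrides
# belong in conf/nutch-site.xml.
grep -A 3 'plugin.includes' conf/nutch-default.xml
```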


13 Comments

  1. Good, but give some details about the output from Nutch. What would the output look like?

  2. Hi Sanjay, I don’t really know what you mean by “output”. There are some logs and, as explained, some databases that store the parsed content (= output?), and all of that data is accessible via the file system. If you mean the final output which you get from the Solr interface, then have a look at the very last “code box” in my other article “Nutch – Plugin Tutorial”. The structure of that output is defined in ./conf/schema.xml

  3. Hi Florian,
    thanks for sharing …
    I’m learning to use Nutch and showing the results using Ajax and JSP;
    now I have a little search engine for my local system LOL …

  4. Hi Florian,

    Is it possible to run Nutch without storing the website content and without using an indexer?
    I’d like to use Nutch as a crawler (with all its advantages like PageRank, updated crawls etc.) and send the content (and some information like the URL etc.) as JSON to Kafka. In Kafka I want to check the content and, if appropriate, save it to MongoDB in my own format. MongoDB uses Elasticsearch (via a River) to index the content.
    I neither need Nutch to save the content of the pages nor to index the content.
    Is there a way to do this, or is Nutch the wrong tool for that?


  6. Hi, I’m new to Nutch and surprised that the documentation is so thin. This is the first page that explains all the steps to me. Thanks!
    BUT: if I want to use the sitemap command, where does it fit into this? Which steps are done by “sitemap”?
    Again, no good info on the wiki about this…

  7. Hi, I want to dump data into Elasticsearch after Nutch has parsed the contents. I don’t really want indexing; I want structured data that I can put into ES or an RDBMS instead of a generalized index of the whole page. So I want the crawler + fetcher + updater + linker part, but not the parser + Solr + indexer part. In the parser step I want to pick out specific data from the page and build my index from that extracted data. Is that a feasible idea?
