After the installation of Nutch as described in my previous post, you can either follow this tutorial without the need of thinking, or get a sense of how Nutch actually works beforehand. I recommend doing both in parallel. And since you won’t find the latter on the Apache Nutch Website, let me help you out in this matter. The following illustration depicts the major parts as well as the workflow of a crawl:
- The injector takes all the URLs of the nutch.txt file and adds them to the crawldb. As a central part of Nutch, the crawldb maintains information on all known URLs (fetch schedule, fetch status, metadata, …).
- Based on the data of crawldb, the generator creates a fetchlist and places it in a newly created segment directory.
- Next, the fetcher gets the content of the URLs on the fetchlist and writes it back to the segment directory. This step usually is the most time-consuming one.
- Now the parser processes the content of each web page and for example omits all html tags. If the crawl functions as an update or an extension to an already existing one (e.g. depth of 3), the updater would add the new data to the crawldb as a next step.
- Before indexing, all the links need to be inverted, which takes into account that not the number of outgoing links of a web page is of interest, but rather the number of inbound links. This is quite similar to how Google PageRank works and is important for the scoring function. The inverted links are saved in the linkdb.
- and 7. Using data from all possible sources (crawldb, linkdb and segments), the indexer creates an index and saves it within the Solr directory. For indexing, the popular Lucene library is used. Now, the user can search for information regarding the crawled web pages via Solr.
In addition, filters, normalizers and plugins allow Nutch to be highly modular, flexible and very customizable throughout the whole process. This aspect is also pointed out in the picture above. I hope, this post helped you getting a basic understanding of the inner workings of Nutch.