
Building your own search engine using nutch

March 10, 2011

I don’t often post technical step-by-step things here, but for the benefit of my Learning bit by bit class I am posting the steps I went through to configure Nutch and Tomcat on my system. I am running Mac OS 10.5.8 with Java 1.6. A follow-up post may show how to customize Nutch.


Download Tomcat and nutch

Get Nutch working

Configure your .bash_profile. You can open it by opening a new terminal window and typing

$ emacs .bash_profile


export JRE_HOME
export PATH

Add these lines to the bottom of your file. (Obviously, check and change them for your own configuration.)
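As a rough sketch of what such lines might look like: both paths below are placeholders (a Nutch 1.2 unpacked in your home directory, and what I believe is the usual location of Apple's Java 1.6 on 10.5), and NUTCH_HOME is just a convenience variable invented for this example, so swap in wherever things actually live on your machine.

```shell
# Hypothetical locations: adjust both paths to match your own machine.
# NUTCH_HOME is only a convenience variable for this sketch.
export NUTCH_HOME="$HOME/nutch-1.2"
# Assumed location of Apple's Java 1.6 on Mac OS X 10.5.
export JRE_HOME="/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home"
# Put the nutch script (in Nutch's bin directory) on the PATH.
export PATH="$PATH:$NUTCH_HOME/bin"
```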

Restart Terminal.

Try typing nutch at the prompt; it should show the usage message.

Get tomcat working

In terminal, cd into the tomcat bin directory

Run chmod a+x on the files in the Tomcat bin directory:

heather-dewey-hagborgs-macbook:bin heather$ chmod a+x
heather-dewey-hagborgs-macbook:bin heather$ chmod a+x
heather-dewey-hagborgs-macbook:bin heather$ chmod a+x
heather-dewey-hagborgs-macbook:bin heather$ chmod a+x bootstrap.jar
heather-dewey-hagborgs-macbook:bin heather$ chmod a+x tomcat-juli.jar
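If you would rather not chmod the scripts one at a time, something like the following should do them in one go. This is only a sketch: make_scripts_executable is a helper name made up for the example, and it only touches files ending in .sh.

```shell
# Mark every shell script in a directory as executable.
# $1 is the directory, e.g. your Tomcat bin directory.
make_scripts_executable() {
  chmod a+x "$1"/*.sh
}
```

For example, run make_scripts_executable /path/to/tomcat/bin (a placeholder path); after that, ./startup.sh in that directory starts Tomcat.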

In your favorite web browser, go to http://localhost:8080/

It should show you the Tomcat installed page!



To configure things for crawling you must:

Create a plain text file named urls. That's it, no extension. Add your root URL to this file.
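One quick way to make that file from the terminal, as a sketch. The address http://www.example.com/ is a placeholder, not a site from this walkthrough; put in the site you actually want to crawl.

```shell
# Write one root URL into a plain text file named urls (no extension).
# http://www.example.com/ is a placeholder; use the site you want to crawl.
printf '%s\n' 'http://www.example.com/' > urls
# Show the file so you can check what went in.
cat urls
```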

Edit the file <nutch>/conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the domain, the line should read:
# accept hosts in MY.DOMAIN.NAME

This will include any URL in the domain
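If you prefer to make that replacement from the terminal rather than in an editor, here is a minimal sketch. It assumes the file still contains the MY.DOMAIN.NAME placeholder; set_crawl_domain is a made-up helper name and example.org is a stand-in domain.

```shell
# Replace the MY.DOMAIN.NAME placeholder in a URL filter file.
# $1 = path to crawl-urlfilter.txt, $2 = domain to crawl (must not contain "/").
set_crawl_domain() {
  # Write to a temporary file and move it back, which behaves the same
  # with both the BSD sed on the Mac and GNU sed.
  sed "s/MY\.DOMAIN\.NAME/$2/g" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}
```

For example: set_crawl_domain nutch-1.2/conf/crawl-urlfilter.txt example.org. Note that it also rewrites the placeholder in the comment line, which is harmless.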

Edit conf/nutch-default.xml to add a user agent

<name></name>  <value>HDH</value>
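For reference, a sketch of what a complete property block for this could look like. It assumes the property in question is http.agent.name, which is the setting Nutch reads for its crawler's agent name; agent_property is a helper name invented for the example, and it only prints the XML for you to paste into the config file.

```shell
# Print a Nutch <property> block that sets the crawler's agent name.
# $1 = the agent name you want to use.
# Assumption: http.agent.name is the property being edited here.
agent_property() {
  cat <<EOF
<property>
  <name>http.agent.name</name>
  <value>$1</value>
</property>
EOF
}

agent_property HDH
```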


The following command runs Nutch on the urls file, saves the crawl in the nutch-1.2/crawl directory, crawls to a depth of 3, and fetches at most the top 25 pages at each level.

$ nutch crawl urls -dir nutch-1.2/crawl -depth 3 -topN 25


$ nutch readdb nutch-1.2/crawl/crawldb -stats
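Before searching, it can help to confirm the crawl actually left something behind. The sketch below only checks for a few sub-directories; the names crawldb, linkdb and segments are my assumption about what a Nutch 1.x crawl writes, and check_crawl_dir is a made-up helper name.

```shell
# Report which of the expected sub-directories exist under a crawl directory.
# $1 = the crawl directory, e.g. nutch-1.2/crawl.
# The names checked are an assumption about Nutch 1.x crawl output.
check_crawl_dir() {
  missing=0
  for sub in crawldb linkdb segments; do
    if [ -d "$1/$sub" ]; then
      echo "found   $1/$sub"
    else
      echo "missing $1/$sub"
      missing=1
    fi
  done
  # Exit status 0 only when every directory was found.
  return "$missing"
}
```

For example: check_crawl_dir nutch-1.2/crawl.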


Test search on the command line:

$ nutch org.apache.nutch.searcher.NutchBean <search term>

nutch/conf/nutch-site.xml may need


added to its config

Searching through tomcat:

When you unzip your Nutch installation, you should find the nutch.war file. Place this war file in your Tomcat webapps directory and rename it ROOT.war.
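The copy and rename can be done in one step from the terminal. A minimal sketch: deploy_as_root is a helper name invented here, and moving any existing ROOT directory out of webapps first (to a ROOT.bak next to it) is my own precaution, not something Tomcat requires you to do this way.

```shell
# Copy a war file into a Tomcat webapps directory as ROOT.war.
# $1 = path to the war file, $2 = Tomcat webapps directory.
deploy_as_root() {
  # Move any existing ROOT webapp out of webapps so the new war replaces it.
  # The backup goes next to webapps, not inside it, so Tomcat will not
  # try to serve it as another webapp.
  if [ -d "$2/ROOT" ]; then
    mv "$2/ROOT" "$(dirname "$2")/ROOT.bak"
  fi
  cp "$1" "$2/ROOT.war"
}
```

For example: deploy_as_root nutch-1.2/nutch.war /path/to/tomcat/webapps (both paths are placeholders).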

Change /tomcat_dir/webapps/ROOT/WEB-INF/classes/nutch-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
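A minimal sketch of a complete file of this kind, assuming the override you need is searcher.dir (the property the Nutch 1.x search webapp uses to locate the crawl directory). write_nutch_site is a made-up helper name, and the crawl path you pass in is whatever directory your own crawl was saved to.

```shell
# Write a nutch-site.xml that points the search webapp at a crawl directory.
# $1 = full path to the crawl directory, $2 = file to write.
# Assumption: searcher.dir is the property that needs overriding.
write_nutch_site() {
  cat > "$2" <<EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>$1</value>
  </property>
</configuration>
EOF
}
```

For example: write_nutch_site /path/to/nutch-1.2/crawl /tomcat_dir/webapps/ROOT/WEB-INF/classes/nutch-site.xml, where the first path is a placeholder for your own crawl directory.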

Start or restart Tomcat.


Now go to


and search!!


Nutch + OSX:



Comments
  1. Yaman
    January 25, 2013 4:46 am

    Its not easy on mac and ubuntu as you mentioned above. Is it possible for you to write a step by step document for setting up Nutch 2.x / Hbase/ Solr.

    • January 25, 2013 6:57 pm

      I wrote this guide years ago, if it doesn’t work any longer I’m sorry to hear it but I won’t be updating unless I need to use the technologies again myself.

