
Building your own search engine using Nutch

March 10, 2011

I don’t often post technical step-by-step things here, but for the benefit of my Learning bit by bit class I am posting the steps I went through to configure Nutch and Tomcat on my system. I am running Mac OS X 10.5.8 with Java 1.6. A follow-up post may show how to customize Nutch.

—-

Download Tomcat and Nutch

http://tomcat.apache.org/

http://nutch.apache.org/

Get Nutch working

Configure your .bash_profile. You can open it by opening a new Terminal window and typing

$ emacs .bash_profile

add

JRE_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
export JRE_HOME
PATH="/Users/heather/nutch-1.2/bin:/Users/heather/apache-tomcat-7.0.10/bin:${PATH}"
export PATH

to the bottom of your file. (Obviously, check the paths and change them for your own configuration.)
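
The position of the braces in the PATH line matters: the shell expands ${PATH}, while {$PATH} keeps the braces as literal characters, which corrupts the first and last inherited entries. A small sketch with a made-up variable (DEMO_PATH and /opt/tool/bin are placeholders, not part of the setup above) shows the difference:

```shell
# Use a made-up variable so the real PATH is left alone.
DEMO_PATH="/usr/bin:/bin"

right="/opt/tool/bin:${DEMO_PATH}"   # braces inside: normal variable expansion
wrong="/opt/tool/bin:{$DEMO_PATH}"   # braces outside: kept as literal characters

echo "$right"   # /opt/tool/bin:/usr/bin:/bin
echo "$wrong"   # /opt/tool/bin:{/usr/bin:/bin}
```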

Restart Terminal.

Try typing nutch at the prompt; it should print its usage message.
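
If nutch is not found, a quick check along these lines can narrow things down. It is only a sketch: check_on_path is a helper name made up here, and it assumes the two bin directories from the .bash_profile lines above.

```shell
# Hypothetical helper: report whether a command is reachable via PATH.
check_on_path() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
  fi
}

check_on_path nutch        # lives in nutch-1.2/bin
check_on_path startup.sh   # lives in apache-tomcat-7.0.10/bin; only found once it is executable
echo "JRE_HOME=${JRE_HOME:-<not set>}"
```

A "missing" line usually points back at a typo in the PATH entry or at a Terminal window that was opened before .bash_profile was edited.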

Get Tomcat working

In Terminal, cd into the Tomcat bin directory.

Run chmod a+x on the files in the Tomcat bin directory:

heather-dewey-hagborgs-macbook:bin heather$ chmod a+x shutdown.sh
heather-dewey-hagborgs-macbook:bin heather$ chmod a+x startup.sh
heather-dewey-hagborgs-macbook:bin heather$ chmod a+x setclasspath.sh
heather-dewey-hagborgs-macbook:bin heather$ chmod a+x bootstrap.jar
heather-dewey-hagborgs-macbook:bin heather$ chmod a+x tomcat-juli.jar
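
The same can be done in one pass. This sketch assumes the Tomcat path used earlier, and make_scripts_executable is a made-up helper name. It covers every .sh file in the directory, including catalina.sh, which startup.sh and shutdown.sh hand off to. The .jar files are loaded by Java rather than executed directly, so they should not need the execute bit.

```shell
# Hypothetical helper: mark every .sh file in a directory as executable.
make_scripts_executable() {
  dir="$1"
  for f in "$dir"/*.sh; do
    [ -e "$f" ] || continue   # skip the unexpanded pattern when nothing matches
    chmod a+x "$f"
  done
}

make_scripts_executable "/Users/heather/apache-tomcat-7.0.10/bin"
```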

Start Tomcat by running startup.sh, then in your favorite web browser go to http://localhost:8080/

It should show you the Tomcat installed page!

———

CREATING AN INDEX

To configure things for crawling you must:

Create a plain text file named urls. That's it, no extension. Add your root URL to this file, e.g.

http://global.nytimes.com/
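
From the command line, the seed file can be created in one step. This is just a sketch of the step above, run from the directory where you will start the crawl:

```shell
# Write the root URL, one per line, to a file named urls (no extension).
printf '%s\n' 'http://global.nytimes.com/' > urls

# Show the result.
cat urls
```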

Edit the file <nutch>/conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, to limit the crawl to the nytimes.com domain, the line should read:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*nytimes.com

This will include any URL in the nytimes.com domain.
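
To see which URLs the rule lets through, the expression can be tried outside Nutch. The sketch below uses grep -E as a stand-in for Nutch's own regex matching (an assumption that holds for a pattern this simple); check_url is a made-up helper, and the test URLs other than the seed are invented for illustration.

```shell
# The filter expression without Nutch's leading "+" (which marks an accept rule).
pattern='^http://([a-z0-9]*\.)*nytimes.com'

# Hypothetical helper: print "accept" or "reject" for one URL.
check_url() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo "accept"
  else
    echo "reject"
  fi
}

check_url 'http://global.nytimes.com/'            # accept
check_url 'http://www.nytimes.com/pages/world/'   # accept
check_url 'http://nytimes.org/'                   # reject
check_url 'https://www.nytimes.com/'              # reject: the rule only covers http://
check_url 'http://nytimesxcom/'                   # accept: the unescaped "." matches any character
```

Writing the domain as nytimes\.com would tighten the last case.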

Edit conf/nutch-default.xml to set a user agent name in the http.agent.name property:

<name>http.agent.name</name>  <value>HDH</value>

RUN

This will run Nutch on the urls file, save the crawl in the nutch-1.2/crawl directory, crawl to a depth of 3, and fetch at most 25 pages at each level.

$ nutch crawl urls -dir nutch-1.2/crawl -depth 3 -topN 25
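
As a rough worked example, assuming -topN caps the number of pages fetched in each round and -depth is the number of rounds, the crawl has a small upper bound on pages fetched. The first round can only fetch the seed URLs, since nothing else is known yet. max_fetched is a made-up helper for the arithmetic:

```shell
# Hypothetical helper: upper bound on pages fetched by one crawl.
# $1 = number of seed URLs, $2 = depth, $3 = topN
max_fetched() {
  seeds="$1"
  depth="$2"
  top_n="$3"
  first="$seeds"
  # The first round is limited by topN as well as by the number of seeds.
  if [ "$first" -gt "$top_n" ]; then
    first="$top_n"
  fi
  echo $(( first + (depth - 1) * top_n ))
}

max_fetched 1 3 25   # 51: one seed, then two rounds of up to 25 pages
```

The real number is often lower, because pages that fail to download or are filtered out still count against the limit for their round.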

VIEW STATS

$ nutch readdb nutch-1.2/crawl/crawldb -stats

SEARCHING

Test the search on the command line:

$ nutch org.apache.nutch.searcher.NutchBean <search term>

You may need to add the following property to nutch/conf/nutch-site.xml:

<property>
<name>searcher.dir</name>
<value>/Users/heather/nutch-1.2/crawl</value>
</property>

Searching through tomcat:

When you unzip your Nutch installation, you should find the nutch.war file. Place this war file in your Tomcat webapps directory and rename it ROOT.war.
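
The copy-and-rename step might look like the sketch below. deploy_as_root and the ROOT.default backup name are made up here, and moving Tomcat's bundled ROOT folder out of the way is an assumption about what is needed for the new war file to take over the root context.

```shell
# Hypothetical helper: install a war file as Tomcat's ROOT web application.
# $1 = path to the war file, $2 = Tomcat webapps directory
deploy_as_root() {
  war="$1"
  webapps="$2"
  # Assumption: the ROOT folder that ships with Tomcat would otherwise keep
  # serving the default page, so park it next to webapps (not inside it).
  if [ -d "$webapps/ROOT" ] && [ ! -e "$webapps/../ROOT.default" ]; then
    mv "$webapps/ROOT" "$webapps/../ROOT.default"
  fi
  cp "$war" "$webapps/ROOT.war"
}

# Example call; both paths are assumptions based on the layout in this post:
# deploy_as_root /Users/heather/nutch-1.2/nutch.war /Users/heather/apache-tomcat-7.0.10/webapps
```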

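Once Tomcat has unpacked the war file (start Tomcat once if the ROOT directory has not been created yet), edit /tomcat_dir/webapps/ROOT/WEB-INF/classes/nutch-site.xml so that searcher.dir points at your crawl directory: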

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>searcher.dir</name>
<value>/Users/heather/nutch-1.2/crawl</value>
</property>
</configuration>
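
The file can also be generated from the shell, which avoids typos in the XML. This is a sketch: write_nutch_site is a made-up helper, it overwrites the target file, and the crawl directory you pass should be the one given to -dir in the crawl command.

```shell
# Hypothetical helper: write a minimal nutch-site.xml that sets searcher.dir.
# $1 = crawl directory, $2 = file to write (overwritten if it exists)
write_nutch_site() {
  crawl_dir="$1"
  out_file="$2"
  cat > "$out_file" <<EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>${crawl_dir}</value>
  </property>
</configuration>
EOF
}

# Example call (tomcat_dir is the same placeholder as in the path above):
# write_nutch_site /Users/heather/nutch-1.2/crawl /tomcat_dir/webapps/ROOT/WEB-INF/classes/nutch-site.xml
```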

Start or restart Tomcat:

$ startup.sh

Now go to

http://localhost:8080/

and search!!

Links:

Nutch + OSX: http://wiki.apache.org/nutch/GettingNutchRunningWithMacOsx

Wiki: http://wiki.apache.org/nutch

Maintenance: http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine

http://nutch.sourceforge.net/docs/en/tutorial.html

http://www.cnblogs.com/galaxyprince/archive/2010/04/04/1704095.html

3 Comments
  1. Yaman
    January 25, 2013 4:46 am

    Its not easy on mac and ubuntu as you mentioned above. Is it possible for you to write a step by step document for setting up Nutch 2.x / Hbase/ Solr.

    • January 25, 2013 6:57 pm

      Yaman,
      I wrote this guide years ago, if it doesn’t work any longer I’m sorry to hear it but I won’t be updating unless I need to use the technologies again myself.

  2. May 13, 2013 6:05 pm

    Wow, this article is pleasant, my younger sister
    is analyzing such things, therefore I am going to let know her.
