Chapter 10. Nutch

2018-01-03

Introduction:

http://lucene.apache.org/nutch/

How to Setup Nutch and Hadoop

http://wiki.apache.org/nutch/NutchHadoopTutorial

Download

$ cd /usr/local/src/
$ wget http://apache.etoak.com/lucene/nutch/nutch-1.0.tar.gz
$ tar zxvf nutch-1.0.tar.gz
$ sudo cp -r nutch-1.0 ..
$ cd ..
$ sudo ln -s nutch-1.0 apache-nutch

Create the file myurl

$ cd apache-nutch
$ mkdir urls
$ vim urls/myurl
http://netkiller.8800.org/
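
The seed file can also be written without opening an editor. The following is a minimal sketch, assuming it is run from the Nutch directory; `NUTCH_HOME` is a hypothetical variable introduced only for this example and defaults to the current directory.

```shell
#!/bin/sh
# Sketch: create the seed list non-interactively.
# NUTCH_HOME is a hypothetical variable; it defaults to the current directory.
NUTCH_HOME="${NUTCH_HOME:-.}"

mkdir -p "$NUTCH_HOME/urls"

# One seed URL per line; this overwrites any existing myurl file.
printf '%s\n' 'http://netkiller.8800.org/' > "$NUTCH_HOME/urls/myurl"

# Show the result for a quick visual check.
cat "$NUTCH_HOME/urls/myurl"
```

Further seed URLs can be appended to the same file, one per line, with `>>` instead of `>`.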

Configuration file crawl-urlfilter.txt

Edit conf/crawl-urlfilter.txt and replace the MY.DOMAIN.NAME part with the domain you want to crawl.

$ cp conf/crawl-urlfilter.txt conf/crawl-urlfilter.txt.old
$ vim conf/crawl-urlfilter.txt

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
Change it to:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*netkiller.8800.org/
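
The same edit can be scripted with sed. This is only a sketch: `set_crawl_host` is a hypothetical helper name, and it assumes the host name contains no `/` or `&` characters, which would need escaping in the sed replacement.

```shell
#!/bin/sh
# set_crawl_host FILE HOST  (hypothetical helper, not part of Nutch)
# Backs up FILE to FILE.old, then replaces the MY.DOMAIN.NAME placeholder
# on filter rule lines (those starting with "+") with HOST.
# Comment lines are left untouched.
set_crawl_host() {
  filter_file=$1
  host=$2
  cp "$filter_file" "$filter_file.old" || return 1
  # The dots in the search pattern are escaped so they match literally.
  sed "/^+/s/MY\.DOMAIN\.NAME/$host/" "$filter_file.old" > "$filter_file"
}

# Apply it only when the Nutch filter file is present in the current directory.
if [ -f conf/crawl-urlfilter.txt ]; then
  set_crawl_host conf/crawl-urlfilter.txt netkiller.8800.org
  grep '^+' conf/crawl-urlfilter.txt
fi
```

Reading from the backup and writing to the original avoids relying on `sed -i`, whose syntax differs between GNU and BSD sed.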

http.agent.name

			
$ vim conf/nutch-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1) Gecko/20090624 Firefox/3.5</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.

  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://netkiller.8800.org/robot.html</value>
  <description>A URL to advertise in the User-Agent header.  This will
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>openunix@163.com</value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

</configuration>
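
Because the property description says http.agent.name must not be empty, a quick check before crawling can save a failed run. The sketch below is an assumption-laden convenience, not a Nutch tool: `agent_name` is a hypothetical helper, and it only works when the `<value>` element sits on the line directly after the `<name>` element, as in the file above.

```shell
#!/bin/sh
# agent_name FILE  (hypothetical helper, not part of Nutch)
# Prints the value of http.agent.name from a nutch-site.xml laid out as above,
# i.e. with <value>...</value> on the line right after the <name> line.
agent_name() {
  sed -n '/<name>http\.agent\.name<\/name>/{n;s/.*<value>\(.*\)<\/value>.*/\1/p;}' "$1"
}

# Warn when the property is missing or empty.
if [ -f conf/nutch-site.xml ]; then
  if [ -z "$(agent_name conf/nutch-site.xml)" ]; then
    echo "http.agent.name is empty in conf/nutch-site.xml" >&2
  fi
fi
```

A real XML parser would be more robust; the sed approach is used here only to stay within the shell.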

Run the following command to start crawling

$ bin/nutch crawl urls -dir crawl -depth 3 -threads 5

			
bin/nutch crawl <your_url> -dir <your_dir> -depth 2 -threads 4 >&logs/logs1.log

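urls              Directory containing the seed URL files to crawl, for example /usr/local/apache-nutch/urls.
-dir dir          Directory in which the crawl data is stored.
-depth depth      Link depth to crawl, counted from the seed pages.
-delay delay      Delay between requests to a host, in seconds.
-threads threads  Number of fetcher threads to start.
-topN N           Maximum number of pages fetched at each depth level.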


$ nohup bin/nutch crawl /usr/local/apache-nutch/urls -dir /usr/local/apache-nutch/crawl -depth 5 -threads 50 -topN 50 > /tmp/nutch.log &
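
For repeated runs it can help to keep the options in one place. The wrapper below is a sketch under stated assumptions: `run_crawl` and the variables `NUTCH_HOME`, `DEPTH`, `THREADS`, `TOPN`, `LOG` and `RUN` are hypothetical names, with defaults copied from the command above. By default it only prints the command; setting `RUN=1` starts the crawl in the background.

```shell
#!/bin/sh
# Hypothetical settings; defaults mirror the nohup command shown above.
NUTCH_HOME="${NUTCH_HOME:-/usr/local/apache-nutch}"
DEPTH="${DEPTH:-5}"
THREADS="${THREADS:-50}"
TOPN="${TOPN:-50}"
LOG="${LOG:-/tmp/nutch.log}"

# run_crawl  (hypothetical helper, not part of Nutch)
# Builds the crawl command line from the settings above.
# RUN=1 starts it in the background; otherwise it is only printed (dry run).
run_crawl() {
  set -- "$NUTCH_HOME/bin/nutch" crawl "$NUTCH_HOME/urls" \
    -dir "$NUTCH_HOME/crawl" -depth "$DEPTH" -threads "$THREADS" -topN "$TOPN"
  if [ "${RUN:-0}" = 1 ]; then
    # Capture stderr as well, so errors end up in the log file.
    nohup "$@" > "$LOG" 2>&1 &
  else
    printf '%s\n' "$*"
  fi
}

run_crawl
```

Progress can then be followed with `tail -f /tmp/nutch.log`.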

deploy

			
$ cd /usr/local/apache-tomcat/conf/Catalina/localhost
$ vim nutch.xml
<Context docBase="/usr/local/apache-nutch/nutch-1.0.war" debug="0" crossContext="true" >
</Context>

searcher.dir

			
$ vim /usr/local/apache-tomcat/webapps/nutch/WEB-INF/classes/nutch-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>searcher.dir</name>
        <value>/usr/local/apache-nutch/crawl</value>
    </property>
</configuration>
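
If the search page returns no hits, one thing worth checking is whether searcher.dir really points at a finished crawl. The sketch below assumes that a completed crawl directory contains the subdirectories crawldb, linkdb, segments and index; that layout is an assumption based on a typical Nutch 1.0 crawl, and `check_crawl_dir` is a hypothetical helper.

```shell
#!/bin/sh
# check_crawl_dir DIR  (hypothetical helper, not part of Nutch)
# Reports which of the assumed crawl subdirectories are missing under DIR.
# Returns 0 when all are present, 1 otherwise.
check_crawl_dir() {
  dir=$1
  missing=0
  for sub in crawldb linkdb segments index; do
    if [ ! -d "$dir/$sub" ]; then
      echo "missing: $dir/$sub"
      missing=1
    fi
  done
  return "$missing"
}

# Check the directory configured as searcher.dir above.
check_crawl_dir /usr/local/apache-nutch/crawl ||
  echo "searcher.dir does not look like a finished crawl"
```

Tomcat typically needs a restart after nutch-site.xml is changed for the new searcher.dir to take effect.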

test

http://172.16.0.1:8080/nutch/

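Source: Netkiller series notes (Netkiller 系列手札)
Author: 陈景峯
To reprint this article, please contact the author, and be sure to credit the original source and author and include this notice.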