
Crawling the Deep Web with Nutch 2.3.1

I did the following:

  1. Followed the Nutch Wiki tutorial "SetupNutchAndTor" ( https://wiki.apache.org/nutch/SetupNutchAndTor )

  2. Set up nutch-site.xml:

      <property>
        <name>http.proxy.host</name>
        <value>127.0.0.1</value>
        <description>The proxy hostname. If empty, no proxy is used.</description>
      </property>
      <property>
        <name>http.proxy.port</name>
        <value>8118</value>
        <description>The proxy port.</description>
      </property>
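Before digging into Nutch itself, it may be worth ruling out the simplest failure: nothing listening on the configured proxy address. The following is a minimal Python sketch, not part of the tutorial. It assumes the HTTP proxy runs on the same machine as the crawler (8118 is Privoxy's default listen port), and the function name `proxy_port_is_open` is made up for illustration.

```python
import socket


def proxy_port_is_open(host: str = "127.0.0.1", port: int = 8118,
                       timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    The defaults mirror http.proxy.host / http.proxy.port from nutch-site.xml.
    This only shows that something accepts connections on the port; it does
    not prove that the listener is a working HTTP proxy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

Calling `proxy_port_is_open()` on the crawl machine and getting `False` would suggest the proxy is not running or is bound to a different port, in which case Nutch cannot reach any .onion host regardless of its own settings.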

However, nothing is crawled from the .onion link and nothing is indexed into Solr. Does anyone know what the problem is?

Anything in the logs?

FYI, with StormCrawler you can use a SOCKS proxy directly, thanks to this commit.

You'd need to use OkHttp for the protocol implementation and configure it like this:

      http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
      https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"

      http.proxy.host: localhost
      http.proxy.port: 9050
      http.proxy.type: "SOCKS"
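A SOCKS setting like the one above only helps if the port really speaks SOCKS (9050 is Tor's default SOCKS port). As a rough check, the Python sketch below sends the SOCKS5 greeting defined in RFC 1928 and looks for a reply that accepts "no authentication". It is an illustration under those assumptions, not StormCrawler code, and the function name `speaks_socks5` is hypothetical.

```python
import socket


def speaks_socks5(host: str = "localhost", port: int = 9050,
                  timeout: float = 3.0) -> bool:
    """Return True if host:port answers a SOCKS5 greeting with 'no auth'.

    Greeting (RFC 1928): version 0x05, one method offered, method 0x00.
    Expected reply: version 0x05, selected method 0x00.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(b"\x05\x01\x00")
            reply = b""
            # recv may return fewer bytes than requested, so read until
            # two bytes have arrived or the peer closes the connection.
            while len(reply) < 2:
                chunk = sock.recv(2 - len(reply))
                if not chunk:
                    break
                reply += chunk
    except OSError:
        # Connection refused, timeout, or reset.
        return False
    return reply == b"\x05\x00"
```

If `speaks_socks5()` returns `False` while the port is open, the listener is probably an HTTP proxy rather than a SOCKS one, and `http.proxy.type` would not match it.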


 