Crawling the Deep Web with Nutch 2.3.1

Question

I followed the tutorial from

Nutch Wiki "SetupNutchAndTor" (https://wiki.apache.org/nutch/SetupNutchAndTor)

and set up nutch-site.xml like this:

  <property>
    <name>http.proxy.host</name>
    <value>127.0.0.1</value>
    <description>The proxy hostname. If empty, no proxy is used.</description>
  </property>
  <property>
    <name>http.proxy.port</name>
    <value>8118</value>
    <description>The proxy port.</description>
  </property>

but it still crawls nothing from the .onion link, and nothing is indexed into Solr. Does anyone know what the problem is?
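One thing worth ruling out before digging further is whether anything is actually listening on the proxy address Nutch was given. The sketch below is only a diagnostic aid, not part of Nutch: it reads the two http.proxy.* properties from an inline copy of the config (the NUTCH_SITE string and both function names are made up for illustration) and tries a plain TCP connect. It assumes the properties sit inside the usual <configuration> root element, and that port 8118 is an HTTP proxy such as Privoxy forwarding to Tor, as the tutorial appears to set up.

```python
import socket
import xml.etree.ElementTree as ET

# Hypothetical copy of the relevant part of nutch-site.xml. A real file keeps
# its <property> elements inside a single <configuration> root like this.
NUTCH_SITE = """
<configuration>
  <property><name>http.proxy.host</name><value>127.0.0.1</value></property>
  <property><name>http.proxy.port</name><value>8118</value></property>
</configuration>
"""


def read_proxy(xml_text):
    """Return (host, port) taken from the http.proxy.* properties."""
    props = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        props[prop.findtext("name")] = (prop.findtext("value") or "").strip()
    host = props.get("http.proxy.host", "")
    port = int(props.get("http.proxy.port") or 0)
    return host, port


def is_listening(host, port, timeout=3.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    host, port = read_proxy(NUTCH_SITE)
    print(f"{host}:{port} listening: {is_listening(host, port)}")
```

If this reports that nothing is listening, the crawl cannot reach any .onion host no matter how Nutch is configured. If it reports a listener, the next place to look is the fetcher output in the Nutch logs.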

Answer 1

Anything in the logs?

FYI, with StormCrawler you can use a SOCKS proxy directly thanks to this commit.

You'd need to use OKHTTP for the protocol implementation and configure it like this:

http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"

http.proxy.host: localhost
http.proxy.port: 9050
http.proxy.type: "SOCKS"
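Before pointing a crawler at port 9050 with the SOCKS proxy type, it may help to confirm that the port really speaks SOCKS5. The sketch below sends the standard SOCKS5 greeting (version 5, one method offered, "no authentication") and checks for the matching two-byte reply. The function name speaks_socks5 is made up for illustration, and the host and port are simply the values from the configuration above. An HTTP proxy port, such as the 8118 used in the question, should not answer this way.

```python
import socket


def speaks_socks5(host, port, timeout=5.0):
    """Send a SOCKS5 greeting and report whether the reply looks like SOCKS5."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            # Greeting: version 5, one auth method offered, 0x00 = no authentication.
            conn.sendall(b"\x05\x01\x00")
            reply = conn.recv(2)
    except OSError:
        return False
    # A SOCKS5 server answers with version 5 and the method it selected.
    return reply == b"\x05\x00"


if __name__ == "__main__":
    # 9050 is the port from the configuration above; adjust if Tor listens elsewhere.
    print("SOCKS5 on localhost:9050:", speaks_socks5("localhost", 9050))
```

This only checks the first step of the handshake. It does not prove that .onion names resolve through the proxy, only that the proxy type and port in the config match what is actually running.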

Answer 1 (2018-02-09 18:20:12)
