
How to read URLs from different files and set a different depth for crawling?

I want to have two files, seed.txt and seed2.txt, each containing different URLs. For seed.txt I want the crawl depth to be, for example, 2, and for seed2.txt I want it to be 3.
Is there any solution or workaround to do this?

I want to have two files seed.txt and seed2.txt and in each file to have different urls

Keep the seed file name as it is; do not rename it to seed2 and so on. Instead, you can create two separate URL directories, each containing a seed file with a different set of URLs. For example, the folder 'urls1' will have one seed.txt and the folder 'urls2' will have another seed.txt with a different set of URLs. Also make sure to use separate crawl directories for the crawl data (for example, a 'crawl1' directory for the seed.txt in 'urls1' and a 'crawl2' directory for the seed.txt in 'urls2').
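As a rough sketch, the layout described above could be set up like this. The example.com and example.org URLs are placeholders, not taken from the question; substitute your own seed URLs.

```shell
# Create one URL directory per seed list (names follow the answer above).
mkdir -p urls1 urls2

# Each directory gets its own seed.txt; the URLs here are placeholders.
printf '%s\n' 'http://example.com/' > urls1/seed.txt
printf '%s\n' 'http://example.org/' > urls2/seed.txt
```

The 'crawl1' and 'crawl2' output directories are then named with the -dir option when each crawl is started.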

In seed.txt the depth for crawling i want to be for ex. 2 and in seed2.txt the depth to be 3.

You should specify the depth value in your crawl command, not in seed.txt. In your case, run the following commands, in separate terminals if running on the same machine (provided your Nutch/Hadoop configuration supports running multiple crawl jobs in parallel):

  • bin/nutch crawl urls1 -dir crawl1 -depth 2

  • bin/nutch crawl urls2 -dir crawl2 -depth 3
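If you are not sure your setup can run two crawl jobs in parallel, one alternative is to run them one after another from a small script. This is only a sketch: the NUTCH variable and both function names are hypothetical helpers introduced for illustration, and it assumes a Nutch 1.x release in which the crawl command used above is still available.

```shell
# Path to the nutch launcher; NUTCH is a hypothetical override variable.
NUTCH="${NUTCH:-bin/nutch}"

# Run one crawl: $1 = seed directory, $2 = crawl output directory, $3 = depth.
crawl_seed_dir() {
  "$NUTCH" crawl "$1" -dir "$2" -depth "$3"
}

# Run both crawls sequentially (not in parallel); stop if the first one fails.
run_all_crawls() {
  crawl_seed_dir urls1 crawl1 2 &&
  crawl_seed_dir urls2 crawl2 3
}
```

Calling run_all_crawls from the Nutch installation directory would then start the depth-2 crawl for 'urls1', followed by the depth-3 crawl for 'urls2'.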

Hope this helped!

