How to read URLs from different files and set a different crawl depth for each?
I want to have two files, seed.txt and seed2.txt, each containing different URLs. For seed.txt I want the crawl depth to be, for example, 2, and for seed2.txt I want it to be 3.
Is there any solution or workaround for this?
"I want to have two files, seed.txt and seed2.txt, each containing different URLs."
Keep the seed file name as it is; do not rename it to seed2 and so on. Instead, create two separate URL directories, each with its own seed file containing a different set of URLs. For example, the folder 'urls1' holds one seed.txt and the folder 'urls2' holds another seed.txt with a different set of URLs. Also make sure each crawl writes its data to a separate crawl directory: for example, 'crawl1' for the seed.txt in 'urls1' and 'crawl2' for the seed.txt in 'urls2'.
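The layout described above could be set up roughly like this. It is only a sketch: the example.com and example.org URLs are placeholders that do not come from the question, and the commands assume they are run from the directory where the crawls will be started.

```shell
#!/bin/sh
# Sketch: one seed directory per URL set, each with its own seed.txt.
# The URLs below are placeholders; replace them with the real seed URLs.
set -eu

mkdir -p urls1 urls2

# URLs intended for the depth-2 crawl
printf '%s\n' 'http://example.com/' > urls1/seed.txt

# URLs intended for the depth-3 crawl
printf '%s\n' 'http://example.org/' > urls2/seed.txt
```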
"For seed.txt I want the crawl depth to be, for example, 2, and for seed2.txt I want it to be 3."
Specify the depth in the crawl command, not in seed.txt. In your case, run the following commands. If both run on the same machine, use separate terminals (provided your Nutch/Hadoop configuration supports running multiple crawl jobs in parallel).
bin/nutch crawl urls1 -dir crawl1 -depth 2
bin/nutch crawl urls2 -dir crawl2 -depth 3
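If running the jobs in parallel is not an option, the same two crawls could also be run one after another from a small wrapper script. The following is a sketch only: `NUTCH_CMD`, `DRY_RUN`, and `run_crawl` are hypothetical names introduced for this example, and by default the script just prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch: run one crawl per seed directory, each with its own depth.
# NUTCH_CMD, DRY_RUN and run_crawl are made-up names for illustration.
set -eu

NUTCH_CMD="${NUTCH_CMD:-bin/nutch}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, anything else = run them

run_crawl() {
    # $1 = seed directory, $2 = crawl output directory, $3 = depth
    if [ "$DRY_RUN" = "1" ]; then
        echo "$NUTCH_CMD crawl $1 -dir $2 -depth $3"
    else
        "$NUTCH_CMD" crawl "$1" -dir "$2" -depth "$3"
    fi
}

run_crawl urls1 crawl1 2
run_crawl urls2 crawl2 3
```

Setting `DRY_RUN=0` makes the script execute the commands from the answer in sequence, so the second crawl starts only after the first has finished.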
Hope this helps!