
How to limit the crawling depth in StormCrawler

My use case is to extract text daily from any given page of a website and from the outlinks present on that page only. For example, I want to crawl all the links present on this page: https://www.indiatimes.com/news/world. That gives me a few fresh news articles every day; there are around 30-40 news article links on this page each day that I want to crawl and store in my database.

These are the configurations I have for now.

Here is the relevant part of crawler-conf.yaml:

  parser.emitOutlinks: true
  parser.emitOutlinks.max.per.page: 0
  track.anchors: true
  metadata.track.path: true
  metadata.track.depth: true

Here is the relevant part of urlfilters.json:

 {
     "class": "com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter",
     "name": "MaxDepthFilter",
     "params": {
         "maxDepth": 0
     }
 }

With these configurations, this example page produces more than 35,000 hits. The crawler fetches the whole website, which I don't need; it keeps discovering more and more URLs from the outlinks. If I change the maxDepth parameter to 0, 1, or 2, the crawl behaviour stays the same. Is the maxDepth parameter right for this use case? What does it actually mean? I want to limit the recursive nature of the crawl to only the seed URL and the outlinks of the seed URL. What should I do to limit the crawl's expansion?

I am using StormCrawler 1.16.

This is exactly what the max depth filter is for. Remember that you need to rebuild your JAR with mvn clean package for any changes to urlfilters.json to take effect.
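For the use case described above (seed URL plus its direct outlinks only), a sketch of the filter entry, under the assumption that seeds are injected at depth 0 so their outlinks sit at depth 1 and deeper links are dropped:

```json
{
    "class": "com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter",
    "name": "MaxDepthFilter",
    "params": {
        "maxDepth": 1
    }
}
```

Note that metadata.track.depth must stay set to true in crawler-conf.yaml, otherwise the depth counter the filter reads is absent from the URL metadata.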

If you don't want any outlinks at all when parsing a page, simply set parser.emitOutlinks to false in the config.
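As a sketch, this change in crawler-conf.yaml turns off outlink emission globally, so only the URLs you inject as seeds are ever fetched:

```yaml
  # emit no outlinks from parsed pages; only injected seeds get crawled
  parser.emitOutlinks: false
```

This is the simpler option if a depth limit is not needed at all; unlike a URL filter change, it is a plain config value, but rebuilding and redeploying the topology is still required for it to take effect.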

