
How to limit the crawling depth in StormCrawler

My use case is to extract text from any page of a website and from the outlinks present on that page only, on a daily basis. For example, I want to crawl all the links present on https://www.indiatimes.com/news/world. This page gives me a few fresh news articles every day; there are around 30-40 news article links on it each day that I want to crawl and store in my database.

This is the configuration I have for now.

Here is the relevant part of crawler-conf.yaml:

  parser.emitOutlinks: true
  parser.emitOutlinks.max.per.page: 0
  track.anchors: true
  metadata.track.path: true
  metadata.track.depth: true

Here is the relevant part of urlfilters.json:

 {
     "class": "com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter",
     "name": "MaxDepthFilter",
     "params": {
         "maxDepth": 0
     }
 }

With these configurations, this example page results in more than 35,000 hits. The crawler fetches the whole website, which I don't need, and keeps discovering more and more URLs from outlinks. If I change the maxDepth parameter to 0, 1, or 2, the behaviour of the crawl stays the same. Is the maxDepth parameter the right one for this use case? I want to limit the recursive nature of the crawl to only the seed URL and the outlinks of the seed URL. What does the maxDepth parameter actually mean, and what should I do to limit the crawl's expansion?

I am using StormCrawler 1.16.

This is exactly what the MaxDepthFilter is for. Remember that you need to rebuild your JAR with mvn clean package for any changes to urlfilters.json to take effect.
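For reference, here is a minimal sketch of the filter entry for the use case in the question, assuming the usual MaxDepthFilter semantics (outlinks are dropped once the tracked depth of the source page reaches maxDepth, so a value of 1 keeps the seed URLs plus their direct outlinks, while 0 suppresses outlinks entirely). It also relies on metadata.track.depth being set to true, which it already is in the configuration above:

     {
         "class": "com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter",
         "name": "MaxDepthFilter",
         "params": {
             "maxDepth": 1
         }
     }

With maxDepth set to 1, the pages linked from the seed are still fetched, but their own outlinks are filtered out, which matches the "seed URL plus its outlinks" requirement.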

If you don't want any outlinks at all when parsing a page, simply set parser.emitOutlinks to false in the configuration.
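As a minimal crawler-conf.yaml fragment (using the same property name as in the question's configuration), that would be:

  parser.emitOutlinks: false

Note that with this setting only the injected seed URLs are fetched, so for the "seed plus its direct outlinks" case the MaxDepthFilter approach above is likely the better fit.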
