StormCrawler - how does the es.status.filterQuery work?

Question

I am using StormCrawler to put data into some Elasticsearch indexes, and I have a bunch of URLs in the status index with a variety of statuses: DISCOVERED, FETCHED, ERROR, etc.

I was wondering whether I could tell StormCrawler to crawl only the URLs that are https and have the status DISCOVERED, and whether that would actually work. I have the es-conf.yaml set as follows:

es.status.filterQuery: "-(url:https* AND status:DISCOVERED)"

Is that correct? How does SC make use of the es.status.filterQuery? Does it run a search and apply the value as a filter to retrieve only the applicable documents to fetch?

Answer 1

See the code of the AggregationSpout.

how does SC make use of the es.status.filterQuery? Does it run a search and apply the value as a filter to retrieve only the applicable documents to fetch?

Yes, it filters the queries sent to the ES shards. This is useful, for instance, to process a subset of a crawl.

It is a positive filter, i.e. the documents must match the query in order to be retrieved; you'd need to remove the leading - for it to do what you described.
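
For illustration, here is what the setting from the question would look like with the negation dropped. This is only a sketch: the field names url and status are taken from the query in the question, and whether the https* wildcard matches as intended depends on how the url field is mapped in your status index, which is assumed here rather than checked.

```yaml
# es-conf.yaml (sketch)
# Positive filter: only documents matching this query are retrieved by the spout.
# Without the leading "-", this keeps https URLs whose status is DISCOVERED.
es.status.filterQuery: "url:https* AND status:DISCOVERED"
```

With the original leading -, the filter would instead have kept everything except the https URLs with status DISCOVERED, which is the opposite of what was asked for.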

solution1
1 ACCEPTED 2019-04-26 13:09:32