StormCrawler - how does es.status.filterQuery work?

I am using StormCrawler to put data into some Elasticsearch indexes, and I have a bunch of URLs in the status index with a variety of statuses: DISCOVERED, FETCHED, ERROR, etc.

I was wondering whether I could tell StormCrawler to crawl only the URLs that are https and have the status DISCOVERED, and whether that would actually work. I have es-conf.yaml set as follows:

es.status.filterQuery: "-(url:https* AND status:DISCOVERED)"

Is that correct? How does SC make use of es.status.filterQuery? Does it run a search and apply the value as a filter to retrieve only the applicable documents to fetch?

See the code of the AggregationSpout.

> How does SC make use of the es.status.filterQuery? Does it run a search and apply the value as a filter to retrieve only the applicable documents to fetch?

Yes, it filters the queries sent to the ES shards. This is useful, for instance, to process a subset of a crawl.

It is a positive filter, i.e. the documents must match the query in order to be retrieved; you'd need to remove the leading - for it to do what you described.
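For illustration, here is the setting from the question with the leading - removed. This is only a sketch, assuming the single-string form used in the question; the exact syntax accepted may vary between StormCrawler versions, so check it against the one you run.

```yaml
# es-conf.yaml (fragment)
# Positive filter: only documents matching this query are retrieved by the spout.
# With a leading "-" the expression is negated, so it would instead select
# everything EXCEPT https URLs with status DISCOVERED.
es.status.filterQuery: "url:https* AND status:DISCOVERED"
```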
