
StormCrawler, the status index and re-crawling

So we have StormCrawler running successfully, with the main index currently holding a little over 2 million URLs from our various websites. That is working well; however, SC does not seem to be re-indexing URLs it indexed previously, and I am trying to work out why.

I have tried searching for details on exactly how SC chooses its next URL from the status index. It does not seem to choose the oldest nextFetchDate, because we have docs in the status index with a nextFetchDate of Feb 3rd, 2019.

Looking through the logs, I see entries like:

2019-03-20 09:21:17.221 c.d.s.e.p.AggregationSpout Thread-29-spout-executor[17 17] [INFO] [spout #5]  Populating buffer with nextFetchDate <= 2019-03-20T09:21:17-04:00

and that seems to imply that SC does not look at any URL in the status index with a date in the past. Is that correct? If SC gets overwhelmed with a slew of URLs and cannot crawl all of them by their nextFetchDate, do some fall through the cracks?

Doing a query for documents in the status index with a nextFetchDate older than today, I see that 1.4 million of the 2 million URLs have a nextFetchDate in the past.

It would be nice if the crawler could fetch the URL with the oldest nextFetchDate and start crawling there.

How do I re-queue those URLs that were missed on their nextFetchDate?

By default, the ES spouts fetch the oldest records first. What the logs show does not contradict that: the spout is asking for the records with a nextFetchDate no later than 20th March, for shard #5.

The nextFetchDate should really be read as 'do not crawl before date D'; nothing falls through the cracks.
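To illustrate that semantics, here is a minimal Python sketch with toy data (not StormCrawler's actual code): a record whose nextFetchDate is already in the past stays eligible, and eligible records are simply returned oldest first.

```python
from datetime import datetime, timezone

# Toy "status index": (nextFetchDate, url) records.
records = [
    (datetime(2019, 2, 3, tzinfo=timezone.utc), "http://example.com/old"),
    (datetime(2019, 3, 19, tzinfo=timezone.utc), "http://example.com/recent"),
    (datetime(2030, 1, 1, tzinfo=timezone.utc), "http://example.com/future"),
]

def eligible(records, now):
    """Return URLs whose nextFetchDate <= now, oldest first.

    This mirrors the 'do not crawl before date D' rule: a past
    nextFetchDate never disqualifies a URL, it just makes it due.
    """
    due = [r for r in records if r[0] <= now]
    return [url for _, url in sorted(due)]

now = datetime(2019, 3, 20, tzinfo=timezone.utc)
print(eligible(records, now))
# → ['http://example.com/old', 'http://example.com/recent']
```

Note that the URL from Feb 3rd comes out first, and only the future-dated URL is held back.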

Doing a query for documents in the status index with a nextFetchDate older than today, I see that 1.4 million of the 2 million URLs have a nextFetchDate in the past.

Yep, that's normal.
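For reference, the kind of range query that produces such a count can be sketched as follows. This is a hypothetical illustration: it only builds the Elasticsearch query body, and it assumes the index field is named nextFetchDate as in the logs above (the index name and endpoint are up to your setup).

```python
import json

def overdue_count_query(now_iso):
    """Build an Elasticsearch query body matching status records
    whose nextFetchDate is at or before the given timestamp."""
    return {
        "query": {
            "range": {
                "nextFetchDate": {"lte": now_iso}
            }
        }
    }

body = overdue_count_query("2019-03-20T09:21:17-04:00")
print(json.dumps(body))
# Send this body to the _count endpoint of the status index,
# e.g. with curl or an Elasticsearch client of your choice.
```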

It would be nice if the crawler could fetch the URL with the oldest nextFetchDate and start crawling there.

That's what it does.

How do I re-queue those URLs that were missed on their nextFetchDate?

They are not missed; they should be picked up by the spouts.

Maybe check that the number of spouts matches the number of shards you have on the status index. Each spout instance is in charge of one shard; if you have fewer instances than shards, the extra shards will never be queried.
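The arithmetic behind that check can be made concrete. A small illustrative sketch, assuming (as described above) that spout instance i is in charge of shard i, so any shard beyond the last spout instance is never read:

```python
def uncovered_shards(num_shards, num_spouts):
    """Shards that no spout instance will ever query, under the
    one-spout-instance-per-shard scheme described above (spout i
    reads shard i)."""
    covered = set(range(min(num_shards, num_spouts)))
    return sorted(set(range(num_shards)) - covered)

# e.g. a status index with 10 shards but only 5 spout instances:
print(uncovered_shards(10, 5))
# → [5, 6, 7, 8, 9]
```

Half the shards, and so roughly half the URLs, would sit untouched in that configuration, which looks exactly like URLs "falling through the cracks".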

Inspect the logs for the particular URLs which should be fetched first: do they get emitted by the spouts at all? You might need to turn the logging up to DEBUG for that.
