
Nutch/Elasticsearch terms definition

I used Nutch and Elasticsearch to crawl and parse 99 websites/links and index them in Elasticsearch so that I can use the search engine. It crawled all 99 websites/links, but the final message I get is shown below. I am trying to understand what "gone", "redirects" and "add/update" mean. Is it also possible to find out which URLs are gone and which are redirects?

Indexer: number of documents indexed, deleted, or skipped:
Indexer:      5  deleted (gone)
Indexer:      8  deleted (redirects)
Indexer:     76  indexed (add/update)
Indexer: finished at 2020-12-17 13:07:19, elapsed: 00:00:08

Nutch does not know whether a page is already in the index. In order to keep the index and the crawled content in sync,

  • successfully fetched pages are sent to the index and counted as additions or updates
  • with the indexer option -deleteGone, 404s and otherwise failed fetches are deleted from the index and counted as "gone"
  • redirects are handled the same way but counted separately as "redirects"

And if it is possible to find out which are gone and redirects?

You can use the Nutch tools

  • readdb to dump the CrawlDb
  • readseg to dump the segment which was indexed

and then search the dumps for 404s, fetch failures, redirects, etc. Calling bin/nutch readdb or bin/nutch readseg will show you all available command-line options for the respective tool.
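As a rough sketch of that workflow: assuming a Nutch 1.x setup where the CrawlDb has been dumped as plain text (for example with bin/nutch readdb crawl/crawldb -dump crawldb_dump, where both paths are placeholders), the helper below lists the URLs whose status matches a given name. It assumes each record in the dump starts with a "URL<TAB>Version: N" line followed by a "Status: N (db_xxx)" line. Check a few lines of your own dump first, since the layout may differ between versions. The function name list_urls_by_status is made up for this example.

```shell
# Hypothetical helper, not part of Nutch.
# Usage: list_urls_by_status <status-name-prefix> <dump-file>
# Assumed record layout of the plain-text CrawlDb dump:
#   http://example.com/<TAB>Version: 7
#   Status: 3 (db_gone)
#   ...
list_urls_by_status() {
  awk -v want="$1" '
    # Record header line: remember the URL (first whitespace-separated field).
    /\tVersion: / { url = $1 }
    # Status line: print the remembered URL if the status name starts with the prefix.
    /^Status: /   { if (index($0, "(" want)) print url }
  ' "$2"
}
```

With the dump in place, list_urls_by_status db_gone crawldb_dump/part-r-00000 would print the gone URLs, and the prefix db_redir would match both temporary and permanent redirects (the part file name is again an assumption about your setup). The segment can be dumped in a similar way with bin/nutch readseg -dump and searched for fetch statuses.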

"Gone" means that the website or document is no longer available. This can occur if the website or document has been deleted or if the URL has changed.

"Redirects" means that the website or document has been moved to a new URL, so requests for the old URL are automatically forwarded to the new one. This is often done to change the URL of a website or document or to consolidate multiple URLs into one.

The "add/update" status means that the website or document has been successfully indexed and either added as a new entry in the Elasticsearch index or updated if it already exists.

To find out which websites or documents were deleted or redirected, you can check the logs or try accessing the URLs of the websites or documents to see if they are still available or if they redirect to a new URL. You can also check the Elasticsearch index to see if the websites or documents are still present.
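A minimal sketch of the manual check, assuming curl is installed: it requests only the headers of a URL without following redirects and sorts the HTTP status code into a rough category. The function names classify_status and check_url are invented for this example, and the mapping is only an approximation. Nutch's own "gone" count also includes other failed fetches, so the results may not match the indexer output exactly.

```shell
# Hypothetical helpers for spot-checking URLs by hand.

# classify_status <http-status-code>: map a code to a rough category.
classify_status() {
  case "$1" in
    2??)                 echo "ok" ;;
    301|302|303|307|308) echo "redirect" ;;
    404|410)             echo "gone" ;;
    *)                   echo "other" ;;
  esac
}

# check_url <url>: send a HEAD request (no redirect following) and print
# the URL together with its category.
check_url() {
  code=$(curl -s -o /dev/null -I -w '%{http_code}' "$1")
  echo "$1 $(classify_status "$code")"
}
```

Calling check_url on each seed URL, for example in a loop over the seed file, gives a quick overview. Some servers answer HEAD requests differently from GET requests, so treat unexpected results with caution.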
