What happens when a previously "FETCHED" URL is removed on the web server side and StormCrawler visits it again?
We have lots of sites being updated, added, and deleted. I'm curious how StormCrawler (SC) handles a URL that was previously "FETCHED" but, the next time SC reaches it, has been removed and returns either a redirect or a 404. What happens to the content from the old version of the page in the "Index" index?
I know the URL's entry in the "Status" index probably changes to "REDIRECTION" or "FETCH_ERROR" or something similar, but what about the content itself? Is it deleted? Is it left in place? I am trying to figure out how SC reacts here and whether I have to clean up these orphaned docs in the "Index" index myself.
I would expect SC to delete the content if the page is no longer there, but I thought I would ask to be sure.
As you pointed out, a missing URL gets a FETCH_ERROR status. Once the fetch has been retried a set number of times (parameter max.fetch.errors, default 3), the status turns into ERROR.
The content will be deleted if you connect a DeletionBolt to the status updater; see the example topology.
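To make the sequence above concrete, here is a minimal, self-contained sketch of the described behaviour. It is a toy model, not StormCrawler code: the class FetchErrorSketch, its fields, and its methods are hypothetical names invented for illustration, and it assumes that the failure which brings the count up to max.fetch.errors is the one that flips the status to ERROR. The deleterConnected flag stands in for whether a DeletionBolt is wired to the status updater.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the behaviour described in the answer.
// None of these names are StormCrawler classes or APIs.
class FetchErrorSketch {
    enum Status { FETCHED, FETCH_ERROR, ERROR }

    // Plays the role of the max.fetch.errors parameter (default 3 per the answer).
    private final int maxFetchErrors;
    // Stands in for "a DeletionBolt is connected to the status updater".
    private final boolean deleterConnected;
    // Number of failed fetches recorded per URL since the last success.
    private final Map<String, Integer> errorCounts = new HashMap<>();
    // Toy stand-ins for the "Status" and "Index" indices.
    final Map<String, Status> statusIndex = new HashMap<>();
    final Map<String, String> contentIndex = new HashMap<>();

    FetchErrorSketch(int maxFetchErrors, boolean deleterConnected) {
        this.maxFetchErrors = maxFetchErrors;
        this.deleterConnected = deleterConnected;
    }

    // A successful fetch resets the error count and (re)indexes the content.
    void fetched(String url, String content) {
        errorCounts.remove(url);
        statusIndex.put(url, Status.FETCHED);
        contentIndex.put(url, content);
    }

    // A failed fetch is FETCH_ERROR until the count reaches the limit, then ERROR.
    Status fetchFailed(String url) {
        int failures = errorCounts.merge(url, 1, Integer::sum);
        Status status = failures >= maxFetchErrors ? Status.ERROR : Status.FETCH_ERROR;
        statusIndex.put(url, status);
        // Only a connected deleter removes the old document from the content index.
        if (status == Status.ERROR && deleterConnected) {
            contentIndex.remove(url);
        }
        return status;
    }

    public static void main(String[] args) {
        FetchErrorSketch crawl = new FetchErrorSketch(3, true);
        String url = "http://example.com/page";
        crawl.fetched(url, "old version of the page");
        for (int attempt = 1; attempt <= 3; attempt++) {
            System.out.println("attempt " + attempt + ": " + crawl.fetchFailed(url));
        }
        System.out.println("still indexed: " + crawl.contentIndex.containsKey(url));
    }
}
```

Constructing the model with deleterConnected set to false leaves the old document in contentIndex even after the status becomes ERROR, which corresponds to the orphaned docs the question asks about.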