What happens when a previously "FETCHED" URL is removed on the web server side and StormCrawler goes to it again?

We have lots of sites that are being updated, added, and deleted. I'm curious how StormCrawler handles a URL that was previously "FETCHED" but, the next time SC reaches it, has been removed and returns either a redirect or a 404. What happens to the content from the old version of the page in the "Index" index?

I know the URL's entry in the "Status" index probably changes to "REDIRECTION" or "FETCH ERROR" or something similar, but what about the content itself? Is it deleted, or is it left in place? I am trying to figure out how SC reacts here and whether I have to clean up these orphaned docs in the "Index" index myself.

I would expect SC to delete the content if the page is no longer there, but I thought I would ask to be sure.

As you pointed out, a missing URL will get a FETCH_ERROR status, which, after being retried a number of times (parameter max.fetch.errors, default 3), will turn into an ERROR status.

The content will be deleted if you connect a DeletionBolt to the status updater; see the example topology.
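To make the sequence described above concrete, here is a small self-contained model of it in plain Java. It is only a sketch and not StormCrawler code: the class `FetchErrorModel`, its methods, and the in-memory maps standing in for the "Status" and "Index" indices are all hypothetical, and it assumes the switch to ERROR happens on the failure that brings the count up to the max.fetch.errors value.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model, not part of StormCrawler.
public class FetchErrorModel {
    enum Status { FETCHED, FETCH_ERROR, ERROR }

    // Stand-in for the max.fetch.errors parameter (default 3 per the answer).
    static final int MAX_FETCH_ERRORS = 3;

    // URL -> number of failed fetches since the last success.
    private final Map<String, Integer> errorCounts = new HashMap<>();
    // Toy stand-in for the "Index" index: URL -> content.
    private final Map<String, String> contentIndex = new HashMap<>();
    // Models whether something like a DeletionBolt is wired to the status updater.
    private final boolean deletionEnabled;

    FetchErrorModel(boolean deletionEnabled) {
        this.deletionEnabled = deletionEnabled;
    }

    // A successful fetch stores the content and resets the error counter.
    Status recordSuccessfulFetch(String url, String content) {
        contentIndex.put(url, content);
        errorCounts.remove(url);
        return Status.FETCHED;
    }

    // A failed fetch (e.g. a 404) bumps the counter; once the counter reaches
    // the limit the URL becomes ERROR and, if deletion is enabled, the old
    // content is removed from the toy index.
    Status recordFailedFetch(String url) {
        int count = errorCounts.merge(url, 1, Integer::sum);
        if (count < MAX_FETCH_ERRORS) {
            return Status.FETCH_ERROR;
        }
        if (deletionEnabled) {
            contentIndex.remove(url);
        }
        return Status.ERROR;
    }

    boolean isIndexed(String url) {
        return contentIndex.containsKey(url);
    }

    public static void main(String[] args) {
        String url = "http://example.com/page";
        FetchErrorModel model = new FetchErrorModel(true);
        model.recordSuccessfulFetch(url, "old content");
        for (int attempt = 1; attempt <= MAX_FETCH_ERRORS; attempt++) {
            Status status = model.recordFailedFetch(url);
            System.out.println("attempt " + attempt + ": " + status
                    + ", indexed=" + model.isIndexed(url));
        }
    }
}
```

Constructing the model with `false` instead of `true` mimics a topology without a deletion bolt: the URL still ends up as ERROR, but the stale document stays in the toy index, which is the orphaned-document situation the question asks about.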

