
stormcrawler: indexer.md.mapping - what happens if the metadata tag does not exist?

We have been having a weird issue with StormCrawler 1.13. On some (but not all) of our sites, we have a <meta name="college" content="thiscollege"/> tag, and SC has indexer.md.mapping set to - parse.college=college . This seems to work correctly for the sites that have that meta tag set.

The problem we are running into is this: if the meta tag is set to thiscollege1 for pages 3.html, 4.html, and 5.html, and the crawler then hits page25.html, which does not have the meta tag, it appears to reuse the value thiscollege1 from 5.html and stuff it into the college field in the Elastic index.

Is there a way to zero out or unset that variable every time the crawler moves to a new page, so that the value is not carried over?

Any advice on how to tweak this setting would be most appreciated!

It's been a bugger of a problem to chase down, as some records just seem to have random entries in them. It wasn't until I matched up the records with some of the status records, sorted by NextFetchDate, that I saw that it could be a carried-over variable. I am going to try to set up a test with just a couple of pages to prove or disprove the theory, but right now it's the only thing that fits what is happening.

Any ideas welcome!

This should only happen if you have listed parse.college in the value of the metadata.transfer config.
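As a rough sketch of what to check, the fragment below shows how the two settings might sit together in the crawler configuration file. It assumes the usual YAML layout with a top-level config key; the indexer.md.mapping entry is the one quoted in the question, while customMetadataName is only a placeholder for whatever other keys you transfer. Keys listed under metadata.transfer are passed from a page to its outlinks, so leaving parse.college out of that list should stop a page without the meta tag from inheriting the value.

```yaml
config:
  # Mapping from the question: index the parsed "college" meta tag
  # into the "college" field.
  indexer.md.mapping:
    - parse.college=college

  # Metadata keys listed here are transferred to outlinks.
  # Assumption: parse.college was listed here, which would explain
  # the carried-over value; it is commented out in this sketch.
  metadata.transfer:
    - customMetadataName   # placeholder key, not from the question
    # - parse.college
```

Pages that were already discovered while parse.college was being transferred may still carry the old value in their stored status metadata, so it is worth checking the status records after changing the setting.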


 