What are the proper StormCrawler settings to capture a meta tag into the index?

Question

UPDATE: I figured it out. See the bottom, but feel free to correct me if I missed anything.

What are the proper settings in crawler-conf.yaml (and elsewhere, if needed) for the info from the following meta tag:

<meta name="college" content="artdesign"/>

to be properly captured into an index with the field name of either 'college' or 'seed'?

I see the following settings that may need to be set, but I have tried several variations on them, and the data does not seem to be captured.

In crawler-conf.yaml:

  # lists the metadata to persist to storage
  # these are not transferred to the outlinks
  metadata.persist:
   - _redirTo
   - error.cause
   - error.source
   - isSitemap
   - isFeed
   - college
   - seed

I'm not sure whether 'persist to storage' means the data goes into an index.

The other option in crawler-conf.yaml is:

# configuration for the classes extending AbstractIndexerBolt
  indexer.md.mapping:
  - parse.title=title
  - parse.keywords=keywords
  - parse.description=description
  - domain=domain
  - college=college
  - college=seed

I had previously asked about the fact that, for a while, some values for 'seed' seemed to be propagating to fetched documents that did not have the meta tag. That setting was:

  # metadata to transfer to the outlinks
  # used by Fetcher for redirections, sitemapparser, etc...
  # these are also persisted for the parent document (see below)
  # metadata.transfer:
  # - seed

So my question, as asked in the title, is: how do I configure these options in crawler-conf.yaml (or any other config) to reliably capture the data from the meta tag listed at the top of this question, without also propagating it to fetched documents that do not have that meta tag?

Answer 1

Here's what I sorted out. The 'parse.title' referenced in the config quoted above is the key of a custom entry under the top class in the src/main/resources/parsefilters.json file; that entry is what retrieves the value from the page. I went in there and added a

"parse.college": "//META[@name=\"college\"]/@content"

line underneath the ones that were already there, but still within the top class.

I then changed the reference to college under indexer.md.mapping to read - parse.college=college, rebuilt the crawler, and ran it. It then started properly grabbing the <meta name="college" content="artdesign"/> tag and sending its content to a college field in the index.
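For reference, here is a sketch of how the indexer.md.mapping block from the question might look after that change. It assumes the other mapping lines stay as they were in the question and that the college=seed line is dropped; the answer above does not say what happened to that line, so treat its removal as a guess.

```yaml
  # configuration for the classes extending AbstractIndexerBolt
  indexer.md.mapping:
  - parse.title=title
  - parse.keywords=keywords
  - parse.description=description
  - domain=domain
  # was "college=college"; the left-hand side must match the
  # "parse.college" key added to parsefilters.json
  - parse.college=college
  # assumption: the old "college=seed" line is removed here
```

The left-hand side of each line is the metadata key and the right-hand side is the index field name, so the key has to be spelled exactly as it is in parsefilters.json.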

Answer 1 (accepted), 2019-06-11 01:47:04
