What are the proper StormCrawler settings to capture a meta tag into an index?

Question

UPDATE: I figured it out; see the answer at the bottom. Feel free to correct me if I missed anything.

What are the proper settings in the crawler-conf.yaml (and elsewhere, if needed) for the info from the following meta-tag:

<meta name="college" content="artdesign"/>

to be properly captured into an index with the field name of either 'college' or 'seed'?

I see the following settings that may need to be set, but I have tried several variations of them and the data does not seem to be captured.

In crawler-conf.yaml:

  # lists the metadata to persist to storage
  # these are not transferred to the outlinks
  metadata.persist:
   - _redirTo
   - error.cause
   - error.source
   - isSitemap
   - isFeed
   - college
   - seed

I am not sure whether 'persist to storage' means the values are written into the index.

The other option in the crawler-conf.yaml is:

# configuration for the classes extending AbstractIndexerBolt
  indexer.md.mapping:
  - parse.title=title
  - parse.keywords=keywords
  - parse.description=description
  - domain=domain
  - college=college
  - college=seed

I had previously asked about the fact that, for a while, some values for 'seed' seemed to be propagating to fetched documents that did not have the meta tag. The setting involved was:

  # metadata to transfer to the outlinks
  # used by Fetcher for redirections, sitemapparser, etc...
  # these are also persisted for the parent document (see below)
  # metadata.transfer:
  # - seed

So my question, as asked in the title, is how do I configure these options in the crawler-conf.yaml (or any other config) to reliably capture the data from the meta tag listed at the top of this question, without also propagating it to fetched documents that do not have that meta tag?

Answer 1

Here's what I sorted out. The 'parse.title' referenced in the config quoted above is a metadata key, and its value is retrieved by an entry with that same key under the top class in the src/main/resources/parsefilters.json file. I went in there and added this line:

"parse.college": "//META[@name=\"college\"]/@content"

underneath the entries that were already there, but still within the top class.

I then changed the college entry under indexer.md.mapping to read '- parse.college=college', rebuilt the crawler, and ran it. It then started properly picking up the <meta name="college" content="artdesign"/> tag and sending its content to a college field in the index.
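
For reference, the indexer.md.mapping section after that change might look roughly like the sketch below. It assumes the earlier college=college and college=seed attempts are dropped, which the answer does not state explicitly, and it keeps the other entries as quoted in the question. The indentation in the real file depends on where the section sits.

```yaml
# crawler-conf.yaml (fragment, sketch only)
# Assumption: the earlier "college=college" and "college=seed" lines
# are removed; the answer only mentions changing the college entry.

# configuration for the classes extending AbstractIndexerBolt
indexer.md.mapping:
  - parse.title=title
  - parse.keywords=keywords
  - parse.description=description
  - domain=domain
  # metadata key filled in by the "parse.college" entry in parsefilters.json
  - parse.college=college
```

In each entry the left-hand side is the metadata key and the right-hand side is the index field name, so the key has to match the one declared in parsefilters.json exactly.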
