StormCrawler not parsing Tika metadata
When I add the Tika parser to my StormCrawler topology, no information is extracted by it and stored in Elasticsearch.
es-crawler.flux
includes:
    - resource: true
      file: "/crawler-default.yaml"
      override: false

    - resource: false
      file: "crawler-conf.yaml"
      override: true

    - resource: false
      file: "es-conf.yaml"
      override: true

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
    parallelism: 10

  - id: "filespout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "."
      - "seeds.txt"
      - true

bolts:
  - id: "filter"
    className: "com.digitalpebble.stormcrawler.bolt.URLFilterBolt"
    parallelism: 1
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 1
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 1
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 1
  - id: "tika_redirection"
    className: "com.digitalpebble.stormcrawler.tika.RedirectionBolt"
    parallelism: 1
  - id: "tika_parser"
    className: "com.digitalpebble.stormcrawler.tika.ParserBolt"
    parallelism: 1
  - id: "index"
    className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt"
    parallelism: 1
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 1
  - id: "status_metrics"
    className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
    parallelism: 1

streams:
  - from: "spout"
    to: "partitioner"
    grouping:
      type: SHUFFLE

  - from: "spout"
    to: "status_metrics"
    grouping:
      type: SHUFFLE

  - from: "partitioner"
    to: "fetcher"
    grouping:
      type: FIELDS
      args: ["key"]

  - from: "fetcher"
    to: "sitemap"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "sitemap"
    to: "parse"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "parse"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "fetcher"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "sitemap"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "parse"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "parse"
    to: "tika_redirection"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "tika_redirection"
    to: "tika_parser"
    grouping:
      type: LOCAL_OR_SHUFFLE
      streamId: "tika"

  - from: "tika_parser"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "tika_parser"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "index"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "filespout"
    to: "filter"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "filter"
    to: "status"
    grouping:
      streamId: "status"
      type: CUSTOM
      customClass:
        className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
        constructorArgs:
          - "byDomain"
I added these settings to crawler-conf.yaml:
crawler-conf.yaml
parser.mimetype.whitelist:
  - application/.*pdf.*

jsoup.treat.non.html.as.error: false
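If the whitelist entries are treated as Java regular expressions matched against the document's content type (as the `.*` in the entry suggests), a quick standalone check can confirm which MIME types the pattern above would let through. The content-type strings below are illustrative values, not taken from the crawl:

```java
import java.util.regex.Pattern;

public class MimeWhitelistCheck {
    public static void main(String[] args) {
        // The whitelist entry from crawler-conf.yaml
        Pattern pdf = Pattern.compile("application/.*pdf.*");

        // Example content types a fetcher might report
        System.out.println(pdf.matcher("application/pdf").matches());          // true
        System.out.println(pdf.matcher("application/x-pdf").matches());        // true
        System.out.println(pdf.matcher("text/html; charset=utf-8").matches()); // false
    }
}
```

So PDFs should pass the whitelist; if they are still not parsed, the problem lies elsewhere in the pipeline.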
I also see the following log entry when running the topology:
16:27:29.867 [Thread-43-tika_parser-executor[22, 22]] INFO c.d.s.t.ParserBolt - skipped_trimmed -> http://cds.iisc.ac.in/wp-content/uploads/DS256.2017.Storm_.Tutorial.pdf
I would prefer to extract all possible fields from a PDF and store the per-page content as an array, so that each page becomes one element of an array in Elasticsearch.
See ParserBolt: parsing does not happen if the document was trimmed during fetching (which is what the skipped_trimmed log line indicates). You can disable the trimming in the conf with:
http.content.limit: -1
This should get the documents parsed with Tika. The resulting metadata keys will have the prefix parse. You might need to write a custom bolt to massage the data into the format you want, i.e. one key per page in Elasticsearch.
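The page-splitting part of such a custom bolt could be sketched as below. This is a hypothetical helper, not StormCrawler API: it assumes the text extracted by Tika separates pages with form-feed characters (\f), which depends on how Tika is configured, so treat both the delimiter and the class name as assumptions to verify against your output:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for a custom bolt: splits the text produced by the Tika
// ParserBolt into per-page strings, so each page can be indexed as one element
// of an array field in Elasticsearch. Assumes pages are separated by form-feed
// characters (\f); adjust the delimiter to match your Tika setup.
public class PageSplitter {

    public static List<String> splitPages(String extractedText) {
        List<String> pages = new ArrayList<>();
        if (extractedText == null) {
            return pages;
        }
        for (String page : extractedText.split("\f")) {
            String trimmed = page.trim();
            if (!trimmed.isEmpty()) {
                pages.add(trimmed); // one entry per non-empty page
            }
        }
        return pages;
    }
}
```

In the actual bolt you would call a helper like this from execute(), attach the resulting list to the tuple before emitting it, and let the IndexerBolt map the list to an array field in Elasticsearch.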