
StormCrawler not parsing Tika metadata

When I add the Tika parser to StormCrawler, no information is extracted from the fields and stored in Elasticsearch.

es-crawler.flux


includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false

  - resource: false
    file: "crawler-conf.yaml"
    override: true

  - resource: false
    file: "es-conf.yaml"
    override: true

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
    parallelism: 10

  - id: "filespout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "."
      - "seeds.txt"
      - true

bolts:
  - id: "filter"
    className: "com.digitalpebble.stormcrawler.bolt.URLFilterBolt"
    parallelism: 1
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 1
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 1
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 1
  - id: "tika_redirection"
    className: "com.digitalpebble.stormcrawler.tika.RedirectionBolt"
    parallelism: 1
  - id: "tika_parser"
    className: "com.digitalpebble.stormcrawler.tika.ParserBolt"
    parallelism: 1
  - id: "index"
    className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt"
    parallelism: 1
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 1
  - id: "status_metrics"
    className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
    parallelism: 1

streams:
  - from: "spout"
    to: "partitioner"
    grouping:
      type: SHUFFLE
      
  - from: "spout"
    to: "status_metrics"
    grouping:
      type: SHUFFLE     

  - from: "partitioner"
    to: "fetcher"
    grouping:
      type: FIELDS
      args: ["key"]

  - from: "fetcher"
    to: "sitemap"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "sitemap"
    to: "parse"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "parse"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "fetcher"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "sitemap"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "parse"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "parse"
    to: "tika_redirection"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "tika_redirection"
    to: "tika_parser"
    grouping:
      type: LOCAL_OR_SHUFFLE
      streamId: "tika"

  - from: "tika_parser"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "tika_parser"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "index"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "filespout"
    to: "filter"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "filter"
    to: "status"
    grouping:
      streamId: "status"
      type: CUSTOM
      customClass:
        className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
        constructorArgs:
          - "byDomain"

I added these settings to crawler-conf.yaml:

crawler-conf.yaml

  parser.mimetype.whitelist:
    - application/.*pdf.*

  jsoup.treat.non.html.as.error: false

In addition, I found the following log entry when running the topology:

16:27:29.867 [Thread-43-tika_parser-executor[22, 22]] INFO  c.d.s.t.ParserBolt - skipped_trimmed -> http://cds.iisc.ac.in/wp-content/uploads/DS256.2017.Storm_.Tutorial.pdf

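I would like to extract all possible fields from the PDF and store the per-page information in an array, so that each page becomes one element of an array in Elasticsearch.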

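See ParserBolt: if a document was trimmed during fetching, it is not parsed. That is what the skipped_trimmed entry in your log indicates.

You can disable trimming in the conf: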

  http.content.limit: -1

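This should get the documents parsed with Tika. The resulting metadata will have the prefix parse. in its keys. You will probably need to write a custom bolt to handle the data in the format you want, i.e. one key per page in ES.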
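
To make this concrete, here is a minimal sketch of how the relevant part of crawler-conf.yaml might look once trimming is disabled. The first three settings come from the question and the answer. The indexer.md.mapping entries are an assumption on my part: the names on the left (parse.dc:title, parse.xmpTPg:NPages) are typical Tika PDF metadata names with the parse. prefix added, and the Elasticsearch field names on the right are hypothetical. Check which keys your documents actually produce before relying on them.

```yaml
config:
  # Let non-HTML documents pass through JSoupParserBolt so they can be redirected to Tika
  jsoup.treat.non.html.as.error: false

  # Only documents whose content type matches one of these patterns are parsed by Tika
  parser.mimetype.whitelist:
    - application/.*pdf.*

  # -1 disables trimming, so ParserBolt no longer skips large documents as skipped_trimmed
  http.content.limit: -1

  # Assumption: copy selected "parse." metadata into the indexed document.
  # Left side = metadata key (illustrative), right side = ES field name (hypothetical).
  indexer.md.mapping:
    - parse.dc:title=title
    - parse.xmpTPg:NPages=page_count
```

As far as I know, the indexer only writes metadata keys that are listed in the mapping, so unmapped parse. keys will not show up in the index even when Tika has extracted them. Since crawler-conf.yaml is loaded with override: true in the Flux file, a mapping defined here replaces any default one rather than adding to it.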
