Read multi-line JSON using Apache Beam / Google Cloud Dataflow

Question

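I am trying to read a (multi-line) JSON file in a pipeline, but beam.io.ReadFromText(somefile.json) reads it one line at a time.

I want to read the file's content as JSON so that I can apply a map over each category and download the corresponding products file.

This is what my JSON file (productindex.json) looks like: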

{
  "productcategories" : {
    "category1" : {
      "productfile" : "http://products.somestore.com/category1/products.json"
    },
    "category2" : {
      "productfile" : "http://products.somestore.com/category2/products.json"
    },
    "category3" : {
      "productfile" : "http://products.somestore.com/category3/products.json"
    },
    "category4" : {
      "productfile" : "http://products.somestore.com/category4/products.json"
    }
  }
}

This is what the beginning of my pipeline looks like:

with beam.Pipeline(options=pipeline_options) as p:
    rows = (
        p | beam.io.ReadFromText(
            "http://products.somestore.com/allproducts/productindex.json")
    )

I am using the apache-beam[gcp] module.

How can I achieve this?

Answer 1

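Apache Beam / Cloud Dataflow does not directly support reading multi-line JSON data.

The main reason is that it is hard to do in parallel. How would Beam know where each record ends? That is easy for a single reader, but very complicated for parallel readers, each of which starts somewhere in the middle of the file.

The best solution I can recommend is to convert your JSON data to newline-delimited JSON (NDJSON) before Beam / Dataflow processes it. This may be as simple as changing the output format written by the upstream task, or it may require a preprocessing step.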
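To make the record-boundary problem concrete, here is a small sketch in plain Python (no Beam involved). It assumes a shortened, hypothetical copy of productindex.json and a helper named parseable_lines that is not part of any library. It checks which individual lines parse as JSON on their own, which is roughly what a line-oriented reader has to work with.

```python
import json

# Hypothetical, shortened sample mirroring the structure of productindex.json.
MULTILINE = """{
  "productcategories": {
    "category1": {"productfile": "http://products.somestore.com/category1/products.json"},
    "category2": {"productfile": "http://products.somestore.com/category2/products.json"}
  }
}"""


def parseable_lines(text):
    """Return the lines of `text` that are complete JSON documents on their own."""
    good = []
    for line in text.splitlines():
        try:
            json.loads(line)
        except json.JSONDecodeError:
            continue  # a fragment such as '{' or '"category1": {...},'
        good.append(line)
    return good


# Every line of the pretty-printed document is only a fragment.
multiline_ok = parseable_lines(MULTILINE)

# The same data as NDJSON: one complete record per line.
NDJSON = "\n".join(
    json.dumps({"category": name, **info})
    for name, info in json.loads(MULTILINE)["productcategories"].items()
)
ndjson_ok = parseable_lines(NDJSON)

print(len(multiline_ok), len(ndjson_ok))
```

With the multi-line layout no single line is a usable record, whereas in the NDJSON layout any reader that starts at a line boundary gets whole records.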
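As a rough illustration of such a preprocessing step, the sketch below flattens a productindex-style document into NDJSON using only the standard library. The function name index_to_ndjson and the output field names "category" and "productfile" are assumptions made for this example, not something prescribed by Beam.

```python
import json


def index_to_ndjson(index_text):
    """Flatten a productindex-style JSON document into NDJSON text.

    Emits one line per entry under "productcategories". The output field
    names are illustrative choices, not a required schema.
    """
    index = json.loads(index_text)
    lines = []
    for name, info in index["productcategories"].items():
        record = {"category": name, "productfile": info["productfile"]}
        # json.dumps without an indent argument produces a single line.
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"


# Hypothetical input with the same shape as productindex.json.
sample = """
{
  "productcategories": {
    "category1": {"productfile": "http://products.somestore.com/category1/products.json"},
    "category2": {"productfile": "http://products.somestore.com/category2/products.json"}
  }
}
"""

ndjson_text = index_to_ndjson(sample)
print(ndjson_text, end="")
```

Once the output is written to a location Beam can read, each line can be read with beam.io.ReadFromText and parsed with beam.Map(json.loads), giving one element per category.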

Solution 1
1 2019-02-19 19:31:01
