Read multi-line JSON using Apache Beam / Google Cloud Dataflow

I am trying to read a multi-line JSON file in a pipeline, but beam.io.ReadFromText("somefile.json") reads one line at a time.

I want to read the content of the file as JSON so that I can map over each category and download the relevant products file.

This is what my JSON file (productindex.json) looks like:

{
  "productcategories" : {
    "category1" : {
      "productfile" : "http://products.somestore.com/category1/products.json"
    },
    "category2" : {
      "productfile" : "http://products.somestore.com/category2/products.json"
    },
    "category3" : {
      "productfile" : "http://products.somestore.com/category3/products.json"
    },
    "category4" : {
      "productfile" : "http://products.somestore.com/category4/products.json"
    }
  }
}

This is what the beginning of my pipeline looks like:

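import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Default options; the real pipeline would pass its own flags here.
pipeline_options = PipelineOptions()

with beam.Pipeline(options=pipeline_options) as p:
    rows = (
        p | beam.io.ReadFromText(
            "http://products.somestore.com/allproducts/productindex.json")
    )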

I am using the apache-beam[gcp] package.

How do I achieve this?

Apache Beam / Cloud Dataflow does not directly support reading multi-line JSON data.

The primary reason is that this is very hard to do in parallel. How does Beam know where each record ends? This is easy for a single reader, but very complicated for parallel readers.
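To make the record-boundary problem concrete, here is a minimal sketch using only the standard library. The two-category index and the NDJSON record layout (a "category" key next to "productfile") are assumptions made for illustration, not something from the question. It checks whether each line of a pretty-printed document is valid JSON on its own, then does the same for a newline-delimited version of the same data.

```python
import json

# Hypothetical two-category index, pretty-printed across several lines.
pretty = json.dumps(
    {"productcategories": {"category1": {"productfile": "a.json"},
                           "category2": {"productfile": "b.json"}}},
    indent=2,
)


def parses(text):
    """Return True if text is a complete JSON value on its own."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# A line-based reader hands out one line at a time; check each line alone.
pretty_ok = [parses(line) for line in pretty.splitlines()]

# The same data as NDJSON: one self-contained record per line
# (the record layout is an assumption for this sketch).
ndjson = "\n".join(
    json.dumps({"category": name, **entry})
    for name, entry in json.loads(pretty)["productcategories"].items()
)
ndjson_ok = [parses(line) for line in ndjson.splitlines()]

print(any(pretty_ok), all(ndjson_ok))
```

No single line of the pretty-printed form can be parsed in isolation, whereas every NDJSON line can, which is what lets independent readers each take a range of lines.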

The best solution that I can recommend is to convert your JSON data into newline-delimited JSON (NDJSON) before processing it with Beam / Dataflow. This may be as simple as changing the output format written by the upstream task, or it may require a pre-processing step.
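As a rough sketch of such a pre-processing step, the snippet below flattens an index shaped like productindex.json into one record per line. The function name index_to_ndjson and the output record layout ("category" plus "productfile") are hypothetical choices for this example, and the sample is shortened to two categories.

```python
import json


def index_to_ndjson(index_text):
    """Flatten the 'productcategories' object into one JSON record per line."""
    index = json.loads(index_text)
    lines = []
    for name, entry in index["productcategories"].items():
        # Hypothetical record layout: category name plus its products file URL.
        record = {"category": name, "productfile": entry["productfile"]}
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"


# Shortened sample in the same shape as the question's productindex.json.
sample = """
{
  "productcategories" : {
    "category1" : {
      "productfile" : "http://products.somestore.com/category1/products.json"
    },
    "category2" : {
      "productfile" : "http://products.somestore.com/category2/products.json"
    }
  }
}
"""

ndjson = index_to_ndjson(sample)
print(ndjson, end="")
```

Once the converted file is stored somewhere Beam can read, beam.io.ReadFromText emits one line per element, and a following beam.Map(json.loads) turns each line into a dict for the per-category download step.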
