Read multiline JSON using apache beam / google cloud dataflow

Question

I am trying to read a multi-line JSON file in the pipeline, but beam.io.ReadFromText(somefile.json) reads one line at a time.

I want to read the content of the file as JSON so that I can apply a map over each category and download the relevant products file.

This is what my JSON file (productindex.json) looks like:

{
  "productcategories" : {
    "category1" : {
      "productfile" : "http://products.somestore.com/category1/products.json"
    },
    "category2" : {
      "productfile" : "http://products.somestore.com/category2/products.json"
    },
    "category3" : {
      "productfile" : "http://products.somestore.com/category3/products.json"
    },
    "category4" : {
      "productfile" : "http://products.somestore.com/category4/products.json"
    }
  }
}

This is what the beginning of my pipeline looks like:

with beam.Pipeline(options=pipeline_options) as p:
    rows = (
        p | beam.io.ReadFromText(
            "http://products.somestore.com/allproducts/productindex.json")
    )

I am using the apache-beam[gcp] package.

How do I achieve this?

Answer 1

Apache Beam / Cloud Dataflow does not directly support reading multi-line JSON data.

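The primary reason is that this is very hard to do in parallel. A text source is split at line boundaries, and in a multi-line document a single line is only a fragment of a record. How does Beam know where each record ends? This is easy for a single reader that sees the whole file, but very complicated for parallel readers that each see only part of it.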
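To make the problem concrete, here is a minimal sketch in plain Python (no Beam involved) that parses a shortened copy of the index file from the question. The helper name parseable_lines is made up for this illustration. It shows that the document parses as a whole, while none of its individual lines is valid JSON on its own, which is what a line-based reader would hand to the next step.

```python
import json

# A shortened copy of the index document from the question.
DOCUMENT = """{
  "productcategories" : {
    "category1" : {
      "productfile" : "http://products.somestore.com/category1/products.json"
    }
  }
}"""


def parseable_lines(text):
    """Return the lines of `text` that are valid JSON documents on their own."""
    good = []
    for line in text.splitlines():
        try:
            json.loads(line)
        except json.JSONDecodeError:
            # The line is only a fragment of the enclosing document.
            continue
        good.append(line)
    return good


# Parsing the full text works and yields the nested dictionary.
whole = json.loads(DOCUMENT)

# Parsing line by line finds no usable records at all.
fragments = parseable_lines(DOCUMENT)
```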

The best solution that I can recommend is to convert your JSON data into newline-delimited JSON (NDJSON) before it is processed by Beam / Dataflow. This may be as simple as changing the output format written by the upstream task, or it may require a pre-processing step.
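A pre-processing step along these lines could look like the sketch below. It is only an illustration under assumptions: the function name index_to_ndjson and the output record shape (one object per category with "category" and "productfile" keys) are choices made here, not something the question prescribes.

```python
import json


def index_to_ndjson(index_text):
    """Convert the multi-line index document into NDJSON text.

    Assumes the layout shown in the question: a top-level
    "productcategories" object whose values each hold a "productfile" URL.
    """
    index = json.loads(index_text)
    lines = []
    for name, entry in index["productcategories"].items():
        # One self-contained JSON object per output line.
        record = {"category": name, "productfile": entry["productfile"]}
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"


# Small sample in the same shape as productindex.json.
SAMPLE = """{
  "productcategories" : {
    "category1" : {
      "productfile" : "http://products.somestore.com/category1/products.json"
    },
    "category2" : {
      "productfile" : "http://products.somestore.com/category2/products.json"
    }
  }
}"""

ndjson = index_to_ndjson(SAMPLE)
```

Once the converted text is written to a file the pipeline can reach, every line is a complete record, so the output of beam.io.ReadFromText can be passed through beam.Map(json.loads) to get one dictionary per category.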

solution1
1 2019-02-19 19:31:01