
How to write a splittable DoFn in python - convert json to ndjson in apache beam

I have a large dataset in GCS in JSON format that I need to load into BigQuery. The problem is that the data is not stored as NDJSON but rather as a few large JSON files, where each top-level key should really be a field of the record it points to.

For example - the following Json:

{
  "johnny": {
    "type": "student"
  }, 
  "jeff": {
    "type": "teacher"
  }
}

should be converted into

[ 
  {
    "name": "johnny",
    "type": "student"
  }, 
  {
    "name": "jeff",
    "type": "teacher"
  }
]

I am trying to solve it with Google Cloud Dataflow and Apache Beam, but the performance is terrible, since each worker has to do a lot of work:

import json

import apache_beam as beam


class JsonToNdJsonDoFn(beam.DoFn):
    def __init__(self, pk_field_name):
        self.__pk_field_name = pk_field_name

    def process(self, line):
        # `line` is an entire JSON file, so a single worker parses the whole
        # thing and emits every record by itself.
        for key, record in json.loads(line).items():
            record[self.__pk_field_name] = key
            yield record
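For context, this is roughly how the DoFn above ends up being used - each element is an entire JSON file, which is why one worker does all the parsing. This is only a minimal sketch; the bucket paths, the read_whole_file helper, and writing NDJSON with WriteToText (instead of WriteToBigQuery) are illustrative assumptions:

import json

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


def read_whole_file(path):
    # Reads the entire JSON file as one element - this is what makes each
    # worker do so much work, since one element == one large file.
    with FileSystems.open(path) as f:
        return f.read().decode('utf-8')


with beam.Pipeline() as p:
    (p
     | beam.Create(['gs://my-bucket/people.json'])          # placeholder file list
     | beam.Map(read_whole_file)
     | beam.ParDo(JsonToNdJsonDoFn(pk_field_name='name'))
     | beam.Map(json.dumps)                                  # one object per line = NDJSON
     | beam.io.WriteToText('gs://my-bucket/people'))         # or beam.io.WriteToBigQuery(...)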

I know that this can be solved somehow by implementing it as a Splittable DoFn - but the Python implementation example there is not really clear. How should I build this DoFn so that it is splittable, and how would it be used as part of the pipeline?

You need a way to specify a partial range of the JSON file to process. It could be a byte range, for example.

The Avro example in the blog post is a good one. Something like:

class MyJsonReader(DoFn):
  def process(self, filename, tracker=DoFn.RestrictionTrackerParam):
    with fileio.ChannelFactory.open(filename) as file:
      start, stop = tracker.current_restriction()
      # Seek to the first record starting at or after the start offset.
      file.seek(start)
      next_record_start = find_next_record(file, start)
      while True:
        # Claim the position of the current record.
        if not tracker.try_claim(next_record_start):
          # Out of range of the current restriction - we're done.
          return
        # read_record returns the record plus the offset at which the
        # next record starts.
        record, next_record_start = read_record(file, next_record_start)
        yield record

  def get_initial_restriction(self, filename):
    return (0, fileio.ChannelFactory.size_in_bytes(filename))
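Once the splittable DoFn exists, it is applied like any other ParDo - you feed it a PCollection of filenames and the runner splits the byte-range restrictions across workers for you. A minimal sketch of the wiring (the file list and output path are placeholders, and MyJsonReader is assumed to follow whatever SDF API your Beam version exposes):

import json

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create(['gs://my-bucket/people.json'])   # placeholder input file(s)
     | beam.ParDo(MyJsonReader())                    # runner splits the byte ranges
     | beam.Map(json.dumps)                          # emit one JSON object per line
     | beam.io.WriteToText('gs://my-bucket/people'))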

However, JSON doesn't have clear record boundaries, so if your work has to start at byte 548, there's no clear way to tell how far to shift before a record begins. If the file is literally shaped like your example, then you can skip bytes until you see the pattern "<string>": {, and then read the JSON object starting at the {. A sketch of that idea follows.
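One way to make find_next_record / read_record concrete for a file shaped exactly like the example is to scan forward for the "<key>": { pattern with a regular expression and hand the object part to json.JSONDecoder().raw_decode, which parses a single JSON value and reports where it stopped. The sketch below works on an in-memory string rather than a file handle, hard-codes "name" as the key field, and assumes records are not nested any deeper than in the example - it illustrates the scanning idea, not production code:

import json
import re

# Matches `"<key>": {` - the start of one record in files shaped like the example.
RECORD_START = re.compile(r'"([^"]+)"\s*:\s*\{')
_decoder = json.JSONDecoder()


def find_next_record(text, start):
    """Return the offset of the first record starting at or after `start`, or None."""
    m = RECORD_START.search(text, start)
    return m.start() if m else None


def read_record(text, record_start):
    """Parse one `"key": {...}` record; return (record_dict, offset_after_record)."""
    m = RECORD_START.match(text, record_start)
    key = m.group(1)
    # raw_decode parses exactly one JSON value and tells us where it ended;
    # m.end() - 1 points at the opening '{' of the record body.
    obj, end = _decoder.raw_decode(text, m.end() - 1)
    obj['name'] = key
    return obj, end


# Tiny demonstration on the example data:
data = '{"johnny": {"type": "student"}, "jeff": {"type": "teacher"}}'
pos = find_next_record(data, 0)
while pos is not None:
    record, pos = read_record(data, pos)
    print(record)                      # {'type': 'student', 'name': 'johnny'}, ...
    pos = find_next_record(data, pos)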
