
How can I parse JSON into a binary Avro file using the Python Avro API?

I am able to use avro-tools-1.7.7.jar to take JSON data and an Avro schema and output a binary Avro file, as shown here: https://github.com/miguno/avro-cli-examples#json-to-avro. However, I want to be able to do this programmatically using the Avro Python API: https://avro.apache.org/docs/1.7.7/gettingstartedpython.html.

In their example they show how you can write one record at a time into a binary Avro file.

    import avro.schema
    from avro.datafile import DataFileReader, DataFileWriter
    from avro.io import DatumReader, DatumWriter

    schema = avro.schema.parse(open("user.avsc").read())

    writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
    writer.append({"name": "Alyssa", "favorite_number": 256})
    writer.append({"name": "Ben", "favorite_number": 7, "favorite_color": "red"})
    writer.close()

My use case is writing all of the records at once, like the avro-tools jar does from a JSON file, just in Python code. I do not want to shell out and execute the jar. This will be deployed to Google App Engine, if that matters.

This can be accomplished with fastavro. For example, given the schema in the link:

twitter.avsc

    {
      "type" : "record",
      "name" : "twitter_schema",
      "namespace" : "com.miguno.avro",
      "fields" : [ {
        "name" : "username",
        "type" : "string",
        "doc" : "Name of the user account on Twitter.com"
      }, {
        "name" : "tweet",
        "type" : "string",
        "doc" : "The content of the user's Twitter message"
      }, {
        "name" : "timestamp",
        "type" : "long",
        "doc" : "Unix epoch time in seconds"
      } ],
      "doc:" : "A basic schema for storing Twitter messages"
    }

And the JSON file:

twitter.json

    {"username":"miguno","tweet":"Rock: Nerf paper, scissors is fine.","timestamp": 1366150681 }
    {"username":"BlizzardCS","tweet":"Works as intended.  Terran is IMBA.","timestamp": 1366154481 }
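
Note that twitter.json holds one JSON object per line rather than a single JSON document, so calling `json.loads` on the whole content raises an "Extra data" error. As a minimal stdlib-only sketch, the records can be parsed line by line; the helper name `read_json_lines` is hypothetical (it is not part of avro or fastavro), and the two sample records are inlined so the snippet runs without any files:

```python
import json

# The two sample records from twitter.json, inlined so the sketch is self-contained.
SAMPLE = (
    '{"username":"miguno","tweet":"Rock: Nerf paper, scissors is fine.","timestamp": 1366150681 }\n'
    '{"username":"BlizzardCS","tweet":"Works as intended.  Terran is IMBA.","timestamp": 1366154481 }\n'
)


def read_json_lines(lines):
    """Yield one dict per non-empty line (hypothetical helper, not a library API)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Parsing the whole text at once fails because it contains two top-level objects.
try:
    json.loads(SAMPLE)
    whole_file_parses = True
except json.JSONDecodeError:
    whole_file_parses = False

# Parsing line by line gives one record per line.
records = list(read_json_lines(SAMPLE.splitlines()))
```

With a real file, the same helper could be fed an open file object, since iterating over a text file yields its lines.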

You can use something like the following script to write out an Avro file:

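    import json

    from fastavro import json_reader, parse_schema, writer

    # Load the Avro schema (itself a JSON document) and let fastavro parse it.
    with open("twitter.avsc") as fp:
        schema = parse_schema(json.load(fp))

    # json_reader iterates over the records in twitter.json; writer consumes
    # that iterable and writes all records to the binary Avro file in one call.
    with open("twitter.avro", "wb") as avro_file:
        with open("twitter.json") as fp:
            writer(avro_file, schema, json_reader(fp, schema))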
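
If you would rather stay with the official avro package from the question, a rough equivalent is to loop over the JSON lines and call `append` once per record. The sketch below is an assumption-laden illustration, not the library's own recipe: `write_json_lines` is a hypothetical helper, and it only assumes that the writer object has an `append(record)` method, as `DataFileWriter` does in the question's example. The avro-specific wiring is left in comments because it needs the avro package and the input files to be present.

```python
import json


def write_json_lines(json_lines, avro_writer):
    """Parse each non-empty JSON line and append it to avro_writer.

    avro_writer is assumed to be any object exposing append(record), such as
    avro.datafile.DataFileWriter. Returns the number of records appended.
    """
    count = 0
    for line in json_lines:
        if line.strip():
            avro_writer.append(json.loads(line))
            count += 1
    return count


# Intended use with the official avro package (not executed here; assumes avro
# is installed and twitter.avsc / twitter.json are in the working directory).
# Some older Python 3 releases of the package spell the function
# avro.schema.Parse instead of avro.schema.parse.
#
#   import avro.schema
#   from avro.datafile import DataFileWriter
#   from avro.io import DatumWriter
#
#   schema = avro.schema.parse(open("twitter.avsc").read())
#   writer = DataFileWriter(open("twitter.avro", "wb"), DatumWriter(), schema)
#   with open("twitter.json") as src:
#       write_json_lines(src, writer)
#   writer.close()
```

Unlike the fastavro version, this appends records one at a time, but the caller still hands over the whole file in a single call.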
