
How to read and manipulate a JSON file with Apache Beam in Python

I have a .txt file in JSON format. I want to read it, manipulate it, and restructure it (rename fields, ...). How can I do this in Python with Apache Beam?

To read a JSON file with Apache Beam in Python, you can create a custom coder:

Cf. https://beam.apache.org/documentation/programming-guide/#specifying-coders

import json

class JsonCoder(object):
    """A JSON coder interpreting each line as a JSON string."""

    def encode(self, x):
        # Beam coders are expected to return bytes.
        return json.dumps(x).encode("utf-8")

    def decode(self, x):
        return json.loads(x)

Then you have to specify it when reading or writing the data, for example:

lines = p | 'read_data' >> ReadFromText(known_args.input, coder=JsonCoder())

Best regards, happy coding ;)
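As a quick standalone sanity check of the coder logic above (the class is reproduced here so the snippet runs on its own), a record should survive an encode/decode round trip:

```python
import json

class JsonCoder(object):
    """A JSON coder interpreting each line as a JSON string."""

    def encode(self, x):
        # Beam coders are expected to return bytes.
        return json.dumps(x).encode("utf-8")

    def decode(self, x):
        return json.loads(x)

coder = JsonCoder()
record = {"id": 1, "name": "example"}  # illustrative record
assert coder.decode(coder.encode(record)) == record
```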

Let's say you have sample data like this:

{
    "magic": "atMSG",
    "type": "DT",
    "headers": null,
    "messageschemaid": null,
    "messageschema": null,
    "message": {
        "data": {
            "Column_Name_1": "data_in_quotes",
            "Column_Name_2": "data_in_quotes",
            "Column_Name_n": "data_in_quotes"
        },
        "beforeData": null,
        "headers": {
            "operation": "INSERT",
            "changeSequence": "20200822230048000000000017887787417",
            "timestamp": "2020-08-22T23:00:48.000",
            "streamPosition": "00003EB9_0000000000000006_00000F4D9C6F8AFF01000001000CD387000C00580188000100000F4D9C333900",
            "transactionId": "some_id"
        }
    }
}

And you only want to read the data under: "message": {"data": {"Column_Name_1": "data_in_quotes", "Column_Name_2": "data_in_quotes", "Column_Name_n": "data_in_quotes"}}
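Before wiring this into a pipeline, the extraction itself is plain dictionary access on each parsed line. A minimal sketch with the standard library only (the field values are placeholders mirroring the sample above):

```python
import json

# One line as it would appear in the newline-delimited JSON file.
line = json.dumps({
    "magic": "atMSG",
    "type": "DT",
    "message": {
        "data": {"Column_Name_1": "v1", "Column_Name_2": "v2"},
        "beforeData": None,
    },
})

record = json.loads(line)
data = record["message"]["data"]  # plain dict access, no pandas needed
print(data)  # {'Column_Name_1': 'v1', 'Column_Name_2': 'v2'}
```

This is the same result the pandas-based DoFn below produces; the pipeline version uses json_normalize to flatten nested keys into "message.data"-style column names.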

I used the following code to read this kind of NEWLINE_DELIMITED_JSON and write it to BigQuery:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
import json
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated


class Printer(beam.DoFn):
    def process(self, data_item):
        print(data_item)


def printer(data_item):
    print(data_item)


class custom_json_parser(beam.DoFn):
    def process(self, element):
        # Flatten the record one level, then keep only the "message.data" column.
        norm = json_normalize(element, max_level=1)
        return norm["message.data"].to_list()


table_schema = 'Column_name_1:Data_Type,Column_name_2:Data_Type,Column_name_n:Data_Type'

options = PipelineOptions()
p = beam.Pipeline(options=options)

projectId = 'your_project_id'
datasetId = 'Landing'

data_from_source = (p
                    | "READ FROM JSON" >> ReadFromText("gs://bucket/folder/file_name_having json data")
                    | "PARSE JSON" >> beam.Map(json.loads)
                    | "CUSTOM JSON PARSE" >> beam.ParDo(custom_json_parser())
                    # | "PRINT DATA" >> beam.ParDo(Printer())  # uncomment to see the data in the GCP Dataflow Notebooks console
                    # | WriteToText("gs://ti-project-1/output/", ".txt")  # to write it to a text file instead
                    | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                        "{0}:{1}.table_name".format(projectId, datasetId),
                        schema=table_schema,
                        # write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                        write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
                    )
)

result = p.run()

The code above will:

  1. Read and parse the JSON file
  2. Create the table in GCP BigQuery
  3. Load the data into the table in TRUNCATE mode. For append mode, comment out WRITE_TRUNCATE and uncomment WRITE_APPEND
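For reference, WriteToBigQuery also accepts the schema as a dict rather than a comma-separated string, which is easier to maintain for wide tables. A sketch, where the column names mirror the sample and the STRING types are illustrative placeholders (the Data_Type placeholders above would be real BigQuery types):

```python
# Same schema as the comma-separated string above, in dict form.
# Types and modes here are illustrative assumptions.
table_schema = {
    "fields": [
        {"name": "Column_Name_1", "type": "STRING", "mode": "NULLABLE"},
        {"name": "Column_Name_2", "type": "STRING", "mode": "NULLABLE"},
        {"name": "Column_Name_n", "type": "STRING", "mode": "NULLABLE"},
    ]
}

print(len(table_schema["fields"]))  # 3 columns
```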
