Read a multiline JSON file (non-JSONL) with Apache Beam in Python
How to read and manipulate a JSON file with Apache Beam in Python
I have a .txt file in JSON format. I want to read the file, manipulate it, and restructure it (rename fields, and so on). How can I do this with Apache Beam in Python?
To read a JSON file with Apache Beam in Python, you can create a custom coder:
See: https://beam.apache.org/documentation/programming-guide/#specifying-coders
class JsonCoder(object):
    """A JSON coder interpreting each line as a JSON string."""

    def encode(self, x):
        # Beam expects a coder's encode() to return bytes.
        return json.dumps(x).encode("utf-8")

    def decode(self, x):
        return json.loads(x)
Then specify it when reading or writing data, for example:
lines = p | 'read_data' >> ReadFromText(known_args.input, coder=JsonCoder())
Best regards, and happy coding ;)
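As a quick sanity check outside Beam, the coder should round-trip a record. This is a standalone sketch (the record values are made up for illustration); note the `.encode("utf-8")`, since Beam's coder interface expects `encode` to return bytes:

```python
import json

# Self-contained copy of the coder for a round-trip check.
class JsonCoder(object):
    """A JSON coder interpreting each line as a JSON string."""

    def encode(self, x):
        return json.dumps(x).encode("utf-8")  # Beam coders must emit bytes

    def decode(self, x):
        return json.loads(x)

coder = JsonCoder()
record = {"name": "product_a", "price": 9.99}
assert coder.decode(coder.encode(record)) == record
```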
Let's assume you have sample data like this:
{
    "magic": "atMSG",
    "type": "DT",
    "headers": null,
    "messageschemaid": null,
    "messageschema": null,
    "message": {
        "data": {
            "Column_Name_1": "data_in_quotes",
            "Column_Name_2": "data_in_quotes",
            "Column_Name_n": "data_in_quotes"
        },
        "beforeData": null,
        "headers": {
            "operation": "INSERT",
            "changeSequence": "20200822230048000000000017887787417",
            "timestamp": "2020-08-22T23:00:48.000",
            "streamPosition": "00003EB9_0000000000000006_00000F4D9C6F8AFF01000001000CD387000C00580188000100000F4D9C333900",
            "transactionId": "some_id"
        }
    }
}
and you only want to read the nested payload from "message": {"data": {"Column_Name_1": "data_in_quotes", "Column_Name_2": "data_in_quotes", "Column_Name_n": "data_in_quotes"}}.
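In plain Python, extracting that nested payload from a parsed line is just two dictionary lookups. A minimal sketch, with made-up column values standing in for the sample data:

```python
import json

# Hypothetical one-line record mirroring the sample above.
line = json.dumps({
    "magic": "atMSG",
    "message": {
        "data": {"Column_Name_1": "v1", "Column_Name_2": "v2"},
        "headers": {"operation": "INSERT"},
    },
})

record = json.loads(line)
row = record["message"]["data"]  # keep only the nested payload
assert row == {"Column_Name_1": "v1", "Column_Name_2": "v2"}
```

The `json_normalize` approach in the pipeline below does the same thing, but flattens the record first and selects the `message.data` column.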
I use the following code to read this kind of NEWLINE_DELIMITED_JSON and write it to BigQuery:
import json

import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated


class Printer(beam.DoFn):
    def process(self, data_item):
        print(data_item)


def printer(data_item):
    print(data_item)


class CustomJsonParser(beam.DoFn):
    """Flattens each record and keeps only the nested message.data payload."""

    def process(self, element):
        norm = json_normalize(element, max_level=1)
        return norm["message.data"].to_list()


table_schema = 'Column_name_1:Data_Type,Column_name_2:Data_Type,Column_name_n:Data_Type'

options = PipelineOptions()
p = beam.Pipeline(options=options)

projectId = 'your_project_id'
datasetId = 'Landing'

data_from_source = (
    p
    | "READ FROM JSON" >> ReadFromText("gs://bucket/folder/file_name_having json data")
    | "PARSE JSON" >> beam.Map(json.loads)
    | "CUSTOM JSON PARSE" >> beam.ParDo(CustomJsonParser())
    # | "PRINT DATA" >> beam.ParDo(Printer())  # uncomment to see data in the Dataflow Notebooks console
    # | WriteToText("gs://ti-project-1/output/", ".txt")  # or write the output to a text file
    | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
        "{0}:{1}.table_name".format(projectId, datasetId),
        schema=table_schema,
        # write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    )
)

result = p.run()
The code above reads each JSON line from the source file, extracts the message.data fields, and writes the resulting rows to BigQuery.
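Note that ReadFromText splits its input on newlines, so it only works when each record sits on one line (NEWLINE_DELIMITED_JSON). For a pretty-printed multiline JSON document, as in the original question's title, each file must be read whole before parsing. A sketch of that idea outside Beam (in Beam itself, `apache_beam.io.fileio.MatchFiles` plus `fileio.ReadMatches` gives the same whole-file access at scale):

```python
import json
import os
import tempfile

def load_whole_json(path):
    """Parse an entire (possibly multiline) JSON file as one document."""
    with open(path, "r", encoding="utf-8") as f:
        return json.loads(f.read())

# Demo with a temporary pretty-printed file (hypothetical content).
doc = {"message": {"data": {"Column_Name_1": "v1"}}}
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False
) as f:
    json.dump(doc, f, indent=2)  # indent=2 makes the file multiline
    path = f.name

loaded = load_whole_json(path)
os.unlink(path)
assert loaded == doc
```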