How to read a txt file and restructure it into JSON with Apache Beam in Python
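I have a .txt file that contains fragmented data. I want to read the file, manipulate it, and restructure it into JSON format. How can I do this in Python with Apache Beam?
The txt file looks like this: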
IDX|99214764|085500|00010541|1|084500|1|ALSX |SG | |00000016325.00|000000000500|000000000500|D|000000006385|00000014400.00|000000004600|00000014425.00|000000000600|000000000c7\\
IDX|70120724|085500|00010542|1|084500|1|IDFL |LG | |00000007100.00|000000000800|000000000800|D|000000006386|00000006625.00|000000010400|00000006650.00|000000027800|0000000'00ff00'0|
I have already tried something like this, but it did not work:
import apache_beam as beam
import re

with beam.Pipeline() as pipe:
    # convert txt to json with beam apache
    header = (pipe
              | 'Read' >> beam.io.ReadFromText('DLS.txt', skip_header_lines=(9))
              | 'Find words' >> beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x))
              | 'beam.Filter' >> beam.Filter(lambda x: x != '|')
              | 'Write' >> beam.io.WriteToText('DESS.json',
                                               file_name_suffix='',
                                               num_shards=1,
                                               shard_name_template=''))
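Running the regular expression from the 'Find words' step on a single record, outside Beam, suggests why this attempt does not yield JSON. The sketch below uses a shortened sample record (an assumption made for readability, not a line from the real file): the pattern keeps only alphabetic tokens, so every numeric column is lost, and the following Filter has no effect because '|' is never matched in the first place.

```python
import re

# Shortened sample record in the same pipe-delimited shape as the file above.
sample = "IDX|99214764|085500|ALSX |SG |D|000000006385"

# Same pattern as the 'Find words' step: runs of letters and apostrophes only.
words = re.findall(r"[A-Za-z']+", sample)
print(words)  # ['IDX', 'ALSX', 'SG', 'D']

# '|' is never produced by the pattern, so the later Filter has nothing to drop.
print("|" in words)  # False
```

Because FlatMap emits each match as a separate element, the output file ends up with one bare word per line rather than one structured record per line.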
You can definitely use Beam to convert text to JSON, but it is a bit unclear what you want the JSON output to look like. Suppose your input has the following form:
[nine header lines]
apples | red | $1
bananas | yellow | $3
...
You could then do something like this:
import json

import apache_beam as beam

def line_to_dict(line):
    # Split one 'name | color | price' line into its three fields.
    name, color, price = line.split('|')
    return {"name": name.strip(), "color": color.strip(), "price": price.strip()}

with beam.Pipeline() as pipe:
    header = (pipe
              | 'Read' >> beam.io.ReadFromText('fruit.txt', skip_header_lines=9)
              | 'ConvertToDict' >> beam.Map(line_to_dict)
              | 'FormatAsJson' >> beam.Map(json.dumps)
              | 'Write' >> beam.io.WriteToText('fruit.json',
                                               file_name_suffix='',
                                               num_shards=1,
                                               shard_name_template=''))
This would produce a file that looks like:
{"name": "apples", "color": "red", "price": "$1"}
{"name": "bananas", "color": "yellow", "price": "$3"}
...
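The same idea can be carried over to the pipe-delimited records in the question. The following is a minimal sketch, not a definitive implementation: the question does not say what the columns mean, so the keys `field_0`, `field_1`, and so on are placeholders, and `record_to_json` is a hypothetical helper name. The sample line is the first eleven columns of the first record shown in the question.

```python
import json

def record_to_json(line, delimiter="|"):
    """Split one delimited record and serialise it as a JSON object string."""
    # Strip the whitespace padding from every column, e.g. "ALSX " -> "ALSX".
    values = [value.strip() for value in line.split(delimiter)]
    # Placeholder keys: the real column names are not given in the question.
    record = {f"field_{i}": value for i, value in enumerate(values)}
    return json.dumps(record)

sample = "IDX|99214764|085500|00010541|1|084500|1|ALSX |SG | |00000016325.00"
print(record_to_json(sample))
```

In the pipeline above, a single `beam.Map(record_to_json)` step could stand in for the 'ConvertToDict' and 'FormatAsJson' steps. Once the real column names are known, replacing the generated keys with a fixed list of names would make the output easier to consume.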