How to read a txt file and transform it into JSON with Apache Beam in Python

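I have a .txt file that contains a kind of fragmented data. I want to read the file, manipulate it, and restructure it into JSON format. How can I do this in Python with Apache Beam?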

The txt file looks like this.

IDX|99214764|085500|00010541|1|084500|1|ALSX |SG | |00000016325.00|000000000500|000000000500|D|000000006385|00000014400.00|000000004600|00000014425.00|000000000600|000000000c7\\

IDX|70120724|085500|00010542|1|084500|1|IDFL |LG | |00000007100.00|000000000800|000000000800|D|000000006386|00000006625.00|000000010400|00000006650.00|000000027800|0000000'00ff00'0|

I already tried something like the following, but it did not work.

import apache_beam as beam
import re
with beam.Pipeline() as pipe:
    #convert txt to json with beam apache
    header = (pipe
        | 'Read' >> beam.io.ReadFromText('DLS.txt', skip_header_lines=(9))
        | 'Find words' >> beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x))
        | 'beam.Filter' >> beam.Filter(lambda x: x != '|')
        | 'Write' >> beam.io.WriteToText('DESS.json',
                                            file_name_suffix='',
                                            num_shards=1,
                                            shard_name_template=''))

You can definitely use Beam to convert text to JSON, but it is somewhat unclear what you want the JSON output to look like. Assuming your input has the following form

[nine header lines]
apples | red | $1
bananas | yellow | $3
...

you could do something like

import json
import apache_beam as beam

def line_to_dict(line):
    # Split on the pipe delimiter; each line must have exactly three fields.
    name, color, price = line.split('|')
    return {"name": name.strip(), "color": color.strip(), "price": price.strip()}

with beam.Pipeline() as pipe:
    header = (pipe
        | 'Read' >> beam.io.ReadFromText('fruit.txt', skip_header_lines=9)
        | 'ConvertToDict' >> beam.Map(line_to_dict)
        # Serialize each dict as one JSON object per output line.
        | 'FormatAsJson' >> beam.Map(json.dumps)
        | 'Write' >> beam.io.WriteToText('fruit.json',
                                            file_name_suffix='',
                                            num_shards=1,
                                            shard_name_template=''))

This would result in a file that looks like

{"name": "apples", "color": "red", "price": "$1"}
{"name": "bananas", "color": "yellow", "price": "$3"}
...
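
The same idea could be adapted to the pipe-delimited IDX lines from the question. The sketch below is only an illustration: the question does not say what the columns mean, so the keys `field_0`, `field_1`, ... are placeholders, and the function name `idx_line_to_json` is hypothetical. The sample line is a shortened prefix of the first record shown in the question.

```python
import json


def idx_line_to_json(line, delimiter="|"):
    """Split one delimited record and serialize it as a JSON object.

    The keys field_0, field_1, ... are placeholders (an assumption);
    the real column names are not given in the question.
    """
    # Strip the padding around each value, e.g. "ALSX " becomes "ALSX".
    values = [value.strip() for value in line.split(delimiter)]
    # Key each value by its position in the record.
    record = {f"field_{i}": value for i, value in enumerate(values)}
    return json.dumps(record)


# Shortened prefix of the first record from the question.
sample = "IDX|99214764|085500|00010541|1|084500|1|ALSX |SG | |00000016325.00"
print(idx_line_to_json(sample))
```

Inside a pipeline, this function could stand in for the separate `line_to_dict` and `json.dumps` steps as a single `beam.Map(idx_line_to_json)`. Once the real column names are known, the placeholder keys should be replaced with them.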

