Unzipping files with Apache Beam (Python) but when using WriteToText it puts all columns as lines
I am very new to programming and Apache Beam. I am trying to read many zip files from a GCS bucket, unzip them, and save them back to GCS as CSV.
with beam.Pipeline() as pipeline:
    readable_files = (
        pipeline
        | beam.io.fileio.MatchFiles('path/file/patter*.zip')
        | beam.io.fileio.ReadMatches()
        | beam.FlatMap(unzip)
        | beam.combiners.ToList())
    files_and_contents = (
        readable_files
        | beam.io.WriteToText('new', file_name_suffix='.csv'))
And I am unzipping the files with this function:
import zipfile

def unzip(readable_file):
    print(readable_file)
    input_zip = zipfile.ZipFile(readable_file.open())
    # Yields one dict per archive: {member name: raw bytes of that member}
    yield {name: input_zip.read(name) for name in input_zip.namelist()}
I have tested it with only two files, and all lines were written as columns. Here is an example: the header is a column, and all the other lines are columns too.
Try adding skip_header_lines=1 inside beam.io.fileio.ReadMatches().
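Another possible cause is worth checking: WriteToText writes one output line per element of the PCollection. In the pipeline above, unzip yields a single dict per archive and beam.combiners.ToList() then gathers everything into one list, so the whole content ends up as a single element and therefore on a single line. The following is a minimal sketch, not a tested fix for this exact bucket. It assumes the archive members are UTF-8 text, and the names iter_zip_lines and run are hypothetical helpers introduced only for illustration. It yields one element per CSV line and drops ToList:

```python
import io
import zipfile


def iter_zip_lines(file_obj):
    """Yield each text line of each member of a zip archive, one at a time."""
    with zipfile.ZipFile(file_obj) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            with archive.open(name) as member:
                # Assumption: members are UTF-8 encoded text files.
                for line in io.TextIOWrapper(member, encoding="utf-8"):
                    yield line.rstrip("\r\n")


def unzip(readable_file):
    # readable_file is the object emitted by ReadMatches();
    # each yielded string becomes one element, hence one output line.
    yield from iter_zip_lines(readable_file.open())


def run(pattern, output_prefix):
    # Imported here so the helpers above depend only on the standard library.
    import apache_beam as beam
    from apache_beam.io import fileio

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | fileio.MatchFiles(pattern)
            | fileio.ReadMatches()
            | beam.FlatMap(unzip)  # no ToList(): keep one element per line
            | beam.io.WriteToText(output_prefix, file_name_suffix=".csv")
        )
```

Calling run('path/file/patter*.zip', 'new') would mirror the original pipeline. Note that every member's header row is still emitted, and that the order of lines across elements is not guaranteed by Beam, so this sketch only addresses the "everything on one line" symptom.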