Does apache_beam (Python SDK) support the .zip compression type?
I am implementing a batch pipeline with Apache Beam that decompresses JSON files, pre-processes them, and stores them back at a given location in the filesystem.
Files may be compressed with either the ZIP or the GZIP algorithm.
Decompression works well for GZIP files, but it fails on ZIP files. After investigating, I found that only the GZIP, BZIP2, and DEFLATE compression types are supported, and only in the Java SDK; no Python implementation exists.
Is there a workaround that does not require patching the Apache Beam Python SDK?
Beam Python does not support ZIP. There are two workarounds: you can read the files in a DoFn, or you can use the Java SDK's file IO via a cross-language transform.
The read-via-DoFn approach would look something like:
filenames
| beam.Map(lambda f: (f, None))
| beam.GroupByKey() # The GroupByKey adds a fusion break so that files can be processed in parallel
| beam.Map(lambda f: f[0])
| beam.FlatMap(lambda f: [line for line in read(f)])
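Here `read(f)` is a placeholder for your own reading logic. A minimal sketch of what it might contain for ZIP archives, using only the standard library, is shown below; `read_zip_lines` is a hypothetical helper, not part of the Beam API. In a real pipeline this logic would live inside a DoFn's `process()` method, and you would open the file through `apache_beam.io.filesystems.FileSystems.open` so that GCS/HDFS paths also work, since `zipfile` needs a seekable file object.

```python
import io
import zipfile


def read_zip_lines(path):
    """Yield each text line from every member file of a ZIP archive.

    This is the logic Beam's built-in decompression lacks for ZIP:
    a .zip file is an archive that can hold multiple members, so each
    member is opened and read line by line.
    """
    with zipfile.ZipFile(path) as archive:
        for member in archive.namelist():
            with archive.open(member) as member_file:
                # Wrap the binary member stream so we can iterate text lines.
                for line in io.TextIOWrapper(member_file, encoding="utf-8"):
                    yield line.rstrip("\n")
```

You would then call this from `beam.FlatMap` (or a DoFn) in place of `read(f)`; because it is a generator, lines are streamed rather than held in memory all at once.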