
Does apache_beam (python SDK) support .zip compression type

I am implementing a batch pipeline with Apache Beam that decompresses JSON files, pre-processes them, and stores them back at a given location in the filesystem.

Files may be compressed with either the ZIP or the GZIP algorithm.

Decompression works well with GZIP files but fails on ZIP files. After investigating, I found that only the GZIP, BZIP2 and DEFLATE compression types are supported within the Java SDK, and no Python implementation exists.
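One plausible reason for the asymmetry (my interpretation, not a statement from the Beam docs): GZIP, BZIP2 and DEFLATE are single-stream codecs that can be decoded incrementally as bytes arrive, whereas ZIP is an archive format whose central directory sits at the end of the file, so a reader needs random access. The standard library illustrates the difference:

```python
import gzip
import io
import zipfile

payload = b'{"k": 1}\n'

# GZIP compresses one byte stream; it can be decoded front-to-back
# with no seeking, which suits a streaming file source.
gz_bytes = gzip.compress(payload)
assert gzip.decompress(gz_bytes) == payload

# ZIP is an archive of named members; zipfile requires a *seekable*
# object because it must read the central directory at the end first.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data.json", payload)
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    assert zf.read("data.json") == payload
```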

Is there a workaround that does not require patching the Apache Beam Python SDK?

Beam Python does not support ZIP. There are two workarounds: you can read the files in a DoFn, or you can use the Java SDK's file IO via a cross-language transform.

The read-via-DoFn approach would look something like:

filenames
  | beam.Map(lambda f: (f, None))
  | beam.GroupByKey()  # The GroupByKey adds a fusion break so that files can be processed in parallel
  | beam.Map(lambda f: f[0])
  | beam.FlatMap(lambda f: [line for line in read(f)])
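The `read(f)` above is left for the user to supply. For ZIP input it could be a plain generator built on the stdlib `zipfile` module; a minimal sketch, where the helper name and the UTF-8, newline-delimited-content assumptions are mine:

```python
import io
import zipfile


def read_zip_lines(path):
    """Yield text lines from every member of a ZIP archive.

    Hypothetical helper for the FlatMap step above. Assumes members
    contain UTF-8, newline-delimited text. Note that zipfile needs a
    seekable, locally accessible file, so files on remote storage may
    have to be downloaded (or opened via a seekable filesystem) first.
    """
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            with zf.open(name) as member:
                for line in io.TextIOWrapper(member, encoding="utf-8"):
                    yield line.rstrip("\n")
```

In the pipeline sketch this would slot in as `beam.FlatMap(read_zip_lines)`; the GroupByKey fusion break before it lets the runner distribute the per-file work across workers.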
