
Read csv from zipfile using pyspark

I am trying to read CSV data from a zip file. I know that .gz files are supported natively by spark.read.csv(), but this is a zip file.

I checked the question "How to open/stream .zip files through Spark?" and tried its approach, but I am not sure how to parse the resulting RDD (a whole file of CSV data represented as a single row of text) into a CSV DataFrame.

This is the code used to extract the data into an RDD:

import zipfile
import io

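def zip_extract(row):
  # each record from sc.binaryFiles is a (file_path, content_bytes) pair
  file_path, content = row
  z_file = zipfile.ZipFile(io.BytesIO(content), "r")
  files = z_file.namelist()
  # return the raw bytes of the first member of the archive
  return z_file.open(files[0]).read()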


zips = sc.binaryFiles("/path/to/some/zipfiles.zip")
data_rdd = zips.map(zip_extract)

Passing the RDD to spark.read.csv() does not give the desired outcome.

Not sure I understand correctly, but if you already have the RDD, isn't it enough to simply call data_rdd.toDF() to convert it into a DataFrame?

df=data_rdd.toDF()
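
Since each element of data_rdd is one whole file as raw bytes rather than a row, one possible alternative is to split every archive into text lines first and hand those lines to spark.read.csv(), which also accepts an RDD of strings. The following is only a sketch under some assumptions: the files are UTF-8 encoded, no quoted field contains a line break, and a SparkSession named spark exists. The helper names zip_to_lines and read_zipped_csv are made up for illustration.

```python
import io
import zipfile


def zip_to_lines(record, encoding="utf-8"):
    """Yield the non-empty text lines of every .csv member in one record.

    `record` is a (file_path, content_bytes) pair as produced by
    sc.binaryFiles. The encoding is an assumption, adjust as needed.
    """
    _path, content = record
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip directories and non-CSV members
            text = archive.read(name).decode(encoding)
            for line in text.splitlines():
                if line.strip():
                    yield line


def read_zipped_csv(spark, path):
    """Hypothetical helper: zipped CSV file(s) at `path` -> DataFrame."""
    # flatMap turns each archive into many elements, one per CSV line
    lines = spark.sparkContext.binaryFiles(path).flatMap(zip_to_lines)
    # spark.read.csv also accepts an RDD of strings, one CSV row each
    return spark.read.csv(lines, header=True, inferSchema=True)
```

It could then be called as df = read_zipped_csv(spark, "/path/to/some/zipfiles.zip"). Keep in mind that binaryFiles loads each archive fully into memory on a single executor, so this is only reasonable for archives of moderate size.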

