Read csv from zipfile using pyspark
I am trying to read CSV data from a zip file. I know that .gz files are supported natively by spark.read.csv(), but this is a zip file.
I checked the question "How to open/stream .zip files through Spark?" and tried its approach, but I am not sure how to parse the resulting RDD (a whole file of CSV data represented as a single row of text) into a CSV DataFrame.
This is the code used to extract the data into an RDD:
import io
import zipfile

def zip_extract(row):
    # Each element from binaryFiles is a (file_path, content_bytes) pair
    file_path, content = row
    z_file = zipfile.ZipFile(io.BytesIO(content), "r")
    files = z_file.namelist()
    # Return the raw bytes of the first file in the archive
    return z_file.open(files[0]).read()

zips = sc.binaryFiles("/path/to/some/zipfiles.zip")
data_rdd = zips.map(zip_extract)
Passing the RDD to spark.read.csv() does not give the desired outcome.
If you already have an RDD (I am not sure I understand the question correctly), can't you simply call data_rdd.toDF() to convert it to a DataFrame?
df=data_rdd.toDF()
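One caveat: toDF() expects row-like elements (tuples, Rows, dicts), so an RDD holding one bytes blob per zip file will probably need to be split into text lines first. The sketch below is one possible way to do that, not a confirmed fix from the thread. The helper names zip_to_csv_lines and read_zipped_csv are hypothetical, and it assumes the CSV is UTF-8 encoded, has a header row, contains no quoted fields with embedded newlines, and that each archive fits in executor memory. It relies on PySpark's spark.read.csv() accepting an RDD of strings (one CSV row per element) in addition to a path.

```python
import io
import zipfile


def zip_to_csv_lines(record):
    """Turn one (path, bytes) pair from sc.binaryFiles into a list of CSV text lines."""
    _path, content = record
    with zipfile.ZipFile(io.BytesIO(content), "r") as archive:
        # Like the code in the question, only the first member is read
        first_member = archive.namelist()[0]
        text = archive.read(first_member).decode("utf-8")  # assumption: UTF-8
    # splitlines() handles \n and \r\n; it would break quoted multi-line fields
    return text.splitlines()


def read_zipped_csv(spark, path):
    """Hypothetical helper: build a DataFrame from zipped CSV file(s) at `path`."""
    lines_rdd = spark.sparkContext.binaryFiles(path).flatMap(zip_to_csv_lines)
    # spark.read.csv is given an RDD of strings here instead of a file path
    return spark.read.csv(lines_rdd, header=True)
```

With an active SparkSession named spark, usage would look like df = read_zipped_csv(spark, "/path/to/some/zipfiles.zip"). Using flatMap instead of map is the key difference from the question's code: each CSV line becomes its own RDD element rather than the whole file being one element.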