Reading CSV from a zip file using PySpark

Question

I am trying to read CSV data from a zip file. I know that .gz files are supported natively by spark.read.csv(), but this is a zip file.

I checked the question "How to open/stream .zip files through Spark?" and tried its approach, but I am not sure how to parse the resulting RDD (an entire file of CSV data represented as a single row of text) into a CSV DataFrame.

This is the code used to extract the data into an RDD:

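import io
import zipfile

def zip_extract(row):
    # Each element from sc.binaryFiles is a (file_path, content_bytes) pair.
    file_path, content = row
    z_file = zipfile.ZipFile(io.BytesIO(content), "r")
    files = z_file.namelist()
    # Return the raw bytes of the first entry in the archive.
    return z_file.open(files[0]).read()


# sc is the active SparkContext.
zips = sc.binaryFiles("/path/to/some/zipfiles.zip")
data_rdd = zips.map(zip_extract)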

Passing the RDD to spark.read.csv() does not give the desired outcome.

Answer 1

If you already have an RDD (I am not sure I understood correctly), can't you simply call data_rdd.toDF() to convert it to a DataFrame?

df = data_rdd.toDF()
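
Since each element of data_rdd is a single bytes object rather than a row, toDF() is likely to fail at schema inference. One possible alternative is sketched below, under a few assumptions: the first entry of each archive is a UTF-8 CSV file with a header row, and no quoted field contains an embedded newline (splitting on lines would break those). The names zip_to_csv_lines and read_zipped_csv are hypothetical helpers, not part of PySpark. The sketch relies on spark.read.csv() accepting an RDD of strings, one CSV row per element, so the extracted bytes are decoded and split into lines before being handed to the reader.

```python
import io
import zipfile


def zip_to_csv_lines(record):
    """Turn one (path, bytes) pair from sc.binaryFiles into a list of CSV lines."""
    _path, content = record
    with zipfile.ZipFile(io.BytesIO(content), "r") as archive:
        # Assumption: the first entry in the archive is the CSV file.
        first_name = archive.namelist()[0]
        # Assumption: the CSV is UTF-8 encoded.
        text = archive.read(first_name).decode("utf-8")
    return text.splitlines()


def read_zipped_csv(spark, path):
    """Hypothetical helper: build a DataFrame from zipped CSV file(s) at `path`."""
    # flatMap yields one RDD element per CSV line instead of one per archive.
    lines_rdd = spark.sparkContext.binaryFiles(path).flatMap(zip_to_csv_lines)
    # spark.read.csv also accepts an RDD of strings holding CSV rows.
    return spark.read.csv(lines_rdd, header=True, inferSchema=True)
```

With an active SparkSession named spark, this would be called as df = read_zipped_csv(spark, "/path/to/some/zipfiles.zip"). If the path matches several archives, each one contributes its own header line, which may need separate handling. Each archive is also decompressed fully in memory on a single executor, so this suits moderately sized files.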

Solution 1
1 2019-07-22 21:38:26