
Mixed Schema CSV Import Pyspark

I have a folder of CSV files that I want to read into a DataFrame. The problem is that while all of them contain the set of columns I need, some also contain other columns. So for every CSV in the folder, I want to read in only the common set of columns that I need.

For example:

Sheet 1 contains the columns:

Column 1, Column 2, Column 3, X

Sheet 2 contains the columns:

Column 1, Column 2, Column 3

I only need Column 1, Column 2, and Column 3. Is it possible to take care of that on read, or do I need to read the files in separately, select the appropriate columns, and then append them together?

Try looping over all the files in the directory and selecting only the required columns from each file as it is read.

Example:

# Assumes an active SparkSession named `spark`
from pyspark.sql.types import StructType, StructField, StringType

# Paths of the CSV files to read
file_lst = ['<path1>', '<path2>']

# Schema containing only the required columns
schema = StructType([
    StructField("column1", StringType(), True),
    StructField("column2", StringType(), True),
])

# Start from an empty DataFrame with that schema
df = spark.createDataFrame([], schema)

# Read each file using its own header row, keep only the required
# columns, and append the rows to the combined DataFrame
for path in file_lst:
    tmp_df = spark.read.option("header", "true").csv(path).select("column1", "column2")
    df = df.union(tmp_df)

# Display the results
df.show()
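
The loop above can also be written without the empty starting DataFrame. The following is only a sketch of that variant: the function name `read_selected_columns` is made up for illustration, and it assumes `spark` is an existing SparkSession and that every file has a header row containing the requested columns. It uses `unionByName`, which matches columns by name rather than by position.

```python
from functools import reduce


def read_selected_columns(spark, paths, columns):
    """Read each CSV separately, keep only `columns`, and union the results.

    `spark` is assumed to be an existing SparkSession; the helper name and
    signature are illustrative, not part of PySpark.
    """
    paths = list(paths)
    if not paths:
        raise ValueError("at least one path is required")

    # Each file is parsed with its own header, so extra columns in some
    # files do not affect the others.
    frames = [
        spark.read.option("header", "true").csv(path).select(*columns)
        for path in paths
    ]

    # Combine the per-file DataFrames, matching columns by name.
    return reduce(lambda left, right: left.unionByName(right), frames)
```

It would be called as `df = read_selected_columns(spark, file_lst, ["column1", "column2"])`, with `file_lst` as defined in the example above.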

If the required columns (column1, column2, column3, and so on) appear in the same order in every file in the directory, you can instead read the whole directory at once:

spark.read.option("header","true").csv("<directory>").select("column1","column2","column3").show()
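
Both approaches assume the required column names are known in advance. If they are not, one option is to work them out from the header rows before calling Spark. The sketch below uses only Python's standard `csv` module; the helper name `common_columns` is hypothetical, and it assumes the files are UTF-8 and sit on a local filesystem that plain `open()` can read (not HDFS or object storage).

```python
import csv


def common_columns(paths):
    """Return the column names found in every file's header row.

    The result keeps the column order of the first file. Assumes local,
    UTF-8 encoded CSV files whose first row is the header.
    """
    headers = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            # Read only the first row; an empty file yields an empty header.
            headers.append(next(csv.reader(handle), []))

    if not headers:
        return []

    # Keep the names that appear in all headers.
    shared = set(headers[0]).intersection(*headers[1:])
    return [name for name in headers[0] if name in shared]
```

The returned list can then be passed to `select`, for example `.select(*common_columns(file_lst))`.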
