
Create one dataframe from multiple CSV files with different headers in Spark

In Spark, with PySpark, I want to create one dataframe (the path is actually a folder in S3) from multiple CSV files that have some columns in common and some columns that differ. To put it more simply, I want a single dataframe built from multiple CSV files with different headers.

For example, I can have one file with the header "raw_id, title, civility", and another file with the header "raw_id, first_name, civility".

This is my code in Python 3:

df = spark.read.load(
    s3_bucket + 'data/contacts/normalized' + '/*/*/*/*',
    format = 'csv',
    delimiter = '|',
    encoding = 'utf-8',
    header = 'true',
    quote = ''
)

This is an example of file_1.csv:

|raw_id|title|civility|
|1     |M    |male    |

And an example of file2.csv:

|raw_id|first_name|civility|
|2     |Tom       |male    |

The result I expect in my dataframe is:

|raw_id|first_name|title|civility|
|1     |          |M    |male    |
|2     |Tom       |     |male    |

But what happens is that I do get all the columns united, yet the data from the second file onward is not in the right columns. Do you know how to do this?

Thank you very much in advance.

You need to load each of them into a separate dataframe and join them together on the raw_id column.
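A minimal sketch of that approach, based on the example files above. The individual file paths are assumptions (in practice you would read each group of files sharing one header into its own dataframe), and the join also includes civility alongside raw_id so the shared column is not duplicated:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read each file (or each folder of files sharing one header) separately.
df_title = spark.read.csv(
    s3_bucket + 'data/contacts/normalized/file_1.csv',
    sep='|', header=True, encoding='utf-8'
)
df_first_name = spark.read.csv(
    s3_bucket + 'data/contacts/normalized/file2.csv',
    sep='|', header=True, encoding='utf-8'
)

# Full outer join on the shared columns keeps every row from both files;
# columns missing from one file come back as null.
result = df_title.join(df_first_name, on=['raw_id', 'civility'], how='full')
result.show()

With the two sample files, this produces one row per raw_id, with first_name null for the first file's row and title null for the second file's row, which matches the expected output.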
