
Create one dataframe from multiple CSV files with different headers in Spark

In Spark, with PySpark, I want to create one dataframe (the path is actually a folder in S3) from multiple CSV files that have some columns in common and some columns that differ. To put it more simply, I want a single dataframe built from multiple CSV files with different headers.

For example, I can have one file with the header "raw_id, title, civility", and another file with the header "raw_id, first_name, civility".

This is my code in Python 3:

df = spark.read.load(
    s3_bucket + 'data/contacts/normalized' + '/*/*/*/*',
    format = 'csv',
    delimiter = '|',
    encoding = 'utf-8',
    header = 'true',
    quote = ''
)

This is an example of file_1.csv:

|raw_id|title|civility|
|1     |M    |male    |

And an example of file2.csv:

|raw_id|first_name|civility|
|2     |Tom       |male    |

The result I expect in my dataframe is:

|raw_id|first_name|title|civility|
|1     |          |M    |male    |
|2     |Tom       |     |male    |

But what happens is that I do get all the columns united, yet the data from the second file onward is not in the right columns. Do you know how to do this?

Thank you very much in advance.

You need to load each of them into a separate dataframe and join them together on the raw_id column.
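A minimal sketch of that approach, based on the example files above. The individual file paths are assumptions (in practice you would read each group of files sharing one header into its own dataframe), and the join also includes civility alongside raw_id so the shared column is not duplicated:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read each file (or each folder of files sharing one header) separately.
df_title = spark.read.csv(
    s3_bucket + 'data/contacts/normalized/file_1.csv',
    sep='|', header=True, encoding='utf-8'
)
df_first_name = spark.read.csv(
    s3_bucket + 'data/contacts/normalized/file2.csv',
    sep='|', header=True, encoding='utf-8'
)

# Full outer join on the shared columns keeps every row from both files;
# columns missing from one file come back as null.
result = df_title.join(df_first_name, on=['raw_id', 'civility'], how='full')
result.show()

With the two sample files, this produces one row per raw_id, with first_name null for the first file's row and title null for the second file's row, which matches the expected output.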
