Create one dataframe from multiple CSV files with different headers in Spark
In Spark, with PySpark, I want to create one dataframe from a path that is actually a folder in S3 containing multiple CSV files with some common columns and some differing columns. To put it more simply, I want a single dataframe built from multiple CSV files with different headers.
For example, I can have one file with the header "raw_id, title, civility", and another file with the header "raw_id, first_name, civility".
This is my code in Python 3:
df = spark.read.load(
    s3_bucket + 'data/contacts/normalized' + '/*/*/*/*',
    format='csv',
    delimiter='|',
    encoding='utf-8',
    header='true',
    quote=''
)
This is an example of file_1.csv:
|raw_id|title|civility|
|1 |M |male |
And an example of file2.csv:
|raw_id|first_name|civility|
|2 |Tom |male |
The result I expect in my dataframe is:
|raw_id|first_name|title|civility|
|1 | |M |male |
|2 |Tom | |male |
But what actually happens is that I get the union of all the columns, yet beyond the first file the data is not in the right columns. Do you know how to do this?
Thank you very much in advance.
You need to load each of them into a different dataframe and join them together on the raw_id column.