How to join 3 DataFrames into one DataFrame?
I am trying to join three different DataFrames together into one, but I am having trouble joining all three. I have been able to join two.
How can I join three DataFrames correctly?
PySpark: 1.6.0
Below is my working so far:
# EXPECTED OUTPUT:
# -------file1.csv---------|---file2.csv--|---file3.csv------------|
# |col1|col2|col3|col4|col5|col1|col2|col3|col1|col2|col3|col4|col5|
# Loading in all the files
file1_rdd = sc.textFile("file1.csv").map(lambda line: line.split(","))
file2_rdd = sc.textFile("file2.csv").map(lambda line: line.split(","))
file3_rdd = sc.textFile("file3.csv").map(lambda line: line.split(","))
# Capturing the header
file1_header = file1_rdd.first()
file2_header = file2_rdd.first()
file3_header = file3_rdd.first()
# Removing the header from the table rows
df_file1 = file1_rdd.filter(lambda row: row != file1_header).toDF(file1_header)
df_file2 = file2_rdd.filter(lambda row: row != file2_header).toDF(file2_header)
df_file3 = file3_rdd.filter(lambda row: row != file3_header).toDF(file3_header)
# WORKS: df_file1.join(df_file2, df_file1.col1 == df_file2.col2)
# OUTPUT:
# -------file1.csv---------|---file2.csv--|
# |col1|col2|col3|col4|col5|col1|col2|col3|
# DOES NOT WORK: df_file1.join(df_file2, df_file1.col1 == df_file2.col2).join(df_file3, df_file2.col2 == df_file3.col2)
# OUTPUT:
# Caused by: java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 4 fields are required while 5 values are provided.
Why does the three-way join raise an error saying 4 fields are required, when I can join the first two DataFrames without that error even though the files do not have the same number of columns?
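The 5-values-vs-4-fields mismatch in the error can be reproduced without Spark: a plain `line.split(",")` yields one extra value whenever a field itself contains a comma, and `toDF` then rejects such a row against the header's schema. A minimal sketch (the header names and sample line below are hypothetical, not taken from file3.csv):

```python
# Hypothetical 4-column header, matching the "4 fields are required" error.
header = ["col1", "col2", "col3", "col4"]

# A row whose last field contains an embedded comma (unsanitised data).
bad_line = 'a,b,c,"d, with comma"'

# Naive splitting breaks the quoted field apart: 5 values, not 4.
values = bad_line.split(",")
print(len(header), len(values))  # 4 5
```

Joining df_file1 and df_file2 never evaluated the bad rows, which is why the error only surfaced once the plan forced file3's rows through its schema.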
The issue was that file3.csv contained data which was not properly sanitised. In order to fix it I simply enforced a maximum split like so:
file3_rdd = sc.textFile("file3.csv").map(lambda line: line.split(",", 3))
To anybody reading who may have faced a similar issue: check that you can first view the tables independently without errors. Doing df_file3.show() returned the same error and would have helped me see the issue much faster.
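The maxsplit argument of str.split caps the number of separators consumed, so every row collapses to at most 4 fields regardless of stray commas in the final column. A minimal sketch of the effect (the sample line is hypothetical):

```python
bad_line = 'a,b,c,"d, with comma"'

# split(",", 3) performs at most 3 splits, i.e. produces at most 4 fields.
capped = bad_line.split(",", 3)
print(capped)       # ['a', 'b', 'c', '"d, with comma"']
print(len(capped))  # 4
```

Note this only papers over the problem: the embedded comma survives inside the last field. CSVs with quoted fields are better handled by a real parser, e.g. Python's csv module or the spark-csv package.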