How to join 3 DataFrames into one DataFrame?
I am trying to join three different DataFrames together into one, but I am having trouble joining all three. I have been able to join two.
How can I join three DataFrames correctly?
PySpark: 1.6.0
Below is my working so far:
# EXPECTED OUTPUT:
# -------file1.csv---------|---file2.csv--|---file3.csv------------|
# |col1|col2|col3|col4|col5|col1|col2|col3|col1|col2|col3|col4|col5|
# Loading in all the files
file1_rdd = sc.textFile("file1.csv").map(lambda line: line.split(","))
file2_rdd = sc.textFile("file2.csv").map(lambda line: line.split(","))
file3_rdd = sc.textFile("file3.csv").map(lambda line: line.split(","))
# Capturing the header
file1_header = file1_rdd.first()
file2_header = file2_rdd.first()
file3_header = file3_rdd.first()
# Removing the header from the table rows
df_file1 = file1_rdd.filter(lambda row: row != file1_header).toDF(file1_header)
df_file2 = file2_rdd.filter(lambda row: row != file2_header).toDF(file2_header)
df_file3 = file3_rdd.filter(lambda row: row != file3_header).toDF(file3_header)
# WORKS: df_file1.join(df_file2, df_file1.col1 == df_file2.col2)
# OUTPUT:
# -------file1.csv---------|---file2.csv--|
# |col1|col2|col3|col4|col5|col1|col2|col3|
# DOES NOT WORK: df_file1.join(df_file2, df_file1.col1 == df_file2.col2).join(df_file3, df_file2.col2 == df_file3.col2)
# OUTPUT:
# Caused by: java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 4 fields are required while 5 values are provided.
Why does the three-way join raise an error saying 4 fields are required, when I can join the first two DataFrames without that error even though the files do not have the same number of columns?
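The 5-values-vs-4-fields mismatch in the error can be reproduced without Spark: a plain `line.split(",")` yields one extra value whenever a field itself contains a comma, and `toDF` then rejects such a row against the header's schema. A minimal sketch (the header names and sample line below are hypothetical, not taken from file3.csv):

```python
# Hypothetical 4-column header, matching the "4 fields are required" error.
header = ["col1", "col2", "col3", "col4"]

# A row whose last field contains an embedded comma (unsanitised data).
bad_line = 'a,b,c,"d, with comma"'

# Naive splitting breaks the quoted field apart: 5 values, not 4.
values = bad_line.split(",")
print(len(header), len(values))  # 4 5
```

Joining df_file1 and df_file2 never evaluated the bad rows, which is why the error only surfaced once the plan forced file3's rows through its schema.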
The issue was that file3.csv contained data which was not properly sanitised. In order to fix it I simply enforced a maximum split like so:
file3_rdd = sc.textFile("file3.csv").map(lambda line: line.split(",", 3))
To anybody reading who may have faced a similar issue: check that you can first view the tables independently without errors. Doing df_file3.show() returned the same error and would have helped me see the issue much faster.
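The maxsplit argument of str.split caps the number of separators consumed, so every row collapses to at most 4 fields regardless of stray commas in the final column. A minimal sketch of the effect (the sample line is hypothetical):

```python
bad_line = 'a,b,c,"d, with comma"'

# split(",", 3) performs at most 3 splits, i.e. produces at most 4 fields.
capped = bad_line.split(",", 3)
print(capped)       # ['a', 'b', 'c', '"d, with comma"']
print(len(capped))  # 4
```

Note this only papers over the problem: the embedded comma survives inside the last field. CSVs with quoted fields are better handled by a real parser, e.g. Python's csv module or the spark-csv package.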