
How to join 3 DataFrames into one DataFrame?

I am trying to join three different DataFrames into one, but I am having trouble joining all three. I have been able to join two.

How can I join three DataFrames correctly?

PySpark: 1.6.0

Below is my working so far:

# EXPECTED OUTPUT:

# -------file1.csv---------|---file2.csv--|---file3.csv------------|
# |col1|col2|col3|col4|col5|col1|col2|col3|col1|col2|col3|col4|col5|

# Loading in all the files
file1_rdd = sc.textFile("file1.csv").map(lambda line: line.split(","))
file2_rdd = sc.textFile("file2.csv").map(lambda line: line.split(","))
file3_rdd = sc.textFile("file3.csv").map(lambda line: line.split(","))

# Capturing the header
file1_header = file1_rdd.first()
file2_header = file2_rdd.first()
file3_header = file3_rdd.first()

# Removing the header from the table rows
df_file1 = file1_rdd.filter(lambda row: row != file1_header).toDF(file1_header)
df_file2 = file2_rdd.filter(lambda row: row != file2_header).toDF(file2_header)
df_file3 = file3_rdd.filter(lambda row: row != file3_header).toDF(file3_header)

# WORKS: df_file1.join(df_file2, df_file1.col1 == df_file2.col2)

# OUTPUT:
# -------file1.csv---------|---file2.csv--|
# |col1|col2|col3|col4|col5|col1|col2|col3|

# DOES NOT WORK: df_file1.join(df_file2, df_file1.col1 == df_file2.col2).join(df_file3, df_file2.col2 == df_file3.col2)

# OUTPUT:
# Caused by: java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 4 fields are required while 5 values are provided.
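The mismatch in the error can be reproduced outside Spark: if a row contains an unquoted comma inside a field, a naive comma split produces more values than the header has columns. A minimal pure-Python sketch (the header and row contents here are hypothetical, for illustration only):

```python
# Hypothetical 4-column header, matching the "4 fields are required" in the error.
header = "col1,col2,col3,col4".split(",")

clean_row = "a,b,c,d".split(",")        # 4 values: matches the schema
dirty_row = "a,b,c,d,extra".split(",")  # 5 values: an unquoted comma in the data

print(len(header))     # 4
print(len(clean_row))  # 4
print(len(dirty_row))  # 5 -> "4 fields are required while 5 values are provided"
```

The join itself is not at fault; `toDF(header)` fixes the schema width from the header row, and any data row that splits into a different number of values violates that schema when Spark finally evaluates it.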

Why does the error say 4 fields are required, when I can join the first two DataFrames without that error even though their column counts differ?

The issue was that file3.csv contained data which was not properly sanitised. To fix it, I simply enforced a maximum number of splits, like so:

file3_rdd = sc.textFile("file3.csv").map(lambda line: line.split(",", 3))
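The second argument to Python's `str.split` is `maxsplit`: with `maxsplit=3` each line is split at most 3 times, yielding at most 4 fields, so any commas beyond the third stay inside the last field instead of creating extra columns. A quick illustration (the line content is hypothetical):

```python
line = "a,b,c,d,e"  # hypothetical row with a stray comma in the last field

print(line.split(","))     # ['a', 'b', 'c', 'd', 'e']  -> 5 fields, breaks the schema
print(line.split(",", 3))  # ['a', 'b', 'c', 'd,e']     -> capped at 4 fields
```

Note this only works when the stray commas are confined to the final column; commas in earlier columns would still misalign the fields.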

To anybody reading who may have faced a similar issue: first check that you can view each table independently without errors. Running df_file3.show() returned the same error, and would have helped me spot the problem much faster.
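A cheap way to catch this kind of problem before any join is to compare each row's field count against the header. A minimal pure-Python sketch (the `find_bad_rows` helper and the sample lines are hypothetical, not part of the original code):

```python
def find_bad_rows(lines):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    header = lines[0].split(",")
    return [(i, len(line.split(",")))
            for i, line in enumerate(lines[1:], start=2)
            if len(line.split(",")) != len(header)]

sample = [
    "col1,col2,col3,col4",  # header: 4 columns
    "a,b,c,d",              # OK
    "a,b,c,d,extra",        # unsanitised: 5 values
]
print(find_bad_rows(sample))  # [(3, 5)]
```

Running a check like this on each file up front pinpoints the offending rows, rather than waiting for a schema error deep inside a multi-way join.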
