Trying to Merge or Concat two pyspark.sql.dataframe.DataFrame in Databricks Environment
I have two dataframes in Azure Databricks, both of type pyspark.sql.dataframe.DataFrame.
The number of rows is the same and the indexes match, so I thought one of the code snippets below would do the job.
First Attempt:
result = pd.concat([df1, df2], axis=1)
Error Message: TypeError: cannot concatenate object of type "<class 'pyspark.sql.dataframe.DataFrame'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
Second Attempt:
result = pd.merge(df1, df2, left_index=True, right_index=True)
Error Message: TypeError: Can only merge Series or DataFrame objects, a <class 'pyspark.sql.dataframe.DataFrame'> was passed
I ended up converting the two objects to pandas dataframes and then doing the merge the way I already knew how.
Step #1:
df1 = df1.select("*").toPandas()
df2 = df2.select("*").toPandas()
Step #2:
result = pd.concat([df1, df2], axis=1)
Done!
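For completeness, once both objects are pandas DataFrames, Step #2 behaves like this (toy frames standing in for the converted data):

```python
import pandas as pd

# Illustrative stand-ins for the two converted Spark DataFrames
df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"c": [5, 6]})

# axis=1 concatenates column-wise, aligning rows on the shared RangeIndex
result = pd.concat([df1, df2], axis=1)
print(result.columns.tolist())  # ['a', 'b', 'c']
```

Bear in mind that toPandas() collects the entire DataFrame onto the driver, so this route only works when both frames fit in driver memory.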
I faced a similar issue when combining two dataframes with the same columns.
df = pd.concat([df, resultant_df], ignore_index=True)
TypeError: cannot concatenate object of type '<class 'pyspark.sql.dataframe.DataFrame'>'; only Series and DataFrame objs are valid
Then I tried join(), but it appended the columns multiple times and returned an empty dataframe.
df.join(resultant_df)
After that I used union(), which gave the exact result I wanted.
df = df.union(resultant_df)
df.show()
It works fine in my case.