
Trying to Merge or Concat two pyspark.sql.dataframe.DataFrame in Databricks Environment

I have two dataframes in Azure Databricks. Both are of type: pyspark.sql.dataframe.DataFrame

The number of rows is the same, and the indexes are the same. I thought one of the code snippets below would do the job.

First Attempt:

result = pd.concat([df1, df2], axis=1)


Error Message: TypeError: cannot concatenate object of type "<class 'pyspark.sql.dataframe.DataFrame'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid

Second Attempt:

result = pd.merge(df1, df2, left_index=True, right_index=True)

Error Message:  TypeError: Can only merge Series or DataFrame objects, a <class 'pyspark.sql.dataframe.DataFrame'> was passed

I ended up converting the two objects to pandas dataframes and then did the merge using the technique I know how to use.

Step #1:

df1 = df1.select("*").toPandas()
df2 = df2.select("*").toPandas()

Step #2:

result = pd.concat([df1, df2], axis=1)

Done!
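As a toy illustration of why this works (the data below is made up, not the poster's), once both objects are pandas DataFrames, `pd.concat` with `axis=1` places the columns side by side, aligning rows on the shared index:

```python
import pandas as pd

# Toy stand-ins for the two converted dataframes (made-up values).
df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
df2 = pd.DataFrame({"score": [10, 20, 30]})

# axis=1 places the columns side by side, aligning rows on the shared index.
result = pd.concat([df1, df2], axis=1)
print(list(result.columns))
```

This only lines up cleanly because the question states that the row counts and indexes match; with mismatched indexes, `pd.concat` would introduce NaN rows instead.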

I faced a similar issue when combining two dataframes with the same columns.

df = pd.concat([df, resultant_df], ignore_index=True)
TypeError: cannot concatenate object of type '<class 'pyspark.sql.dataframe.DataFrame'>'; only Series and DataFrame objs are valid

Then I tried join(), but it appended the columns multiple times and returned an empty dataframe.

df.join(resultant_df)

After that I used union(), which gave the exact result.

df = df.union(resultant_df)
df.show()

It works fine in my case.
