
Pyspark self-join with error "Resolved attribute(s) missing"

While doing a pyspark dataframe self-join I got an error message:

Py4JJavaError: An error occurred while calling o1595.join.
: org.apache.spark.sql.AnalysisException: Resolved attribute(s) un_val#5997 missing from day#290,item_listed#281,filename#286 in operator !Project [...]. Attribute(s) with the same name appear in the operation: un_val. Please check if the right attribute(s) are used.;;

It is a simple dataframe self-join like the one below. On its own it works fine, but after a couple of operations on the dataframe, such as adding columns or joining with other dataframes, the error above is raised.

df.join(df, on='item_listed')
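
For reference, here is a minimal sketch of the setup that leads to the error (the sample data is hypothetical; the column names item_listed, filename and un_val are taken from the error message above):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical dataframe standing in for the real one
df = spark.createDataFrame(
    [(1, 'a'), (2, 'b')],
    ['item_listed', 'filename'])

# Derived columns like this one are the kind of prior operation
# after which the self-join starts failing
df = df.withColumn('un_val', F.lit(0))

# May raise the AnalysisException above in affected Spark versions
df.join(df, on='item_listed')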

Using dataframe aliases like the one below won't work either; the same error message is raised:

from pyspark.sql.functions import col

df.alias('A').join(df.alias('B'), col('A.my_id') == col('B.my_id'))

I found a Java workaround in SPARK-14948, and for pyspark it looks like this:

# Add a "_r" suffix to the column names
newcols = [c + '_r' for c in df.columns]

# Clone the dataframe with the columns renamed
df2 = df.toDF(*newcols)

# Self-join against the renamed clone
df.join(df2, df.my_column == df2.my_column_r)
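
Putting the workaround together, a self-contained sketch (the SparkSession setup and the sample data with columns my_column and other_col are assumptions for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data; only the rename-before-join pattern matters
df = spark.createDataFrame(
    [(1, 'a'), (2, 'b'), (1, 'c')],
    ['my_column', 'other_col'])

# Rename every column so the two sides of the join share no attribute ids
newcols = [c + '_r' for c in df.columns]
df2 = df.toDF(*newcols)

joined = df.join(df2, df.my_column == df2.my_column_r)
joined.show()

Because toDF produces a dataframe whose columns carry fresh attribute ids, the analyzer no longer confuses the two sides of the join. After the join you can drop the "_r" copy of the join key if it is not needed.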
