
Why does PySpark select statement complain about ambiguous columns?

I wrote the following code (Spark 3.2.1) to test how to resolve multiple columns with the same name (spark is the Spark session):

import pyspark.sql.functions as F

data = [['model 1', 10],
        ['model 1', 20],
        ['model 1', 10],
        ['model 2', 11],
        ['model 2', 21],
        ['model 2', 21],
        ]
data = spark.createDataFrame(data, schema=['model', 'capacity_bytes'])
capacity_counts = data.groupby('model', 'capacity_bytes').agg(F.count("*").alias('capacity_occurrence_count'))
capacity_counts_max = capacity_counts.groupby('model').agg(F.max('capacity_occurrence_count').alias('capacity_occurrence_count_max'))
conds = (capacity_counts['model']==capacity_counts_max['model']) & (capacity_counts['capacity_occurrence_count']==capacity_counts_max['capacity_occurrence_count_max'])
# res = capacity_counts.alias('capacity_counts').join(capacity_counts_max.alias('capacity_counts_max'), on=conds)
res = capacity_counts_max.join(capacity_counts, on=conds)

# fails with pyspark.sql.utils.AnalysisException:  Column model#18 are ambiguous
res.select(capacity_counts['model'],'capacity_bytes').show()

# succeeds
res.select(capacity_counts_max['model'],'capacity_bytes').show()

I cannot understand why one of the select statements succeeds and the other one fails. I am aware that I can use an alias for the dataframes, but I still do not understand the above behaviour.

Many thanks for your help.

It works for me. I don't think it should fail, since you are explicitly referring to the model column from capacity_counts with capacity_counts['model']:

res = capacity_counts_max.join(capacity_counts, on=conds)
res.show()
res.select(capacity_counts['model'],'capacity_bytes').show()
# +-------+--------------+
# |  model|capacity_bytes|
# +-------+--------------+
# |model 2|            21|
# |model 1|            10|
# +-------+--------------+


 