
PYSPARK set default value in duplicated column name

In pyspark, I have a dataframe with 10 columns, like this:

id, last_name, first_name, manager, shop, location, manager, place, country, status

I would like to set a default value on only the first `manager` column. I tried:

df.withColumn("manager", "x1")

but it gives me an ambiguous-reference error, since two columns have the same name.

Is there a way to do it without renaming the column?

One workaround is to recreate the dataframe with changed column names. It's always better to have unique column names.

>>> df = spark.createDataFrame([('value1','value2'),('value3','value4')],['manager','manager'])
>>> df.show()
+-------+-------+
|manager|manager|
+-------+-------+
| value1| value2|
| value3| value4|
+-------+-------+
>>> df1 = df.toDF('manager1','manager2')

>>> from pyspark.sql.functions import lit
>>> df1.withColumn('manager1',lit('x1')).show()
+--------+--------+
|manager1|manager2|
+--------+--------+
|      x1|  value2|
|      x1|  value4|
+--------+--------+
