PySpark: set default value in duplicated column name
In PySpark, I have a dataframe with 10 columns like this:
id, last_name, first_name, manager, shop, location, manager, place, country, status
I would like to set a default value only on the first `manager` column. I've tried:
df.withColumn("manager", "x1")
but it raises an ambiguous-reference error because two columns share the same name.
Is there a way to do it without renaming the column?
One workaround is to recreate the dataframe with new column names. It's always better to have unique column names.
>>> df = spark.createDataFrame([('value1','value2'),('value3','value4')],['manager','manager'])
>>> df.show()
+-------+-------+
|manager|manager|
+-------+-------+
| value1| value2|
| value3| value4|
+-------+-------+
>>> df1 = df.toDF('manager1','manager2')
>>> from pyspark.sql.functions import lit
>>> df1.withColumn('manager1',lit('x1')).show()
+--------+--------+
|manager1|manager2|
+--------+--------+
| x1| value2|
| x1| value4|
+--------+--------+