
PYSPARK set default value in duplicated column name

In pyspark, I have a dataframe with 10 columns, like this:

id, last_name, first_name, manager, shop, location, manager, place, country, status

I would like to set a default value on only the first `manager` column. I tried:

df.withColumn("manager", "x1")

but it gives me an ambiguous-reference error, since two columns have the same name.

Is there a way to do it without renaming the column?

One workaround is to recreate the dataframe with changed column names. It's always better to have unique column names.

>>> df = spark.createDataFrame([('value1','value2'),('value3','value4')],['manager','manager'])
>>> df.show()
+-------+-------+
|manager|manager|
+-------+-------+
| value1| value2|
| value3| value4|
+-------+-------+
>>> df1 = df.toDF('manager1','manager2')

>>> from pyspark.sql.functions import lit
>>> df1.withColumn('manager1',lit('x1')).show()
+--------+--------+
|manager1|manager2|
+--------+--------+
|      x1|  value2|
|      x1|  value4|
+--------+--------+
