PySpark: specifying a default value for a column

Question

I'm using Spark 1.6.1 and Python 2.7.

I'm trying to figure out how to specify a default value for a column newly added to a dataframe. Take this scenario, where I have a dataframe named df1 that contains:

+-------+----+
|user_id| age|
+-------+----+
|  10000|  45|
|  10013|  40|
|  10021|Null|
|  10025|  50|
|  10051|  31|
+-------+----+

Now I want to add a new column called age2 that is simply age + 1:

>>> df1 = df1.withColumn("age2", df1["age"]+1)

+-------+----+----+
|user_id| age|age2|
+-------+----+----+
|  10000|  45|  46|
|  10013|  40|  41|
|  10021|Null|Null|
|  10025|  50|  51|
|  10051|  31|  32|
+-------+----+----+

Is there a way to specify a default value for age2, so that when age is null, age2 gets something like 1 instead of null? That is, so that I get:

+-------+----+----+
|user_id| age|age2|
+-------+----+----+
|  10000|  45|  46|
|  10013|  40|  41|
|  10021|Null|   1|
|  10025|  50|  51|
|  10051|  31|  32|
+-------+----+----+

I know that I can use a UDF to do this, but I want to know if there is a built-in way to do it instead.

Answer 1

I would suggest using the fillna function (df1.na.fill is an alias for df1.fillna). Create the new column as you are doing currently, then fill its null values with fillna:

>>> df1 = df1.withColumn("age2", df1["age"]+1)
>>> df1 = df1.na.fill({'age2': 1})

Answered 2017-03-30 20:23:46
