How to create a new column in PySpark
I have a Spark DataFrame as below:
| a1 | a2 | a3 | a4 |
|---|---|---|---|
| A | 12 | 9 | 1 |
| B | 14 | 13 | 1 |
| C | 7 | 3 | 0 |
I want to create a new column a5 with conditions such as:
- if a1 = 'A' then a5 = 'Car'
- if a2 > 0 then a5 = 'Bus'
- if a3 > 0 and a4 = 1 then a5 = 'Bike'
The desired output should be as below:
| a1 | a2 | a3 | a4 | a5 |
|---|---|---|---|---|
| A | 12 | 9 | 1 | Car |
| A | 12 | 9 | 1 | Bus |
| A | 12 | 9 | 1 | Bike |
| B | 14 | 13 | 1 | Bus |
| B | 14 | 13 | 1 | Bike |
| C | 7 | 3 | 0 | Bus |
Please help on how to add this new column. Thank you in advance.
You can define your own function:
```python
from pyspark.sql import functions as f
from pyspark.sql.types import StringType

def myfunction(a1, a2, a3, a4):
    if a1 == "A":
        return "Car"
    elif a2 > 0:
        return "Bus"
    elif a3 > 0 and a4 == 1:
        return "Bike"

# udf() wraps the function itself; the wrapped UDF is then called on the columns
df = df.withColumn("a5", f.udf(myfunction, StringType())(df.a1, df.a2, df.a3, df.a4))
```
```python
df = spark.createDataFrame(
    [
        ('A', '12', '9', '1'),
        ('B', '14', '13', '1'),
        ('C', '7', '3', '0')
    ],
    ['a1', 'a2', 'a3', 'a4']
)
```
```python
from pyspark.sql.functions import when, lit, col

res = df.withColumn("a5", when(df.a1 == 'A', lit('Car')))\
    .unionByName(df.withColumn("a5", when(df.a2 > 0, lit('Bus'))))\
    .unionByName(df.withColumn("a5", when((df.a3 > 0) & (df.a4 == 1), lit('Bike'))))\
    .filter(col('a5').isNotNull())\
    .orderBy(col('a1').asc())

res.show()
# +---+---+---+---+----+
# | a1| a2| a3| a4| a5|
# +---+---+---+---+----+
# | A| 12| 9| 1| Bus|
# | A| 12| 9| 1| Car|
# | A| 12| 9| 1|Bike|
# | B| 14| 13| 1| Bus|
# | B| 14| 13| 1|Bike|
# | C| 7| 3| 0| Bus|
# +---+---+---+---+----+
```
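This builds a5 by scanning df once per condition and stacking the matches with unionByName, so a row appears once for every condition it satisfies. Since orderBy only sorts by a1, the order of labels within each a1 group is arbitrary (Bus before Car for A above).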
The (probably) best approach performance-wise is to create temporary columns according to your logic, add them to an array, then explode the array to get more rows.
```python
from pyspark.sql import functions as F

(df
    .withColumn('a5_1', F.when(F.col('a1') == 'A', 'Car'))
    .withColumn('a5_2', F.when(F.col('a2') > 0, 'Bus'))
    .withColumn('a5_3', F.when((F.col('a3') > 0) & (F.col('a4') == 1), 'Bike'))
    .withColumn('a5', F.array_except(F.array('a5_1', 'a5_2', 'a5_3'), F.array(F.lit(None))))
    .select(df.columns + [F.explode('a5').alias('a5')])
    .show()
)
```
```
+---+---+---+---+----+
| a1| a2| a3| a4| a5|
+---+---+---+---+----+
| A| 12| 9| 1| Car|
| A| 12| 9| 1| Bus|
| A| 12| 9| 1|Bike|
| B| 14| 13| 1| Bus|
| B| 14| 13| 1|Bike|
| C| 7| 3| 0| Bus|
+---+---+---+---+----+
```
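One caveat: with this sample data every row satisfies at least one condition, but explode drops rows whose array is empty, so a row matching none of the conditions would disappear from the result. If such rows should be kept with a null a5 instead, explode_outer is a drop-in replacement; a sketch with the same temporary columns:

```python
from pyspark.sql import functions as F

# explode_outer keeps rows whose a5 array is empty,
# emitting null for a5 instead of dropping the row
(df
    .withColumn('a5_1', F.when(F.col('a1') == 'A', 'Car'))
    .withColumn('a5_2', F.when(F.col('a2') > 0, 'Bus'))
    .withColumn('a5_3', F.when((F.col('a3') > 0) & (F.col('a4') == 1), 'Bike'))
    .withColumn('a5', F.array_except(F.array('a5_1', 'a5_2', 'a5_3'), F.array(F.lit(None))))
    .select(df.columns + [F.explode_outer('a5').alias('a5')])
    .show()
)
```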