Split a PySpark DataFrame column into multiple columns
I have a PySpark DataFrame column with data as below.
Column 1
A1,A2
B1
C1,C2
D2
I have to split the column into two columns on the comma. The output should be as below.
Column 1   Column 2
A1         A2
           B1
C1         C2
           D2
I tried using the split() function, but B1 and D2 end up in Column 1 instead of Column 2. Is there a way to achieve the above output?
Here is one way using split and size:
from pyspark.sql.functions import split, size, col, when

df.withColumn("ar", split(df["Column 1"], ",")) \
  .withColumn("Column 2", when(size(col("ar")) == 1, col("ar")[0])
                          .otherwise(col("ar")[1])) \
  .withColumn("Column 1", when(size(col("ar")) == 2, col("ar")[0])) \
  .drop("ar") \
  .show()
# +--------+--------+
# |Column 1|Column 2|
# +--------+--------+
# | A1| A2|
# | null| B1|
# | C1| C2|
# | null| D2|
# +--------+--------+
First we split Column 1 by comma, then we access the items of the resulting array conditionally: when the array has only one item, that item goes to Column 2; otherwise the first item stays in Column 1 and the second goes to Column 2.
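To see where each piece lands, the same conditional indexing can be sketched in plain Python, without Spark; the helper name `split_row` below is hypothetical, introduced only for illustration:

```python
def split_row(value):
    """Mirror the PySpark logic: a lone value goes to Column 2,
    a comma-separated pair fills both columns."""
    ar = value.split(",")
    col2 = ar[0] if len(ar) == 1 else ar[1]   # when(size == 1, ar[0]).otherwise(ar[1])
    col1 = ar[0] if len(ar) == 2 else None    # when(size == 2, ar[0]) -> null otherwise
    return (col1, col2)

rows = ["A1,A2", "B1", "C1,C2", "D2"]
print([split_row(v) for v in rows])
# → [('A1', 'A2'), (None, 'B1'), ('C1', 'C2'), (None, 'D2')]
```

Each tuple corresponds to one row of the `show()` output above, with `None` playing the role of Spark's null.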