
Split PySpark DataFrame column into multiple columns

I have a PySpark DataFrame column with data as below.

Column 1
A1,A2
B1
C1,C2
D2

I have to split the column into 2 columns on the comma. The output should be as below.

Column 1    Column 2
A1          A2
            B1
C1          C2
            D2

I tried using the split() function, but my B1 and D2 get populated in Column 1 instead of Column 2. Is there a way to achieve the output above?
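For reference, the input can be recreated with a minimal DataFrame like this (an assumed setup; the question does not show how df is built):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# assumed reconstruction of the sample data shown in the question
df = spark.createDataFrame(
    [("A1,A2",), ("B1",), ("C1,C2",), ("D2",)],
    ["Column 1"],
)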

Here is one way using split and size:

from pyspark.sql.functions import split, size, col, when

(df.withColumn("ar", split(df["Column 1"], ","))          # split into an array column
   # a lone item belongs in Column 2; otherwise take the second element
   .withColumn("Column 2", when(size(col("ar")) == 1, col("ar")[0])
                           .otherwise(col("ar")[1]))
   # when() without otherwise() yields null, so Column 1 is null for single-item rows
   .withColumn("Column 1", when(size(col("ar")) == 2, col("ar")[0]))
   .drop("ar")
   .show())

# +--------+--------+
# |Column 1|Column 2|
# +--------+--------+
# |      A1|      A2|
# |    null|      B1|
# |      C1|      C2|
# |    null|      D2|
# +--------+--------+

First we split Column 1 on the comma, then we access the items of the resulting array conditionally: when there is only one item it goes to Column 2, and Column 1 is left null.
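A slightly shorter variant (a sketch, assuming Spark 2.4+ where element_at accepts a negative index) always takes the last array element for Column 2, so the single-item case falls out naturally:

from pyspark.sql.functions import split, size, col, when, element_at

(df.withColumn("ar", split(col("Column 1"), ","))
   # the last element is the Column 2 value whether the array has 1 or 2 items
   .withColumn("Column 2", element_at(col("ar"), -1))
   # keep the first element only when there are two items; otherwise null
   .withColumn("Column 1", when(size(col("ar")) == 2, col("ar")[0]))
   .drop("ar")
   .show())

This produces the same output table as the answer's version, just with one fewer branch in the Column 2 expression.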
