
Create a New Column in PySpark Dataframe that Contains Substring of Another Column

I have a dataframe with a column that contains a string value. I need to extract a substring from that column whenever a certain set of characters is present and put it into a new column. I want to do this without filtering, so that I don't lose the other rows. For any row that doesn't contain that specific string value, I want the corresponding new column to read as null. For example, let's say I have the following dataframe:

+---------------------------------------+----------+---------+
|id                                     |compliance|workflow |
+---------------------------------------+----------+---------+
|account/product/rule-id/r-1879bajhdfd80|PASS      |      NEW|
|account/product/rule-id/r-198Hhfu89421s|PASS      |      NEW|
|account/product/test/run/date/YYYYMMDD |FAIL      |      NEW|
+---------------------------------------+----------+---------+

I want to identify the substring 'rule-id', create a new column called 'rule-id', and set the value to null for the rows that don't have that substring. So the output should look like this:

+---------------------------------------+----------+---------+---------------+
|id                                     |compliance|workflow |rule-id        |
+---------------------------------------+----------+---------+---------------+
|account/product/rule-id/r-1879bajhdfd80|PASS      |      NEW|r-1879bajhdfd80|
|account/product/rule-id/r-198Hhfu89421s|PASS      |      NEW|r-198Hhfu89421s|
|account/product/test/run/date/YYYYMMDD |FAIL      |      NEW|null           |
+---------------------------------------+----------+---------+---------------+

I know I can use the substring() function to extract the portion of the string I want, but that will apply to every row, giving me odd rule-id values for the rows that don't match:

from pyspark.sql.functions import substring

df2 = df1.withColumn("rule-id", substring("id", 25, 15))

However, how do I write it so that the substring is extracted only for the rows where 'rule-id' is present in the 'id' value, with the rest getting a null value in the new 'rule-id' column?

If it only needs to handle the described case (so the length of id is not going to change and the pattern will stay the same), you can just add when/otherwise with another substring check:

import pyspark.sql.functions as F

inputData = [
    ("account/product/rule-id/r-1879bajhdfd80", "PASS", "NEW"),
    ("account/product/rule-id/r-198Hhfu89421s", "PASS", "NEW"),
    ("account/product/test/run/date/YYYYMMDD", "FAIL", "NEW"),
]
df1 = spark.createDataFrame(inputData, schema=["id", "compliance", "workflow"])
df1.show()

df2 = df1.withColumn(
    "rule-id",
    # positions 17-23 of the fixed-layout id hold the literal "rule-id";
    # the value itself starts at position 25 and is 15 characters long
    F.when(
        F.substring("id", 17, 7) == F.lit("rule-id"), F.substring("id", 25, 15)
    ).otherwise(None),
)
df2.show()

Output:

+--------------------+----------+--------+---------------+
|                  id|compliance|workflow|        rule-id|
+--------------------+----------+--------+---------------+
|account/product/r...|      PASS|     NEW|r-1879bajhdfd80|
|account/product/r...|      PASS|     NEW|r-198Hhfu89421s|
|account/product/t...|      FAIL|     NEW|           null|
+--------------------+----------+--------+---------------+

If it needs to be more flexible, the first substring check should be replaced with a pattern check.
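
A minimal sketch of that more flexible variant using regexp_extract (my reading of the suggested pattern check, not code from the answer above): the regex captures whatever path segment follows 'rule-id/' in the id, regardless of where it appears or how long the value is. Since regexp_extract returns an empty string when nothing matches, the empty result is mapped back to null with when/otherwise:

import pyspark.sql.functions as F

# Capture the path segment that follows "rule-id/"; regexp_extract returns
# "" (not null) when the pattern does not match, so convert that back to null.
extracted = F.regexp_extract("id", r"rule-id/([^/]+)", 1)
df2 = df1.withColumn(
    "rule-id",
    F.when(extracted != F.lit(""), extracted).otherwise(None),
)
df2.show(truncate=False)

This keeps working if the prefix before 'rule-id' changes length or the rule value is not exactly 15 characters.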
