I have a dataframe with a column that contains a string value. Whenever a certain set of characters is present in that column, I need to extract a substring from it and put that into a new column. I want to do this without filtering, so that I don't lose the other rows. For any row that doesn't contain that specific string value, I want the corresponding new column to be null. For example, let's say I have the following dataframe:
+---------------------------------------+----------+---------+
|id |compliance|workflow |
+---------------------------------------+----------+---------+
|account/product/rule-id/r-1879bajhdfd80|PASS | NEW|
|account/product/rule-id/r-198Hhfu89421s|PASS | NEW|
|account/product/test/run/date/YYYYMMDD |FAIL | NEW|
+---------------------------------------+----------+---------+
I want to be able to identify the substring 'rule-id' and create a new column called 'rule-id' and for the rows that don't have that substring I want the value to be null. So for example the output should look like this:
+---------------------------------------+----------+---------+---------------+
|id |compliance|workflow |rule-id |
+---------------------------------------+----------+---------+---------------+
|account/product/rule-id/r-1879bajhdfd80|PASS | NEW|r-1879bajhdfd80|
|account/product/rule-id/r-198Hhfu89421s|PASS | NEW|r-198Hhfu89421s|
|account/product/test/run/date/YYYYMMDD |FAIL | NEW|null |
+---------------------------------------+----------+---------+---------------+
I know I can use the substring() function to extract the portion of the string I want, but that applies to every row, which gives me some odd rule-id values for the rows that don't match.
df2 = df1.withColumn("rule-id", substring("id", 25, 15))
However, how do I write it so that if the 'rule-id' substring is present in the 'id' string value it extracts the substring I am looking for but only for those rows and the rest will get a "null" value for the new 'rule-id' column?
If it only needs to handle the described case (so the length of the id is not going to change and the pattern will always be similar), you can just add when/otherwise with another substring check:
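import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Reuse the active session if there is one (e.g. in a notebook or shell)
spark = SparkSession.builder.getOrCreate()

inputData = [
    ("account/product/rule-id/r-1879bajhdfd80", "PASS", "NEW"),
    ("account/product/rule-id/r-198Hhfu89421s", "PASS", "NEW"),
    ("account/product/test/run/date/YYYYMMDD", "FAIL", "NEW"),
]
df1 = spark.createDataFrame(inputData, schema=["id", "compliance", "workflow"])
df1.show()

# Take the 15 characters starting at position 25 only when positions 17-23
# spell "rule-id"; every other row gets null.
df2 = df1.withColumn(
    "rule-id",
    F.when(
        F.substring("id", 17, 7) == F.lit("rule-id"), F.substring("id", 25, 15)
    ).otherwise(None),
)
# show() returns None, so call it separately to keep df2 as a DataFrame
df2.show()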
output
+--------------------+----------+--------+---------------+
| id|compliance|workflow| rule-id|
+--------------------+----------+--------+---------------+
|account/product/r...| PASS| NEW|r-1879bajhdfd80|
|account/product/r...| PASS| NEW|r-198Hhfu89421s|
|account/product/t...| FAIL| NEW| null|
+--------------------+----------+--------+---------------+
If it needs to be more flexible, the first substring check should be replaced with a pattern check.
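One way such a pattern check could look is sketched below, using regexp_extract instead of fixed positions. The pattern, the helper names add_rule_id and extract_rule_id, and the assumption that the rule id is the single path segment right after "rule-id/" are all illustrative and not part of the original answer. regexp_extract returns an empty string when nothing matches, so the sketch turns that into null with when().

```python
import re

# Hypothetical pattern: capture the path segment that follows "rule-id/".
RULE_ID_PATTERN = r"rule-id/([^/]+)"


def add_rule_id(df, source_col="id", target_col="rule-id"):
    """Return df with target_col holding the captured segment, or null."""
    # Imported lazily so the pattern can be checked without a Spark install.
    from pyspark.sql import functions as F

    extracted = F.regexp_extract(F.col(source_col), RULE_ID_PATTERN, 1)
    # regexp_extract yields "" on no match; when() without otherwise()
    # leaves those rows as null.
    return df.withColumn(target_col, F.when(extracted != "", extracted))


def extract_rule_id(value):
    """Plain-Python equivalent, handy for trying the pattern on sample ids."""
    match = re.search(RULE_ID_PATTERN, value)
    return match.group(1) if match else None
```

With the df1 from above, this would be called as df2 = add_rule_id(df1), followed by df2.show(truncate=False) to see the full id values. Unlike the substring version, it does not depend on the prefix length or on the rule id being exactly 15 characters long.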