简体   繁体   中英

Split string IF delimiter is found

Consider the following data frame,

cols = ['Id', 'col1', 'col2', 'col3']
vals = [('1A','Not his side|:|This side', 'This side', 'Not this either|:|but this'),  
        ('1B','Keep this', 'This one|:|keep this', 'remove|:|keep that')]

dd1 = sqlContext.createDataFrame(vals, cols)

#+---+------------------------+--------------------+--------------------------+
#|Id |col1                    |col2                |col3                      |
#+---+------------------------+--------------------+--------------------------+
#|1A |Not his side|:|This side|This side           |Not this either|:|but this|
#|1B |Keep this               |This one|:|keep this|remove|:|keep that        |
#+---+------------------------+--------------------+--------------------------+

What I need to do is to split the strings at |:| and keep the second word. However, If the string does not contain the delimiter ( |:| ), then I get null , ie:

users1 = [F.split(F.col(x), "\\|:\\|").alias(x) for x in cols]
dd1.select(*users1).show()

#+---------+---------+---------+
#|     col1|     col2|     col3|
#+---------+---------+---------+
#|This side|     null| but this|
#|     null|keep this|keep that|
#+---------+---------+---------+

The result I'm looking for is:

+---+---------+---------+---------+
|Id |col1     |col2     |col3     |
+---+---------+---------+---------+
|1A |This side|This side|but this |
|1B |Keep this|keep this|keep that|
+---+---------+---------+---------+

Use when and otherwise and check if the string contains "|:|" . It can be done as follows:

cols = ['col1', 'col2', 'col3']

users1 = [F.when(F.col(x).contains("|:|"), F.split(F.col(x), "\\|:\\|")[1]).otherwise(F.col(x)).alias(x) for x in cols]
dd1.select(F.col('Id'), *users1)

Here the cols only include the columns that you want to split. The final select will include the Id column as well.

You can use when and size inbuilt functions as

users1 = [F.when(F.size(F.split(F.col(x), "\\|:\\|")) > 1, F.split(F.col(x), "\\|:\\|")[1]).otherwise(F.col(x)).alias(x) for x in cols]
dd1.select(*users1).show()

which should give you

+---+---------+---------+---------+
| Id|     col1|     col2|     col3|
+---+---------+---------+---------+
| 1A|This side|This side| but this|
| 1B|Keep this|keep this|keep that|
+---+---------+---------+---------+

You can modify the answer so that you can use split function only once.

I hope the answer is helpful and should be a good hint on how to proceed.

The selection you defined in your questions does not yield the result you mentioned. I think you forgot to select the element from the array ;)

users1 = [F.split(F.col(x), "\\|:\\|").alias(x) for x in cols]
dd1.select(*users1).show()
+----+--------------------+--------------------+--------------------+
|  Id|                col1|                col2|                col3|
+----+--------------------+--------------------+--------------------+
|[1A]|[Not his side, Th...|         [This side]|[Not this either,...|
|[1B]|         [Keep this]|[This one, keep t...| [remove, keep that]|
+----+--------------------+--------------------+--------------------+

To achieve what you want, you simply need to select the last element of the array. You can use the size function for this:

users2 = [F.col(x).getItem(F.size(F.col(x))-1).alias(x) for x in cols]
dd1.select(*users1).select(*users2).show()
+---+---------+---------+---------+
| Id|     col1|     col2|     col3|
+---+---------+---------+---------+
| 1A|This side|This side| but this|
| 1B|Keep this|keep this|keep that|
+---+---------+---------+---------+

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM