简体   繁体   中英

Pandas, remove everything after last '_'

I have the following kind of strings in my column seen below. I would like to parse out everything after the last _ of each string, and if there is no _ then leave the string as-is. (as my below try will just exclude strings with no _ )

so far I have tried below, seen here: Python pandas: remove everything after a delimiter in a string . But it is just parsing out everything after first _

d6['SOURCE_NAME'] = d6['SOURCE_NAME'].str.split('_').str[0]

Here are some example strings in my SOURCE_NAME column.

Stackoverflow_1234
Stack_Over_Flow_1234
Stackoverflow
Stack_Overflow_1234

Expected:

Stackoverflow
Stack_Over_Flow
Stackoverflow
Stack_Overflow

any help would be appreciated.

Use a combination of str.rsplit and str.get for your desired outcome. str.rsplit simply splits a string from the end, while str.get gets the nth element of an iterator within a pd.Series object.


Answer

d6['SOURCE_NAME'] = df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)

the n argument in rsplit limits number of splits in output so that you only keep everything before the last '_'.

Even though a solution using pd.Series.apply is almost half as fast, I like this one because is more expressive in it's syntax. If you want to use the pd.Series.apply solution (faster) check the timing part!

pandas documentation .


Example

strs = ['Stackoverflow_1234',
        'Stack_Over_Flow_1234',
        'Stackoverflow',
        'Stack_Overflow_1234']
df = pd.DataFrame(data={'SOURCE_NAME': strs})

This results in

print(df)
            SOURCE_NAME
0    Stackoverflow_1234
1  Stack_Over_Flow_1234
2         Stackoverflow
3   Stack_Overflow_1234

Using the proposed solution:

df['SOURCE_NAME'].str.rsplit('_', 1).str.get(0)

0      Stackoverflow
1    Stack_Over_Flow
2      Stackoverflow
3     Stack_Overflow
Name: SOURCE_NAME, dtype: object

Time

Interestingly, using pd.Series.str is not necessarily faster than using pd.Series.apply :

import pandas as pd

df = pd.DataFrame(data={'SOURCE_NAME': ['stackoverflow_1234_abcd'] * 1000})

%timeit df['SOURCE_NAME'].apply(lambda x: x.rsplit('_', 1)[0])
497 µs ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)
1.04 ms ± 4.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# increasing the number of rows x 100
df = pd.concat([df] * 100)

%timeit df['SOURCE_NAME'].apply(lambda x: x.rsplit('_', 1)[0])
31.7 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)
84.1 ms ± 6.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

you could try applying lambda as such:

d6['SOURCE_NAME'] = df['SOURCE_NAME'].apply(lambda x: x.split('_')[0])

Hope that helps!

Using rsplit() returns what you want to achieve, you can tell it how many times to split your string.

s = "Stack_Over_Flow_1234"
s.rsplit('_', 1)[0] # Split my string one time and get the first part of it

This then returns 'Stack_Over_Flow'

You can use the string.split('_') function to split the string into a list of substrings around every underscore, then recombine them without the last element. Here is a snippet using your examples:

a = ["Stackoverflow_1234", "Stack_Over_Flow_1234", "Stackoverflow", "Stack_Overflow_1234"]

for e in a:

    # Split the string into a list, separated at '_'
    splitStr = e.split("_")

    # If there is only 1 element, we can use it directly
    if len(splitStr) == 1:
        print(splitStr[0])

    # Slice off the final substring and join the remaining 
    # substrings back together with underscores
    else:
        print("_".join(splitStr[:-1]))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM