Pandas, remove everything after last '_'

The strings in my column look like the examples below. I would like to strip off everything after the last _ of each string, and if there is no _, leave the string as-is (my attempt below would just exclude strings with no _).

So far I have tried the code below, taken from: Python pandas: remove everything after a delimiter in a string. But it keeps only the part before the first _:

d6['SOURCE_NAME'] = d6['SOURCE_NAME'].str.split('_').str[0]

Here are some example strings in my SOURCE_NAME column.

Stackoverflow_1234
Stack_Over_Flow_1234
Stackoverflow
Stack_Overflow_1234

Expected:

Stackoverflow
Stack_Over_Flow
Stackoverflow
Stack_Overflow

Any help would be appreciated.

Use a combination of str.rsplit and str.get for your desired outcome. str.rsplit splits each string starting from the end, while str.get gets the nth element of each entry (here, each list of parts) in a pd.Series object.


Answer

d6['SOURCE_NAME'] = d6['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)

The n argument of rsplit limits the number of splits, so with n=1 each string is split only at the last '_' and you keep everything before it.
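To see what n=1 does, it can help to look at the intermediate lists before .str.get(0) picks the first element. This is a minimal sketch, assuming pandas is imported as pd; the Series name s and the variables parts and result are made up for illustration and do not appear in the original.

```python
import pandas as pd

# Hypothetical sample data, mirroring strings from the question
s = pd.Series(['Stackoverflow_1234', 'Stack_Over_Flow_1234', 'Stackoverflow'])

# Splitting from the right at most once yields lists of one or two parts
parts = s.str.rsplit('_', n=1)
print(parts.tolist())

# Element 0 of each list is everything before the last '_'
# (or the whole string when there is no '_')
result = parts.str.get(0)
print(result.tolist())
```

Because a string without '_' still produces a one-element list, .str.get(0) returns it unchanged rather than dropping the row.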

Even though a solution using pd.Series.apply takes roughly half the time, I like this one because it is more expressive in its syntax. If you want to use the (faster) pd.Series.apply solution, check the timing part!

See the pandas documentation.


Example

import pandas as pd

strs = ['Stackoverflow_1234',
        'Stack_Over_Flow_1234',
        'Stackoverflow',
        'Stack_Overflow_1234']
df = pd.DataFrame(data={'SOURCE_NAME': strs})

This results in

print(df)
            SOURCE_NAME
0    Stackoverflow_1234
1  Stack_Over_Flow_1234
2         Stackoverflow
3   Stack_Overflow_1234

Using the proposed solution:

df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)

0      Stackoverflow
1    Stack_Over_Flow
2      Stackoverflow
3     Stack_Overflow
Name: SOURCE_NAME, dtype: object

Time

Interestingly, using pd.Series.str is not necessarily faster than using pd.Series.apply:

import pandas as pd

df = pd.DataFrame(data={'SOURCE_NAME': ['stackoverflow_1234_abcd'] * 1000})

%timeit df['SOURCE_NAME'].apply(lambda x: x.rsplit('_', 1)[0])
497 µs ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)
1.04 ms ± 4.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# increasing the number of rows x 100
df = pd.concat([df] * 100)

%timeit df['SOURCE_NAME'].apply(lambda x: x.rsplit('_', 1)[0])
31.7 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)
84.1 ms ± 6.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
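The %timeit magic only works inside IPython or Jupyter. As a rough sketch of the same comparison in a plain script, the standard-library timeit module could be used instead; the function names with_apply and with_str_accessor are made up for illustration, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Same synthetic data as in the timings above
df = pd.DataFrame(data={'SOURCE_NAME': ['stackoverflow_1234_abcd'] * 1000})

def with_apply():
    # Plain Python str.rsplit applied element by element
    return df['SOURCE_NAME'].apply(lambda x: x.rsplit('_', 1)[0])

def with_str_accessor():
    # Vectorized-style string accessor
    return df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)

# Check that both approaches agree before comparing their speed
same_result = with_apply().tolist() == with_str_accessor().tolist()

# Total seconds for 100 calls of each approach
apply_seconds = timeit.timeit(with_apply, number=100)
str_seconds = timeit.timeit(with_str_accessor, number=100)
print(same_result, apply_seconds, str_seconds)
```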

You could try applying a lambda like this:

d6['SOURCE_NAME'] = d6['SOURCE_NAME'].apply(lambda x: x.rsplit('_', 1)[0])

Hope that helps!

rsplit() does what you want: you can tell it how many times to split your string, starting from the right.

s = "Stack_Over_Flow_1234"
s.rsplit('_', 1)[0] # Split my string one time and get the first part of it

This then returns 'Stack_Over_Flow'
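The same call also covers the case with no underscore, since rsplit then returns a one-element list. A small sketch over the strings from the question (the names list and the trimmed variable are only for illustration):

```python
# Sample strings from the question
names = ["Stackoverflow_1234", "Stack_Over_Flow_1234", "Stackoverflow", "Stack_Overflow_1234"]

# rsplit('_', 1) returns a one-element list when there is no '_',
# so taking index 0 is always safe
trimmed = [name.rsplit('_', 1)[0] for name in names]
print(trimmed)
```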

You can use the split('_') string method to split the string into a list of substrings around every underscore, then recombine them without the last element. Here is a snippet using your examples:

a = ["Stackoverflow_1234", "Stack_Over_Flow_1234", "Stackoverflow", "Stack_Overflow_1234"]

for e in a:

    # Split the string into a list, separated at '_'
    splitStr = e.split("_")

    # If there is only 1 element, we can use it directly
    if len(splitStr) == 1:
        print(splitStr[0])

    # Slice off the final substring and join the remaining 
    # substrings back together with underscores
    else:
        print("_".join(splitStr[:-1]))
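If the values are needed rather than printed, the same split-and-join logic could be wrapped in a function. This is only a sketch: strip_last_segment is a hypothetical helper name, not something from the original answer. A function like this could also be passed to pd.Series.apply on the SOURCE_NAME column.

```python
def strip_last_segment(text):
    """Return text without its last '_' and what follows (hypothetical helper)."""
    pieces = text.split("_")
    # A single piece means there was no '_', so keep the string as-is
    if len(pieces) == 1:
        return text
    # Drop the final piece and rejoin the rest with underscores
    return "_".join(pieces[:-1])

a = ["Stackoverflow_1234", "Stack_Over_Flow_1234", "Stackoverflow", "Stack_Overflow_1234"]
cleaned = [strip_last_segment(e) for e in a]
print(cleaned)
```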
