Pandas, remove everything after last '_'
I have the following kind of strings in my column, seen below. I would like to remove the last _ and everything after it from each string, and if there is no _ then leave the string as-is (my attempt below just excludes strings with no _).
So far I have tried the line below, seen here: Python pandas: remove everything after a delimiter in a string. But it removes everything after the first _ instead:
d6['SOURCE_NAME'] = d6['SOURCE_NAME'].str.split('_').str[0]
Here are some example strings in my SOURCE_NAME column.
Stackoverflow_1234
Stack_Over_Flow_1234
Stackoverflow
Stack_Overflow_1234
Expected:
Stackoverflow
Stack_Over_Flow
Stackoverflow
Stack_Overflow
Any help would be appreciated.
Use a combination of str.rsplit and str.get for your desired outcome. str.rsplit simply splits a string from the end, while str.get gets the nth element of each entry within a pd.Series object.
d6['SOURCE_NAME'] = d6['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)
The n argument in rsplit limits the number of splits, so that you only keep everything before the last '_'.
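To make the effect of n more concrete, here is a minimal sketch using the example strings from the question. The variable names s, parts, and result are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

s = pd.Series(['Stackoverflow_1234', 'Stack_Over_Flow_1234',
               'Stackoverflow', 'Stack_Overflow_1234'])

# With n=1, each string is split at most once, starting from the right,
# so every entry becomes a list of one or two pieces.
parts = s.str.rsplit('_', n=1)
print(parts.tolist())

# Position 0 of each list is everything before the last '_',
# or the whole string when it contains no '_' at all.
result = parts.str.get(0)
print(result.tolist())
```

Strings without an underscore produce a one-element list, which is why they pass through unchanged.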
Even though a solution using pd.Series.apply is roughly twice as fast in the timings below, I like this one because it is more expressive in its syntax. If you want to use the pd.Series.apply solution (faster), check the timing part!
pandas documentation.
import pandas as pd

strs = ['Stackoverflow_1234',
        'Stack_Over_Flow_1234',
        'Stackoverflow',
        'Stack_Overflow_1234']
df = pd.DataFrame(data={'SOURCE_NAME': strs})
This results in
print(df)
SOURCE_NAME
0 Stackoverflow_1234
1 Stack_Over_Flow_1234
2 Stackoverflow
3 Stack_Overflow_1234
Using the proposed solution:
df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)
0 Stackoverflow
1 Stack_Over_Flow
2 Stackoverflow
3 Stack_Overflow
Name: SOURCE_NAME, dtype: object
Interestingly, using pd.Series.str is not necessarily faster than using pd.Series.apply:
import pandas as pd
df = pd.DataFrame(data={'SOURCE_NAME': ['stackoverflow_1234_abcd'] * 1000})
%timeit df['SOURCE_NAME'].apply(lambda x: x.rsplit('_', 1)[0])
497 µs ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)
1.04 ms ± 4.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# increasing the number of rows x 100
df = pd.concat([df] * 100)
%timeit df['SOURCE_NAME'].apply(lambda x: x.rsplit('_', 1)[0])
31.7 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)
84.1 ms ± 6.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
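The %timeit lines above only work in IPython or Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard-library timeit module can be used instead. The helper names with_apply and with_str are assumptions made for this illustration, and the numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({'SOURCE_NAME': ['stackoverflow_1234_abcd'] * 1000})

def with_apply():
    # Plain Python str.rsplit called once per row
    return df['SOURCE_NAME'].apply(lambda x: x.rsplit('_', 1)[0])

def with_str():
    # pandas string accessor version
    return df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)

# Both approaches should agree before their timings are compared.
same = with_apply().tolist() == with_str().tolist()
print("same result:", same)

t_apply = timeit.timeit(with_apply, number=100)
t_str = timeit.timeit(with_str, number=100)
print(f"apply: {t_apply:.4f} s, str: {t_str:.4f} s (100 runs each)")
```

One difference to keep in mind: the apply version calls a str method directly on each value, so it raises on missing values such as NaN, while the .str accessor passes them through.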
You could try applying a lambda like this, using rsplit with a limit of 1 so that only the last '_' is used:
d6['SOURCE_NAME'] = d6['SOURCE_NAME'].apply(lambda x: x.rsplit('_', 1)[0])
Hope that helps!
Using rsplit() returns what you want to achieve; you can tell it how many times to split your string.
s = "Stack_Over_Flow_1234"
s.rsplit('_', 1)[0] # Split my string one time and get the first part of it
This then returns 'Stack_Over_Flow'.
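As a small sketch of how this behaves on all of the question's examples, including the one with no underscore (the names examples and trimmed are illustrative assumptions):

```python
examples = ["Stackoverflow_1234", "Stack_Over_Flow_1234",
            "Stackoverflow", "Stack_Overflow_1234"]

# rsplit('_', 1) always returns a non-empty list, so index 0 is safe
# even when the string contains no underscore.
trimmed = [s.rsplit('_', 1)[0] for s in examples]
print(trimmed)
```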
You can use string.split('_') to split the string into a list of substrings at every underscore, then recombine them without the last element. Here is a snippet using your examples:
a = ["Stackoverflow_1234", "Stack_Over_Flow_1234", "Stackoverflow", "Stack_Overflow_1234"]
for e in a:
    # Split the string into a list, separated at '_'
    splitStr = e.split("_")
    if len(splitStr) == 1:
        # If there is only 1 element, we can use it directly
        print(splitStr[0])
    else:
        # Slice off the final substring and join the remaining
        # substrings back together with underscores
        print("_".join(splitStr[:-1]))
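The same split-and-join idea can be wrapped in a reusable function, which makes it easier to apply to a column or to test. This is only a sketch: the function name drop_last_segment and its sep parameter are hypothetical, introduced here for illustration.

```python
def drop_last_segment(text, sep="_"):
    """Return text without its last sep-delimited segment (hypothetical helper)."""
    pieces = text.split(sep)
    if len(pieces) == 1:
        # No separator present: leave the string as-is
        return text
    # Drop the final piece and rejoin the rest with the separator
    return sep.join(pieces[:-1])

names = ["Stackoverflow_1234", "Stack_Over_Flow_1234",
         "Stackoverflow", "Stack_Overflow_1234"]
cleaned = [drop_last_segment(n) for n in names]
print(cleaned)

# For these inputs the result matches the shorter rsplit approach.
matches_rsplit = all(drop_last_segment(n) == n.rsplit("_", 1)[0] for n in names)
print(matches_rsplit)
```

With pandas, a function like this could be passed to Series.apply, although the rsplit one-liner is shorter and gives the same result.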