Slicing Dataframe column based on length of strings
I would like to remove the first 3 characters from strings in a Dataframe column where the length of the string is > 4.
Otherwise, they should remain the same.
E.g.:
bloomberg_ticker_y
AIM9
DJEM9 # (should be M9)
FAM9
IXPM9 # (should be M9)
I can filter the strings by length:
merged['bloomberg_ticker_y'].str.len() > 4
and slice the strings:
merged['bloomberg_ticker_y'].str[-2:]
But I am not sure how to put this together and apply it to my dataframe.
Any help would be appreciated.
You can use a list comprehension:
df = pd.DataFrame({'bloomberg_ticker_y' : ['AIM9', 'DJEM9', 'FAM9', 'IXPM9']})
df['new'] = [x[-2:] if len(x)>4 else x for x in df['bloomberg_ticker_y']]
Output:
bloomberg_ticker_y new
0 AIM9 AIM9
1 DJEM9 M9
2 FAM9 FAM9
3 IXPM9 M9
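Note that `x[-2:]` keeps the last two characters, which matches "remove the first 3 characters" only when the string has exactly 5 characters. A minimal sketch of the difference, assuming a made-up 6-character ticker `ABCDM9` that is not part of the question's data:

```python
import pandas as pd

# 'ABCDM9' is a hypothetical 6-character ticker added for illustration only.
df = pd.DataFrame({'bloomberg_ticker_y': ['AIM9', 'DJEM9', 'ABCDM9']})

# Keep the last two characters of long strings (as in the answer above).
keep_last_two = [x[-2:] if len(x) > 4 else x for x in df['bloomberg_ticker_y']]

# Drop the first three characters of long strings (as the question words it).
drop_first_three = [x[3:] if len(x) > 4 else x for x in df['bloomberg_ticker_y']]

print(keep_last_two)     # ['AIM9', 'M9', 'M9']
print(drop_first_three)  # ['AIM9', 'M9', 'DM9']
```

Which of the two is right depends on whether longer tickers can occur and what should be kept from them.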
You can use numpy.where to apply a condition to pick slices based on string length:
np.where(df['bloomberg_ticker_y'].str.len() > 4,
df['bloomberg_ticker_y'].str[3:],
df['bloomberg_ticker_y'])
# array(['AIM9', 'M9', 'FAM9', 'M9'], dtype=object)
df['bloomberg_ticker_sliced'] = (
np.where(df['bloomberg_ticker_y'].str.len() > 4,
df['bloomberg_ticker_y'].str[3:],
df['bloomberg_ticker_y']))
df
bloomberg_ticker_y bloomberg_ticker_sliced
0 AIM9 AIM9
1 DJEM9 M9
2 FAM9 FAM9
3 IXPM9 M9
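`np.where` returns a plain NumPy array, so the column's index is not carried over. If a Series with the original index is wanted instead of a direct column assignment, one option is to wrap the result. A minimal sketch, assuming hypothetical index labels `'a'` to `'d'` that are not in the question:

```python
import numpy as np
import pandas as pd

# The string index labels are made up to show that the index is preserved.
df = pd.DataFrame({'bloomberg_ticker_y': ['AIM9', 'DJEM9', 'FAM9', 'IXPM9']},
                  index=['a', 'b', 'c', 'd'])
col = df['bloomberg_ticker_y']

# Wrap the array returned by np.where so the result keeps the index and name.
sliced = pd.Series(np.where(col.str.len() > 4, col.str[3:], col),
                   index=col.index, name=col.name)

print(sliced.tolist())       # ['AIM9', 'M9', 'FAM9', 'M9']
print(list(sliced.index))    # ['a', 'b', 'c', 'd']
```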
If you prefer a map-based solution (element-wise, not truly vectorized), it is:
df['bloomberg_ticker_y'].map(lambda x: x[3:] if len(x) > 4 else x)
0 AIM9
1 M9
2 FAM9
3 M9
Name: bloomberg_ticker_y, dtype: object
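The lambda above calls `len(x)` on every element, which raises a `TypeError` if the column contains a missing value. The question's data has none, so the following is only a sketch under the assumption that missing entries may appear; `drop_prefix` is a hypothetical helper name:

```python
import pandas as pd

# The None entry is added for illustration; it is not in the question's data.
s = pd.Series(['AIM9', 'DJEM9', None], name='bloomberg_ticker_y')

def drop_prefix(x):
    # Only slice real strings longer than 4 characters; pass anything else through.
    if isinstance(x, str) and len(x) > 4:
        return x[3:]
    return x

result = s.map(drop_prefix)
print(result.iloc[0], result.iloc[1])  # AIM9 M9
print(pd.isna(result.iloc[2]))         # True
```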
I saw quite a variety of answers, so I decided to compare them in terms of speed:
# Create a large test dataframe
df = pd.DataFrame({'bloomberg_ticker_y' : ['AIM9', 'DJEM9', 'FAM9', 'IXPM9']})
df = pd.concat([df]*100000)
df.shape
#Out
(400000, 1)
CS95 #1: np.where
%%timeit
np.where(df['bloomberg_ticker_y'].str.len() > 4,
df['bloomberg_ticker_y'].str[3:],
df['bloomberg_ticker_y'])
Result:
163 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
CS95 #2: map-based solution
%%timeit
df['bloomberg_ticker_y'].map(lambda x: x[3:] if len(x) > 4 else x)
Result:
86 ms ± 7.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Yatu: Series.mask
%%timeit
df.bloomberg_ticker_y.mask(df.bloomberg_ticker_y.str.len().gt(4),
other=df.bloomberg_ticker_y.str[-2:])
Result:
187 ms ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Vlemaistre: list comprehension
%%timeit
[x[-2:] if len(x)>4 else x for x in df['bloomberg_ticker_y']]
Result:
84.8 ms ± 4.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
pault: str.replace with regex
%%timeit
df["bloomberg_ticker_y"].str.replace(r".{3,}(?=.{2}$)", "", regex=True)
Result:
324 ms ± 17.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Cobra: DataFrame.apply
%%timeit
df.apply(lambda x: (x['bloomberg_ticker_y'][3:] if len(x['bloomberg_ticker_y']) > 4 else x['bloomberg_ticker_y']) , axis=1)
Result:
6.83 s ± 387 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Conclusion
The fastest method is the list comprehension, closely followed by the map-based solution.
The slowest method by far (as expected) is DataFrame.apply, followed by str.replace with regex.
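The timings above rely on IPython's `%%timeit` and will vary with hardware and pandas version. To repeat the comparison in a plain script, something like the following sketch could be used; it relies on the standard `timeit` module, a smaller frame than the 400,000 rows above (an assumption made to keep it quick), and a hypothetical `candidates` dictionary covering four of the approaches:

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({'bloomberg_ticker_y': ['AIM9', 'DJEM9', 'FAM9', 'IXPM9']})
# 4,000 rows instead of 400,000 so the script finishes quickly.
df = pd.concat([base] * 1000, ignore_index=True)
col = df['bloomberg_ticker_y']

candidates = {
    'np.where': lambda: np.where(col.str.len() > 4, col.str[3:], col),
    'map': lambda: col.map(lambda x: x[3:] if len(x) > 4 else x),
    'mask': lambda: col.mask(col.str.len().gt(4), other=col.str[-2:]),
    'list comprehension': lambda: [x[-2:] if len(x) > 4 else x for x in col],
}

# All candidates should agree on this data before their timings are compared.
expected = ['AIM9', 'M9', 'FAM9', 'M9'] * 1000
for name, func in candidates.items():
    assert list(func()) == expected, name

# Average seconds per call over 5 calls of each candidate.
timings = {name: timeit.timeit(func, number=5) / 5
           for name, func in candidates.items()}
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f'{name}: {seconds * 1e3:.2f} ms per call')
```

Checking that every candidate returns the same values first avoids comparing the speed of methods that do different things.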
You can use Series.mask:
df['bloomberg_ticker_y'] = (df.bloomberg_ticker_y.mask(
df.bloomberg_ticker_y.str.len().gt(4),
other=df.bloomberg_ticker_y.str[-2:]))
bloomberg_ticker_y
0 AIM9
1 M9
2 FAM9
3 M9
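For comparison, `where` is the mirror image of `mask`: it keeps values where the condition is True and substitutes `other` elsewhere. A minimal sketch of the same operation written that way, using `str[3:]` to drop the first three characters as the question describes:

```python
import pandas as pd

df = pd.DataFrame({'bloomberg_ticker_y': ['AIM9', 'DJEM9', 'FAM9', 'IXPM9']})
col = df['bloomberg_ticker_y']

# where() keeps values where the condition is True and takes `other` elsewhere,
# so the condition is the negation of the one passed to mask().
result = col.where(col.str.len().le(4), other=col.str[3:])
print(result.tolist())  # ['AIM9', 'M9', 'FAM9', 'M9']
```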
You can also use DataFrame.apply:
import pandas as pd
df = pd.DataFrame({'bloomberg_ticker_y' : ['AIM9', 'DJEM9', 'FAM9', 'IXPM9']})
df['bloomberg_ticker_y'] = df.apply(lambda x: (x['bloomberg_ticker_y'][3:] if len(x['bloomberg_ticker_y']) > 4 else x['bloomberg_ticker_y']) , axis=1)
Output:
bloomberg_ticker_y
0 AIM9
1 M9
2 FAM9
3 M9
Another approach is to use regular expressions:
df["bloomberg_ticker_y"].str.replace(r".{3,}(?=.{2}$)", "", regex=True)
#0 AIM9
#1 M9
#2 FAM9
#3 M9
The pattern means:
.{3,}: Match 3 or more characters.
(?=.{2}$): Positive lookahead for exactly 2 characters followed by the end of the string.
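The pattern can be tried on plain strings with Python's `re` module. In this sketch, `ABCDEFG` is a made-up 7-character string used only to show that the greedy `.{3,}` consumes everything except the last two characters, rather than exactly three:

```python
import re

pattern = re.compile(r".{3,}(?=.{2}$)")

# 'ABCDEFG' is hypothetical; the other two values come from the question.
results = {ticker: pattern.sub('', ticker)
           for ticker in ['AIM9', 'DJEM9', 'ABCDEFG']}

for ticker, replaced in results.items():
    print(ticker, '->', replaced)
# AIM9 -> AIM9
# DJEM9 -> M9
# ABCDEFG -> FG
```

So for strings longer than 5 characters this behaves like `str[-2:]`, not like `str[3:]`.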