Slice a string by different characters using Python Pandas
How can I slice a string in a DataFrame, starting from the left, at any of several characters such as ' / - . ? I only want the part before the first time one of these characters shows up.
key name
1 McDonald's
2 CVS/PHARMACY
3 CVS/Store
4 WAL-MART
5 AMAZON.CO
Expected result:
key name for_Group
1 McDonald's McDonald
2 CVS/PHARMACY CVS
3 CVS/Store CVS
4 WAL-MART WAL
5 AMAZON.CO AMAZON
I'm not sure whether this needs a regular expression.
Option 1
str.split with expand=True
df['for_group'] = df.name.str.split(r"[\'\/\-\.]", expand=True)[0]
key name for_group
0 1 McDonald's McDonald
1 2 CVS/PHARMACY CVS
2 3 CVS/Store CVS
3 4 WAL-MART WAL
4 5 AMAZON.CO AMAZON
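For anyone who wants to run Option 1 end to end, here is a minimal sketch that rebuilds the sample frame from the question. The n=1 argument and the simplified character class are my own additions, not part of the answer above: n=1 stops splitting after the first delimiter, so only two columns are produced.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "key": [1, 2, 3, 4, 5],
    "name": ["McDonald's", "CVS/PHARMACY", "CVS/Store", "WAL-MART", "AMAZON.CO"],
})

# Inside a character class, "'", "/" and "." need no escaping;
# putting "-" last keeps it a literal hyphen rather than a range.
# n=1 (an addition to the answer's call) splits on the first delimiter only.
parts = df["name"].str.split(r"['/.-]", n=1, expand=True)
df["for_group"] = parts[0]

print(df)
```

Column 0 of the expanded result holds everything before the first delimiter, which is the value the question asks for.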
Option 2 (Best option)
str.extract
(I personally prefer this one: the lazy group matches only up to the first of your stop characters.)
df.name.str.extract(r'(.*?)[\'\/\-\.]', expand=False)
0 McDonald
1 CVS
2 CVS
3 WAL
4 AMAZON
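One caveat worth checking before relying on this pattern: it requires a stop character, so a name without one does not match at all and comes back as NaN. The sketch below shows that behaviour and one possible alternative pattern. The "Target" row and the variable names are made up for illustration.

```python
import pandas as pd

# "Target" is a hypothetical row with no stop character in it.
names = pd.Series(["McDonald's", "WAL-MART", "Target"])

# Same idea as the answer's pattern: a stop character must follow the group,
# so "Target" produces no match and the result is NaN.
strict = names.str.extract(r"(.*?)['/.-]", expand=False)

# Alternative (an assumption about what you may want): capture the leading
# run of non-stop characters, which also works when no stop character exists.
lenient = names.str.extract(r"^([^'/.-]+)", expand=False)

print(strict.tolist())
print(lenient.tolist())
```

Which variant is right depends on whether an undelimited name should be kept whole or flagged as missing.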
The second option here is about twice as fast in this benchmark (50,000 rows):
df = pd.concat([df]*10000)
%timeit df.name.str.split(r"[\'\/\-\.]", expand=True)[0]
141 ms ± 1.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.name.str.extract(r'(.*?)[\'\/\-\.]', expand=False)
72.6 ms ± 397 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
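When benchmarking variants of the extract pattern, keep in mind that a lazy (.*?) and a greedy (.*) are not interchangeable. They agree on the sample data only because each name contains a single stop character. A small sketch with plain re, using a made-up name that has two delimiters:

```python
import re

STOP = r"['/.-]"
name = "CVS/PHARMACY-24H"  # hypothetical name with two stop characters

# Lazy: the group ends at the first stop character.
lazy = re.match(r"(.*?)" + STOP, name).group(1)

# Greedy: the group runs up to the last stop character.
greedy = re.match(r"(.*)" + STOP, name).group(1)

print(lazy)    # CVS
print(greedy)  # CVS/PHARMACY
```

Since the question asks for the text before the first occurrence, the lazy form is the one that matches the requirement.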
Method 1
You can use the regular expression below, which matches a word character (a-z, A-Z, 0-9 or underscore) repeated one or more times. re.findall returns a list of all matches, and you take its first element.
import re
df['for_group'] = df['name'].apply(lambda x: re.findall(r"[\w]+", x)[0])
A faster approach is to use re.search(), as pointed out by @user3483203, because it stops at the first match instead of collecting all of them:
df['for_group'] = df['name'].apply(lambda x: re.search(r"[\w]+", x).group())
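Note that re.search returns None when a string contains no word characters, in which case .group() raises AttributeError. A minimal guarded sketch follows; the helper name first_word and its default argument are hypothetical, not from the answer.

```python
import re

WORD = re.compile(r"\w+")  # compiled once and reused for every value

def first_word(text, default=""):
    """Return the first run of word characters in text, or default if none."""
    match = WORD.search(text)
    return match.group() if match else default

print(first_word("WAL-MART"))  # WAL
print(first_word("A_B-C"))     # A_B  (underscore counts as a word character)
print(repr(first_word("---"))) # ''
```

It could then be applied as df['for_group'] = df['name'].apply(first_word). Because \w also matches digits and underscores, this splits on a broader set of characters than the four listed in the question.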
Method 2
Similarly, you can use:
df['for_group'] = df.name.str.split(r'\W+').str[0]
Output:
key name for_group
0 1 McDonald's McDonald
1 2 CVS/PHARMACY CVS
2 3 CVS/Store CVS
3 4 WAL-MART WAL
4 5 AMAZON.CO AMAZON
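One edge case to be aware of with the split-based method: if a name starts with a non-word character, the first piece of the split is an empty string. The sketch below compares it with extracting the first word directly. The second sample value is made up for illustration.

```python
import pandas as pd

# "'Tis-Shop" is a hypothetical name that begins with a delimiter.
names = pd.Series(["CVS/Store", "'Tis-Shop"])

# Splitting on runs of non-word characters: a leading delimiter
# leaves an empty string as the first piece.
by_split = names.str.split(r"\W+").str[0]

# Extracting the first run of word characters skips the leading delimiter.
by_extract = names.str.extract(r"(\w+)", expand=False)

print(by_split.tolist())    # ['CVS', '']
print(by_extract.tolist())  # ['CVS', 'Tis']
```

If such names can occur in your data, the extract form is probably closer to what you want. For the sample data in the question both give the same result.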