
Slice a string by different characters using Python Pandas

How can I slice a string in a DataFrame column, starting from the left, at any of several different characters, such as ' / - . ? I only want the part before the first time one of these characters shows up.

key   name
1   McDonald's
2   CVS/PHARMACY
3   CVS/Store
4   WAL-MART
5   AMAZON.CO

Expected result:

key   name            for_Group
1   McDonald's        McDonald
2   CVS/PHARMACY         CVS
3   CVS/Store            CVS
4   WAL-MART             WAL
5   AMAZON.CO          AMAZON

I'm not sure whether this needs a regular expression.

Option 1
str.split with expand=True

df['for_group'] = df.name.str.split(r"[\'\/\-\.]", expand=True)[0]

   key          name for_group
0    1    McDonald's  McDonald
1    2  CVS/PHARMACY       CVS
2    3     CVS/Store       CVS
3    4      WAL-MART       WAL
4    5     AMAZON.CO    AMAZON
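
For reference, a self-contained version of Option 1 might look like the sketch below. The DataFrame is rebuilt from the sample data in the question. Passing n=1 is an addition that is not in the original answer, made on the assumption that only the text before the first delimiter is needed, so the split can stop after one cut.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "key": [1, 2, 3, 4, 5],
    "name": ["McDonald's", "CVS/PHARMACY", "CVS/Store", "WAL-MART", "AMAZON.CO"],
})

# Split on the first ', /, - or . only (n=1), then keep the left-hand piece.
# n=1 is an assumption added here; the original answer splits on every match.
parts = df["name"].str.split(r"['/\-.]", n=1, expand=True)
df["for_group"] = parts[0]

print(df)
```

Inside a character class only the hyphen needs escaping here, so the shorter pattern is equivalent to the fully escaped one in the answer.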

Option 2 (Best option)
str.extract (I personally prefer this one; it matches lazily until it reaches the first of your stop characters)

df.name.str.extract(r'(.*?)[\'\/\-\.]', expand=False)

0    McDonald
1         CVS
2         CVS
3         WAL
4      AMAZON
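
One caveat worth checking: the pattern above requires a delimiter to be present, so a name without one yields NaN. The sketch below illustrates this with a hypothetical extra value, "TARGET", which is not in the question's data, and shows one possible alternative pattern that captures the leading run of non-delimiter characters instead.

```python
import pandas as pd

# "TARGET" is a hypothetical name with no delimiter, added for illustration.
names = pd.Series(["McDonald's", "CVS/PHARMACY", "WAL-MART", "AMAZON.CO", "TARGET"])

# Pattern from the answer: needs a delimiter after the captured text.
with_delimiter = names.str.extract(r"(.*?)['/\-.]", expand=False)

# Alternative (an assumption, not from the answer): capture everything
# from the start of the string up to the first delimiter or the end.
up_to_delimiter = names.str.extract(r"^([^'/\-.]+)", expand=False)

print(with_delimiter.tolist())
print(up_to_delimiter.tolist())
```

Which behavior is preferable depends on whether names without a delimiter should be kept whole or flagged as missing.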

The second option is roughly twice as fast on a 50,000-row frame:

df = pd.concat([df]*10000)

%timeit df.name.str.split(r"[\'\/\-\.]", expand=True)[0]
141 ms ± 1.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.name.str.extract(r'(.*)[\'\/\-\.]', expand=False)
72.6 ms ± 397 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
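
Note that the timed extract call uses the greedy (.*), while Option 2 uses the lazy (.*?). On the question's data each name contains exactly one delimiter, so both return the same result, but they differ as soon as a name contains two. The sketch below shows this with plain re and a hypothetical name, "WAL-MART.COM", that is not in the original data.

```python
import re

DELIMS = r"['/\-.]"

# Hypothetical name containing two delimiters.
name = "WAL-MART.COM"

# Lazy: the capture stops at the first delimiter.
lazy = re.search(r"(.*?)" + DELIMS, name).group(1)

# Greedy: the capture runs to the end, then backtracks to the last delimiter.
greedy = re.search(r"(.*)" + DELIMS, name).group(1)

print(lazy)
print(greedy)
```

For the "first occurrence" requirement in the question, the lazy form is the one that matches the intent.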

Method 1

You can use the regular expression below, which matches a word character (a-z, A-Z, 0-9 or underscore) repeated one or more times. re.findall returns a list of all matches, and you can take its first element.

import re
df['for_group'] = df['name'].apply(lambda x: re.findall(r"[\w]+", x)[0])

A faster approach is to use re.search(), as pointed out by @user3483203, because it stops at the first match instead of collecting all of them:

df['for_group'] = df['name'].apply(lambda x: re.search(r"[\w]+", x).group())
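
Both variants above raise an error when a value contains no word characters at all (findall returns an empty list, search returns None). A minimal sketch of a more defensive helper follows; the name first_word, its default argument, and the "---" test value are hypothetical and not part of the original answer.

```python
import re

# Compile once so the pattern is not re-parsed for every row.
WORD = re.compile(r"\w+")

def first_word(text, default=""):
    """Return the first run of word characters in text, or default if none."""
    match = WORD.search(text)
    return match.group() if match else default

# "---" is a hypothetical value with no word characters.
names = ["McDonald's", "CVS/PHARMACY", "WAL-MART", "AMAZON.CO", "---"]
result = [first_word(n) for n in names]
print(result)
```

With a DataFrame, the same helper could be passed directly to apply, as in df['name'].apply(first_word).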

Method 2

Similarly, you can split on runs of non-word characters and keep the first piece (note the raw string, which avoids an invalid-escape warning for \W):

df['for_group'] = df.name.str.split(r'\W+').str[0]

Output:

    key          name for_group
0    1    McDonald's  McDonald
1    2  CVS/PHARMACY       CVS
2    3     CVS/Store       CVS
3    4      WAL-MART       WAL
4    5     AMAZON.CO    AMAZON
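
Splitting on \W+ is broader than splitting on the four characters listed in the question: a space is also a non-word character. The sketch below compares the two with plain re.split, on the assumption that pandas str.split with a regular expression splits the same way. Both names are hypothetical.

```python
import re

# Hypothetical names, not from the question's data.
names = ["BEST BUY/Store", "'Quoted"]

results = []
for name in names:
    by_nonword = re.split(r"\W+", name)[0]     # Method 2 style: any non-word run
    by_delims = re.split(r"['/\-.]", name)[0]  # Option 1 style: listed characters only
    results.append((by_nonword, by_delims))
    print(repr(name), repr(by_nonword), repr(by_delims))
```

The first name is cut at the space by \W+ but only at the slash by the explicit character class, and a name that starts with a delimiter yields an empty first piece under either pattern.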
