
Indexing by str.contains(), then inserting a value into another column

I have a dataframe of store names that I have to standardize. For example, McDonalds 1234 LA -> McDonalds.

import pandas as pd
import re

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'store': ['McDonalds', 'Lidl', 'Lidl New York 123', 'KFC ', 'Taco Restaurant', 'Lidl Berlin', 'Popeyes', 'Wallmart', 'Aldi', 'London Lidl'],
})

print(df)

   id              store
0   1          McDonalds
1   2               Lidl
2   3  Lidl New York 123
3   4               KFC 
4   5    Taco Restaurant
5   6        Lidl Berlin
6   7            Popeyes
7   8           Wallmart
8   9               Aldi
9  10        London Lidl

So let's say I want to standardize the Lidl stores. The standard name will just be "Lidl".

I would like to find where Lidl appears in the dataframe, create a new column df['standard_name'], and insert the standard name there. However, I can't figure this out.

I'll first create the column where the standard name will be inserted:

df['standard_name'] = pd.np.nan

Then I search for instances of Lidl and insert the cleaned name into standard_name.

The plan is to use str.contains and then set the standardized value in the new column:

df[df.store.str.contains(r'\blidl\b',re.I,regex=True)]['standard'] = 'Lidl'

print(df)

   id              store  standard_name
0   1          McDonalds            NaN
1   2               Lidl            NaN
2   3  Lidl New York 123            NaN
3   4               KFC             NaN
4   5    Taco Restaurant            NaN
5   6        Lidl Berlin            NaN
6   7            Popeyes            NaN
7   8           Wallmart            NaN
8   9               Aldi            NaN
9  10        London Lidl            NaN

Nothing has been inserted. I checked the str.contains code on its own and found that it returned False for every row:

df.store.str.contains(r'\blidl\b',re.I,regex=True)

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
Name: store, dtype: bool

I'm not sure what's happening here.

What I am trying to end up with is the standardized names filled in like this:

   id              store  standard_name
0   1          McDonalds            NaN
1   2               Lidl           Lidl
2   3  Lidl New York 123           Lidl
3   4               KFC             NaN
4   5    Taco Restaurant            NaN
5   6        Lidl Berlin           Lidl
6   7            Popeyes            NaN
7   8           Wallmart            NaN
8   9               Aldi            NaN
9  10        London Lidl           Lidl

I will be trying to standardize the majority of business names in the dataset (McDonalds, Burger King, etc.). Any help is appreciated.

Also, is this the fastest way to do this? There are millions of rows to process.

If you want to set a new column, you can use DataFrame.loc together with case=False or flags=re.I:

Note: creating the column beforehand with df['standard_name'] = pd.np.nan is not necessary, so you can omit it (pd.np is also no longer available in pandas 2.0 and later).

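# Build a case-insensitive whole-word mask and assign through .loc,
# which creates the 'standard' column if it does not exist yet
df.loc[df.store.str.contains(r'\blidl\b', case=False), 'standard'] = 'Lidl'
# alternative: pass the regex flag by keyword
# df.loc[df.store.str.contains(r'\blidl\b', flags=re.I), 'standard'] = 'Lidl'
print(df)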
   id              store standard
0   1          McDonalds      NaN
1   2               Lidl     Lidl
2   3  Lidl New York 123     Lidl
3   4               KFC       NaN
4   5    Taco Restaurant      NaN
5   6        Lidl Berlin     Lidl
6   7            Popeyes      NaN
7   8           Wallmart      NaN
8   9               Aldi      NaN
9  10        London Lidl     Lidl
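
As for why the attempt in the question did nothing, two things seem to be at play. First, re.I is passed as the second positional argument, and going by the documented parameter order of str.contains (pat, case, flags, na, regex) that slot is case, not flags. Since re.I is truthy, the match stays case-sensitive and the lowercase pattern never matches 'Lidl'. Second, df[mask]['standard'] = ... assigns to the temporary object returned by df[mask], not to df. The following is a minimal sketch on a shortened, made-up version of the data. The variable names (sensitive, insensitive, subset) are only illustrative.

```python
import re
import pandas as pd

df = pd.DataFrame({'store': ['McDonalds', 'Lidl', 'Lidl New York 123', 'London Lidl']})

# re.I is a truthy value, so if it lands in `case` matching stays case-sensitive
print(bool(re.I))

# Case-sensitive matching (the default): lowercase 'lidl' never matches 'Lidl'
sensitive = df['store'].str.contains(r'\blidl\b', case=True)

# Case-insensitive matching: pass the flag by keyword so it reaches `flags`
insensitive = df['store'].str.contains(r'\blidl\b', flags=re.I)

# df[mask] produces a separate DataFrame (made explicit here with .copy());
# assigning a column to it does not change df
subset = df[insensitive].copy()
subset['standard'] = 'Lidl'
print('standard' in df.columns)

# .loc selects the rows and the column in one step and writes into df itself
df.loc[insensitive, 'standard'] = 'Lidl'
print(df)
```

Passing case and flags by keyword avoids depending on the positional order at all.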

Another possible approach is Series.str.extract:

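# extract returns the text exactly as it appears in the data, so the casing is not normalized
df['standard'] = df['store'].str.extract(r'(?i)(\blidl\b)')
# alternative: pass the flag as an argument instead of the inline (?i)
# df['standard'] = df['store'].str.extract(r'(\blidl\b)', flags=re.I)
print(df)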
   id              store standard
0   1          McDonalds      NaN
1   2               Lidl     Lidl
2   3  Lidl New York 123     Lidl
3   4               KFC       NaN
4   5    Taco Restaurant      NaN
5   6        Lidl Berlin     Lidl
6   7            Popeyes      NaN
7   8           Wallmart      NaN
8   9               Aldi      NaN
9  10        London Lidl     Lidl
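
Since the question mentions standardizing many business names, the extract approach can be stretched to several brands in one pass. The sketch below is only one way to do it, under assumptions: the standard_names lookup and its entries are hypothetical, and whether this is faster than one str.contains call per brand is worth timing on the real data. It combines all search terms into a single alternation pattern, extracts whichever term matched, and maps the lowercased match to its standard name.

```python
import re
import pandas as pd

df = pd.DataFrame({'store': ['McDonalds', 'Lidl', 'Lidl New York 123', 'KFC ',
                             'Taco Restaurant', 'Lidl Berlin', 'Popeyes',
                             'Wallmart', 'Aldi', 'London Lidl']})

# Hypothetical lookup: lowercase search term -> standard name
standard_names = {'lidl': 'Lidl', 'mcdonalds': 'McDonalds', 'kfc': 'KFC'}

# One alternation pattern with word boundaries, so the column is scanned once
pattern = r'\b(' + '|'.join(re.escape(term) for term in standard_names) + r')\b'

# expand=False returns a Series holding the matched text (NaN where nothing matched)
matched = df['store'].str.extract(pattern, flags=re.I, expand=False)

# Lowercase the match so it lines up with the lookup keys, then map to the standard name
df['standard'] = matched.str.lower().map(standard_names)
print(df)
```

Rows that match none of the terms keep NaN in the new column, which makes it easy to see which store names still need a rule.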
