Indexing by str.contains(), then inserting a value into another column
I have a dataframe of store names that I have to standardize. For example, McDonalds 1234 LA -> McDonalds.
import pandas as pd
import re
df = pd.DataFrame({'id': pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10],dtype='int64',index=pd.RangeIndex(start=0, stop=10, step=1)), 'store': pd.Series(['McDonalds', 'Lidl', 'Lidl New York 123', 'KFC ', 'Taco Restaurant', 'Lidl Berlin', 'Popeyes', 'Wallmart', 'Aldi', 'London Lidl'],dtype='object',index=pd.RangeIndex(start=0, stop=10, step=1))}, index=pd.RangeIndex(start=0, stop=10, step=1))
print(df)
   id              store
0   1          McDonalds
1   2               Lidl
2   3  Lidl New York 123
3   4               KFC 
4   5    Taco Restaurant
5   6        Lidl Berlin
6   7            Popeyes
7   8           Wallmart
8   9               Aldi
9  10        London Lidl
So let's say I want to standardize the Lidl stores. The standard name will just be "Lidl".
I would like to find where Lidl is in the dataframe, create a new column df['standard_name'], and insert the standard name there. However, I can't figure this out.
I'll first create the column where the standard name will be inserted:
df['standard_name'] = pd.np.nan
Then search for instances of Lidl, and insert the cleaned name into standard_name.
The plan is to use str.contains and then set the standardized value in the new column:
df[df.store.str.contains(r'\blidl\b',re.I,regex=True)]['standard'] = 'Lidl'
print(df)
   id              store  standard_name
0   1          McDonalds            NaN
1   2               Lidl            NaN
2   3  Lidl New York 123            NaN
3   4               KFC             NaN
4   5    Taco Restaurant            NaN
5   6        Lidl Berlin            NaN
6   7            Popeyes            NaN
7   8           Wallmart            NaN
8   9               Aldi            NaN
9  10        London Lidl            NaN
Nothing has been inserted. I checked just the str.contains code alone, and found it returned False for every row:
df.store.str.contains(r'\blidl\b',re.I,regex=True)
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
Name: store, dtype: bool
I'm not sure what's happening here.
What I am trying to end up with is the standardized names filled in like this:
   id              store  standard_name
0   1          McDonalds            NaN
1   2               Lidl           Lidl
2   3  Lidl New York 123           Lidl
3   4               KFC             NaN
4   5    Taco Restaurant            NaN
5   6        Lidl Berlin           Lidl
6   7            Popeyes            NaN
7   8           Wallmart            NaN
8   9               Aldi            NaN
9  10        London Lidl           Lidl
I will be trying to standardize the majority of business names in the dataset (McDonalds, Burger King, etc.). Any help appreciated.
Also, is this the fastest way to do this? There are millions of rows to process.
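Your call passes re.I positionally, so it binds to the case parameter of str.contains rather than to flags; the match stays case-sensitive, and the lowercase pattern lidl never matches Lidl. In addition, df[mask]['standard'] = ... assigns to a temporary copy, so nothing is written back to df.
If you want to set a new column, you can use DataFrame.loc with case=False or flags=re.I: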
Note: df['standard_name'] = pd.np.nan is not necessary, you can omit it. The .loc assignment creates the column and leaves NaN in the rows that do not match.
df.loc[df.store.str.contains(r'\blidl\b', case=False), 'standard'] = 'Lidl'
# alternative
# df.loc[df.store.str.contains(r'\blidl\b', flags=re.I), 'standard'] = 'Lidl'
print(df)
   id              store  standard
0   1          McDonalds       NaN
1   2               Lidl      Lidl
2   3  Lidl New York 123      Lidl
3   4               KFC        NaN
4   5    Taco Restaurant       NaN
5   6        Lidl Berlin      Lidl
6   7            Popeyes       NaN
7   8           Wallmart       NaN
8   9               Aldi       NaN
9  10        London Lidl      Lidl
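To see why the attempt in the question returned all False, here is a minimal sketch on a smaller frame (a subset of the question's data). It compares passing re.I positionally with passing it by keyword, assuming the usual parameter order of Series.str.contains (pat, case, flags, ...). The variable names positional and keyword are only illustrative.

```python
import re
import pandas as pd

# Small illustrative frame; the values are a subset of the question's data.
df = pd.DataFrame({'store': ['McDonalds', 'Lidl', 'Lidl New York 123', 'London Lidl']})

# re.I passed positionally binds to `case`; it is truthy, so the match
# stays case-sensitive and the lowercase pattern finds nothing.
positional = df['store'].str.contains(r'\blidl\b', re.I, regex=True)

# Passing the flag by keyword makes the match case-insensitive.
keyword = df['store'].str.contains(r'\blidl\b', flags=re.I)

# .loc writes into the original frame, whereas df[mask]['standard'] = ...
# would assign to a temporary copy.
df.loc[keyword, 'standard'] = 'Lidl'
```

Here positional is False for every row, while keyword is True for the three Lidl rows, which is what the .loc assignment relies on.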
Another possible approach is Series.str.extract:
df['standard'] = df['store'].str.extract(r'(?i)(\blidl\b)', expand=False)
# alternative
# df['standard'] = df['store'].str.extract(r'(\blidl\b)', flags=re.I, expand=False)
print(df)
   id              store  standard
0   1          McDonalds       NaN
1   2               Lidl      Lidl
2   3  Lidl New York 123      Lidl
3   4               KFC        NaN
4   5    Taco Restaurant       NaN
5   6        Lidl Berlin      Lidl
6   7            Popeyes       NaN
7   8           Wallmart       NaN
8   9               Aldi       NaN
9  10        London Lidl      Lidl
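Since the question mentions standardizing many business names, one way to extend the .loc approach is to loop over a mapping from standard name to pattern. The sketch below is only an illustration: the patterns dict, its regexes, and the sample rows are hypothetical, and it has not been benchmarked on millions of rows. Note that str.extract returns the text as it appears in store (a row containing LIDL would yield LIDL), whereas this loop always writes the canonical spelling.

```python
import pandas as pd

# Illustrative data; the last row is made up to show a lowercase variant.
df = pd.DataFrame({'store': ['McDonalds', 'Lidl New York 123', 'KFC ',
                             'London Lidl', 'mcdonalds 1234 LA']})

# Hypothetical mapping: standard name -> regex that identifies the brand.
patterns = {
    'Lidl': r'\blidl\b',
    'McDonalds': r'\bmcdonalds\b',
}

for standard_name, pattern in patterns.items():
    # One vectorized, case-insensitive pass over the column per brand.
    mask = df['store'].str.contains(pattern, case=False, regex=True)
    # Rows that match no pattern keep NaN in the new column.
    df.loc[mask, 'standard'] = standard_name
```

Each brand costs one pass over the column, so the running time grows with the number of patterns; whether that is fast enough for the full dataset would need to be measured.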