Indexing by str.contains(), then inserting a value into another column

Question

I have a dataframe of store names that I have to standardize. For example, McDonalds 1234 LA -> McDonalds.

import pandas as pd
import re

df = pd.DataFrame({'id': pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10],dtype='int64',index=pd.RangeIndex(start=0, stop=10, step=1)), 'store': pd.Series(['McDonalds', 'Lidl', 'Lidl New York 123', 'KFC ', 'Taco Restaurant', 'Lidl Berlin', 'Popeyes', 'Wallmart', 'Aldi', 'London Lidl'],dtype='object',index=pd.RangeIndex(start=0, stop=10, step=1))}, index=pd.RangeIndex(start=0, stop=10, step=1))

print(df)

   id              store
0   1          McDonalds
1   2               Lidl
2   3  Lidl New York 123
3   4               KFC 
4   5    Taco Restaurant
5   6        Lidl Berlin
6   7            Popeyes
7   8           Wallmart
8   9               Aldi
9  10        London Lidl

So let's say I want to standardize the Lidl stores. The standard name will just be "Lidl".

I would like to find where Lidl appears in the dataframe, create a new column df['standard_name'], and insert the standard name there. However, I can't figure this out.

I'll first create the column where the standard name will be inserted:

df['standard_name'] = float('nan')

Then search for instances of Lidl and insert the cleaned name into standard_name.

The plan is to use str.contains to find the matching rows and then set the standardized value in the new column:

df[df.store.str.contains(r'\blidl\b',re.I,regex=True)]['standard'] = 'Lidl'

print(df)

   id              store  standard_name
0   1          McDonalds            NaN
1   2               Lidl            NaN
2   3  Lidl New York 123            NaN
3   4               KFC             NaN
4   5    Taco Restaurant            NaN
5   6        Lidl Berlin            NaN
6   7            Popeyes            NaN
7   8           Wallmart            NaN
8   9               Aldi            NaN
9  10        London Lidl            NaN

Nothing has been inserted. I ran the str.contains call on its own and found that it returned False for every row:

df.store.str.contains(r'\blidl\b',re.I,regex=True)

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
Name: store, dtype: bool

I'm not sure what's happening here.

What I am trying to end up with is the standardized names filled in like this:

   id              store  standard_name
0   1          McDonalds            NaN
1   2               Lidl           Lidl
2   3  Lidl New York 123           Lidl
3   4               KFC             NaN
4   5    Taco Restaurant            NaN
5   6        Lidl Berlin           Lidl
6   7            Popeyes            NaN
7   8           Wallmart            NaN
8   9               Aldi            NaN
9  10        London Lidl           Lidl

I will be trying to standardize the majority of business names in the dataset (McDonalds, Burger King, etc.). Any help is appreciated.

Also, is this the fastest way to do this? There are millions of rows to process.

Answer 1

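To set the new column, use DataFrame.loc with a case-insensitive mask, passing either case=False or flags=re.I. Because .loc selects the rows and the column in one step, the assignment reaches df itself rather than a temporary copy:

Note: creating the column beforehand with NaN values is not necessary; .loc adds the column if it does not exist.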

df.loc[df.store.str.contains(r'\blidl\b', case=False), 'standard'] = 'Lidl'
#alternative
#df.loc[df.store.str.contains(r'\blidl\b', flags=re.I), 'standard'] = 'Lidl'
print (df)
   id              store standard
0   1          McDonalds      NaN
1   2               Lidl     Lidl
2   3  Lidl New York 123     Lidl
3   4               KFC       NaN
4   5    Taco Restaurant      NaN
5   6        Lidl Berlin     Lidl
6   7            Popeyes      NaN
7   8           Wallmart      NaN
8   9               Aldi      NaN
9  10        London Lidl     Lidl
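
As for why the call in the question returned False for every row: the second positional parameter of Series.str.contains is case, not flags, so re.I passed positionally is read as a truthy case value and the search stays case-sensitive. Here is a minimal sketch of the difference, using a shortened version of the example data. The variable names sensitive and insensitive are only illustrative.

```python
import re
import pandas as pd

# Shortened version of the example data from the question.
df = pd.DataFrame({'store': ['McDonalds', 'Lidl', 'Lidl New York 123', 'London Lidl']})

# Case-sensitive search, which is what the question's call effectively did:
# the lowercase pattern never matches the capitalised 'Lidl'.
sensitive = df['store'].str.contains(r'\blidl\b', case=True)

# Passing the flag by keyword makes the search case-insensitive.
insensitive = df['store'].str.contains(r'\blidl\b', flags=re.I)

# One .loc call selects rows and column together, so the value is written
# into df itself; the column is created if it does not exist yet.
df.loc[insensitive, 'standard_name'] = 'Lidl'
print(df)
```

Rows that do not match are left as NaN in the new column.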

Another possible approach is Series.str.extract, which returns the matched text exactly as it appears in store (a row containing 'LIDL' would yield 'LIDL'):

df['standard'] = df['store'].str.extract(r'(?i)(\blidl\b)')
#alternative
#df['standard'] = df['store'].str.extract(r'(\blidl\b)', re.I)
print (df)
   id              store standard
0   1          McDonalds      NaN
1   2               Lidl     Lidl
2   3  Lidl New York 123     Lidl
3   4               KFC       NaN
4   5    Taco Restaurant      NaN
5   6        Lidl Berlin     Lidl
6   7            Popeyes      NaN
7   8           Wallmart      NaN
8   9               Aldi      NaN
9  10        London Lidl     Lidl
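
Since the question mentions standardizing many business names, the .loc approach can be repeated for each brand. This is a rough sketch, not a benchmarked solution: the patterns dictionary and its regular expressions are hypothetical and would need to be adapted to the real data.

```python
import pandas as pd

df = pd.DataFrame({'store': ['McDonalds', 'Lidl', 'Lidl New York 123', 'KFC ',
                             'Popeyes', 'Lidl Berlin', 'mcdonalds 1234 LA',
                             'London Lidl']})

# Hypothetical mapping of standard name -> regex pattern (assumption,
# not part of the original question or answer).
patterns = {
    'Lidl': r'\blidl\b',
    'McDonalds': r'\bmcdonalds\b',
    'KFC': r'\bkfc\b',
}

for standard_name, pattern in patterns.items():
    # One vectorized, case-insensitive pass over the column per brand.
    mask = df['store'].str.contains(pattern, case=False, regex=True)
    df.loc[mask, 'standard'] = standard_name

print(df)
```

If a store name matches more than one pattern, the last matching entry in the dictionary wins. Whether this is fast enough for millions of rows would need to be measured on the real data.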

solution1
3 ACCPTED 2020-01-17 14:05:40