Query Pandas dataframe for EXACT word in a column expanding contains

Question

Having a dataframe df with the following columns:

Index(['category', 'synonyms_text', 'enabled', 'stems_text'], dtype='object')

I am interested in getting only the rows where synonyms_text contains the word food on its own, and not seafood, for instance:

df_text= df_syn.loc[df_syn['synonyms_text'].str.contains('food')]

This gives the following result (which includes seafood, foodlocker and others that are not wanted):

            category  synonyms_text
130          Fishing  seafarm, seafood, shellfish, sportfish
141    Refrigeration  coldstorage, foodlocker, freeze, fridge, ice, refrigeration
183     Food Service  cook, fastfood, foodserve, foodservice, foodtruck, mealprep
200       Restaurant  expresso, food, galley, gastropub, grill, java, kitchen
377         fastfood  carryout, fastfood, takeout
379  Animal Supplies  feed, fodder, grain, hay, petfood
613            store  convenience, food, grocer, grocery, market

Then I converted the result to a list, to pick out just food as a word:

food_l=df_text['synonyms_text'].str.split().tolist()

However, the values in the list look like this:

['carryout,', 'fastfood,', 'takeout']

So I get rid of the commas:

food_l= [[x.replace(",","") for x in l]for l in food_l]

Then I pick out just the word food from the list of lists:

food_l= [[l for x in l if "food"==x]for l in food_l]

After that, I get rid of the empty lists:

food_l= [x for x in food_l if x != []]

Finally, I flatten the list of lists to get the final result:

food_l = [item for sublist in food_l for item in sublist]

And the final result is as follows:

[['bar', 'bistro', 'breakfast', 'buffet', 'cabaret', 'cafe', 'cantina', 'cappuccino', 'chai', 'coffee', 'commissary', 'cuisine', 'deli', 'dhaba', 'dine', 'diner', 'dining', 'eat', 'eater', 'eats', 'edible', 'espresso', 'expresso', 'food', 'galley', 'gastropub', 'grill', 'java', 'kitchen', 'latte', 'lounge', 'pizza', 'pizzeria', 'pub', 'publichouse', 'restaurant', 'roast', 'sandwich', 'snack', 'snax', 'socialhouse', 'steak', 'sub', 'sushi', 'takeout', 'taphouse', 'taverna', 'tea', 'tiffin', 'trattoria', 'treat', 'treatery'], ['convenience', 'food', 'grocer', 'grocery', 'market', 'mart', 'shop', 'store', 'variety']]

@Erfan This dataframe can be used as a test:

df= pd.DataFrame({'category':['Fishing','Refrigeration','store'],'synonyms_text':['seafood','foodlocker','food']})

Both of these give an empty result:

df_tmp=  df.loc[df['synonyms_text'].str.match('\bfood\b')]
df_tmp= df.loc[df['synonyms_text'].str.contains(pat='\bfood\b', regex= True)]

Do you know a better way to get just the rows with the single word food, without going through all this painful process? Is there a function other than contains that looks for an exact word match in the values of the dataframe?

Thanks

Answer 1

Example dataframe:

df = pd.DataFrame({'category':['Fishing','Refrigeration','store'],
                   'synonyms_text':['seafood','foodlocker','food']})

print(df)
        category synonyms_text
0        Fishing       seafood
1  Refrigeration    foodlocker
2          store          food # <-- we want only the rows with exact "food"

Three ways we can do this:

str.match
str.contains
str.extract (not very useful here)

# 1
df['synonyms_text'].str.match(r'\bfood\b')

# 2
df['synonyms_text'].str.contains(r'\bfood\b')

# 3
df['synonyms_text'].str.extract(r'(\bfood\b)').eq('food')

output

0    False
1    False
2     True
Name: synonyms_text, dtype: bool
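
As a side note on why both attempts in the question came back empty: in a normal (non-raw) Python string literal, '\b' is the backspace character (\x08), so the regex engine never sees a word-boundary token. A minimal sketch, reusing the test dataframe from the question (the variable names plain_pattern, raw_pattern and so on are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'category': ['Fishing', 'Refrigeration', 'store'],
                   'synonyms_text': ['seafood', 'foodlocker', 'food']})

# Without the r prefix, '\b' is a backspace character, not a word boundary.
plain_pattern = '\bfood\b'
raw_pattern = r'\bfood\b'

# The non-raw literal is really backspace + "food" + backspace.
is_backspace = plain_pattern == '\x08food\x08'

# Searching for that literal finds nothing in any row.
plain_hits = df['synonyms_text'].str.contains(plain_pattern).tolist()

# The raw pattern reaches the regex engine as \b and matches the whole word only.
raw_hits = df['synonyms_text'].str.contains(raw_pattern).tolist()

print(is_backspace, plain_hits, raw_hits)
```

So the only change needed in the question's own attempts should be the r prefix on the pattern.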

Finally, we use the boolean Series to filter the dataframe with .loc:

m = df['synonyms_text'].str.match(r'\bfood\b')
df.loc[m]

output

  category synonyms_text
2    store          food
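
One caveat for the real data in the question, where each cell holds several comma-separated words: str.match only tests at the start of each string (it behaves like re.match), so str.contains is probably the one to use there. A small sketch, assuming a dataframe named df_syn built from a few rows copied from the question's output:

```python
import pandas as pd

# Rows copied from the output shown in the question (a subset, for illustration).
df_syn = pd.DataFrame({
    'category': ['Fishing', 'Restaurant', 'Animal Supplies', 'store'],
    'synonyms_text': [
        'seafarm, seafood, shellfish, sportfish',
        'expresso, food, galley, gastropub, grill, java, kitchen',
        'feed, fodder, grain, hay, petfood',
        'convenience, food, grocer, grocery, market',
    ],
})

pattern = r'\bfood\b'

# str.match anchors at the start of the string, so "food" mid-cell is missed.
match_mask = df_syn['synonyms_text'].str.match(pattern)

# str.contains searches anywhere in the string.
contains_mask = df_syn['synonyms_text'].str.contains(pattern)

result = df_syn.loc[contains_mask, 'category'].tolist()
print(match_mask.tolist())
print(result)
```

Here str.match finds no rows at all, while str.contains keeps only Restaurant and store, the two rows that list food as a separate word.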

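Bonus:

To match case-insensitively, use the inline flag (?i) at the start of the pattern (Python 3.11 and later raise an error if a global flag appears anywhere else in the pattern):

For example:

df['synonyms_text'].str.match(r'(?i)\bfood\b')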

Which will match: food, Food, FOOD, fOoD
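
If you would rather not embed a flag in the pattern, the same effect should be available through the case and flags parameters of str.contains (str.match accepts them too). A minimal sketch on a throwaway Series s, which is not part of the question's data:

```python
import re
import pandas as pd

# Hypothetical sample values, just to exercise the case handling.
s = pd.Series(['food', 'Food', 'FOOD', 'fOoD', 'seafood'])

# Option A: let pandas add the ignore-case flag.
by_case = s.str.contains(r'\bfood\b', case=False)

# Option B: pass the re flag explicitly.
by_flag = s.str.contains(r'\bfood\b', flags=re.IGNORECASE)

print(by_case.tolist())
print(by_flag.tolist())
```

Both masks are True for the four spellings of food and False for seafood.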

solution1
1 ACCEPTED 2019-10-31 00:00:39