
Query Pandas dataframe for EXACT word in a column expanding contains

I have a dataframe df_syn with the following columns:

Index(['category', 'synonyms_text', 'enabled', 'stems_text'], dtype='object')

I want only the rows whose synonyms_text contains food as a standalone word, and not, for instance, seafood. I tried:

df_text= df_syn.loc[df_syn['synonyms_text'].str.contains('food')]

This gives the following result (which includes seafood, foodlocker and other unwanted matches):

           category   synonyms_text  \
130          Fishing  seafarm, seafood, shellfish, sportfish   
141   Refrigeration   coldstorage, foodlocker, freeze, fridge, ice, refrigeration   
183     Food Service  cook, fastfood, foodserve, foodservice, foodtruck, mealprep   
200       Restaurant  expresso, food, galley, gastropub, grill, java, kitchen
377         fastfood  carryout, fastfood, takeout
379  Animal Supplies  feed, fodder, grain, hay, petfood   
613            store  convenience, food, grocer, grocery, market

Then I converted the result to a list of lists so I could pick out food as a whole word:

food_l=df_text['synonyms_text'].str.split().tolist()

However, the values in the list still carry their commas:

['carryout,', 'fastfood,', 'takeout']

So I strip the commas:

food_l= [[x.replace(",","") for x in l]for l in food_l]

Then I keep only the lists that contain the exact word food:

food_l= [[l for x in l if "food"==x]for l in food_l]

After that, I drop the empty lists:

food_l= [x for x in food_l if x != []]

Finally, I flatten the list of lists to get the final result:

food_l = [item for sublist in food_l for item in sublist]

And the final result is as follows:

[['bar', 'bistro', 'breakfast', 'buffet', 'cabaret', 'cafe', 'cantina', 'cappuccino', 'chai', 'coffee', 'commissary', 'cuisine', 'deli', 'dhaba', 'dine', 'diner', 'dining', 'eat', 'eater', 'eats', 'edible', 'espresso', 'expresso', 'food', 'galley', 'gastropub', 'grill', 'java', 'kitchen', 'latte', 'lounge', 'pizza', 'pizzeria', 'pub', 'publichouse', 'restaurant', 'roast', 'sandwich', 'snack', 'snax', 'socialhouse', 'steak', 'sub', 'sushi', 'takeout', 'taphouse', 'taverna', 'tea', 'tiffin', 'trattoria', 'treat', 'treatery'], ['convenience', 'food', 'grocer', 'grocery', 'market', 'mart', 'shop', 'store', 'variety']]

@Erfan This dataframe can be used as a test:

df= pd.DataFrame({'category':['Fishing','Refrigeration','store'],'synonyms_text':['seafood','foodlocker','food']})

Both of these return an empty result:

df_tmp=  df.loc[df['synonyms_text'].str.match('\bfood\b')]
df_tmp= df.loc[df['synonyms_text'].str.contains(pat='\bfood\b', regex= True)]

Is there a better way to get just the rows with the single word food without going through this whole painful process? Is there a function other than contains that looks for an exact word match in the values of the dataframe?

Thanks

Example dataframe:

df = pd.DataFrame({'category':['Fishing','Refrigeration','store'],
                   'synonyms_text':['seafood','foodlocker','food']})

print(df)
        category synonyms_text
0        Fishing       seafood
1  Refrigeration    foodlocker
2          store          food # <-- we want only the rows with exact "food"

Three ways we can do this:

  1. str.match
  2. str.contains
  3. str.extract (not very useful here)
# 1
df['synonyms_text'].str.match(r'\bfood\b')
# 2
df['synonyms_text'].str.contains(r'\bfood\b')
# 3 (expand=False returns a Series instead of a one-column DataFrame)
df['synonyms_text'].str.extract(r'(\bfood\b)', expand=False).eq('food')

output

0    False
1    False
2     True
Name: synonyms_text, dtype: bool
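
One detail worth spelling out: the patterns above are raw strings (r'...'). The attempts in the question used '\bfood\b' without the r prefix, and in a normal Python string literal \b is a backspace character rather than a regex word boundary, which is most likely why they came back empty. A minimal sketch on the same test dataframe (the variable names plain, raw, no_raw and with_raw are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'category': ['Fishing', 'Refrigeration', 'store'],
                   'synonyms_text': ['seafood', 'foodlocker', 'food']})

# Without the r prefix, Python converts '\b' into a backspace character (\x08).
plain = '\bfood\b'
# With the r prefix, the backslash survives and the regex engine sees \b (word boundary).
raw = r'\bfood\b'

print(plain == '\x08food\x08')  # True
print(len(plain), len(raw))     # 6 8

no_raw = df['synonyms_text'].str.contains(plain, regex=True)
with_raw = df['synonyms_text'].str.contains(raw, regex=True)

print(no_raw.tolist())    # [False, False, False]
print(with_raw.tolist())  # [False, False, True]
```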

Finally, we use the boolean Series to filter the dataframe with .loc:

m = df['synonyms_text'].str.match(r'\bfood\b')
df.loc[m]

output

  category synonyms_text
2    store          food
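
On the small test dataframe the methods agree, but they can differ on comma-separated values like those in the question. str.match only tests from the start of each string, while str.contains searches anywhere in it. The sketch below uses a few abridged, illustrative rows modeled on the question's data (not the real dataframe) to show the difference:

```python
import pandas as pd

# Abridged, illustrative rows modeled on the question's data (an assumption).
df_syn = pd.DataFrame({
    'category': ['Fishing', 'Restaurant', 'store', 'Animal Supplies'],
    'synonyms_text': [
        'seafarm, seafood, shellfish, sportfish',
        'expresso, food, galley, gastropub',
        'food, grocer, grocery, market',
        'feed, fodder, grain, hay, petfood',
    ],
})

pattern = r'\bfood\b'

# str.match anchors at the beginning of each string.
at_start = df_syn['synonyms_text'].str.match(pattern)
# str.contains finds the whole word anywhere in the string.
anywhere = df_syn['synonyms_text'].str.contains(pattern)

print(at_start.tolist())  # [False, False, True, False]
print(anywhere.tolist())  # [False, True, True, False]
print(df_syn.loc[anywhere, 'category'].tolist())  # ['Restaurant', 'store']
```

For a column that holds a list of synonyms per row, str.contains with word boundaries is therefore probably the closer fit to what the question asks for.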

Bonus :

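To match case-insensitively, put the inline flag (?i) at the start of the pattern (Python 3.11 and later raise an error if a global flag appears anywhere else):

For example:

df['synonyms_text'].str.match(r'(?i)\bfood\b')

This will match food, Food, FOOD and fOoD.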
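
As an alternative to an inline flag, str.contains (and str.match) accept a case parameter and a flags parameter. A small sketch, using a made-up Series s purely for illustration:

```python
import re
import pandas as pd

# Hypothetical sample values, only for demonstration.
s = pd.Series(['food', 'Food', 'FOOD', 'fOoD', 'seafood'], name='synonyms_text')

# case=False makes the regex match case-insensitively.
by_case = s.str.contains(r'\bfood\b', case=False)
# Passing re.IGNORECASE through flags has the same effect here.
by_flag = s.str.contains(r'\bfood\b', flags=re.IGNORECASE)

print(by_case.tolist())  # [True, True, True, True, False]
print(by_flag.tolist())  # [True, True, True, True, False]
```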
