I have a dataframe df with the following columns:
Index(['category', 'synonyms_text', 'enabled', 'stems_text'], dtype='object')
I am interested in getting only the rows whose synonyms_text contains the word food on its own, and not words such as seafood. For instance:
df_text= df_syn.loc[df_syn['synonyms_text'].str.contains('food')]
This gives the following result (which contains seafood, foodlocker and other matches that are not wanted):
     category         synonyms_text  \
130  Fishing          seafarm, seafood, shellfish, sportfish
141  Refrigeration    coldstorage, foodlocker, freeze, fridge, ice, refrigeration
183  Food Service     cook, fastfood, foodserve, foodservice, foodtruck, mealprep
200  Restaurant       expresso, food, galley, gastropub, grill, java, kitchen
377  fastfood         carryout, fastfood, takeout
379  Animal Supplies  feed, fodder, grain, hay, petfood
613  store            convenience, food, grocer, grocery, market
Then I converted the result to a list of lists, so that I could pick out food as a whole word:
food_l=df_text['synonyms_text'].str.split().tolist()
However, the values in the list still carry trailing commas, like this:
['carryout,', 'fastfood,', 'takeout']
so, I get rid of commas:
food_l= [[x.replace(",","") for x in l]for l in food_l]
Then I keep just the word food from the list of lists:
food_l= [[l for x in l if "food"==x]for l in food_l]
After that, I get rid of the empty lists:
food_l= [x for x in food_l if x != []]
Finally, I flatten the list of lists to get the final result:
food_l = [item for sublist in food_l for item in sublist]
And the final result is as follows:
[['bar', 'bistro', 'breakfast', 'buffet', 'cabaret', 'cafe', 'cantina', 'cappuccino', 'chai', 'coffee', 'commissary', 'cuisine', 'deli', 'dhaba', 'dine', 'diner', 'dining', 'eat', 'eater', 'eats', 'edible', 'espresso', 'expresso', 'food', 'galley', 'gastropub', 'grill', 'java', 'kitchen', 'latte', 'lounge', 'pizza', 'pizzeria', 'pub', 'publichouse', 'restaurant', 'roast', 'sandwich', 'snack', 'snax', 'socialhouse', 'steak', 'sub', 'sushi', 'takeout', 'taphouse', 'taverna', 'tea', 'tiffin', 'trattoria', 'treat', 'treatery'], ['convenience', 'food', 'grocer', 'grocery', 'market', 'mart', 'shop', 'store', 'variety']]
@Erfan This dataframe can be used as test:
df= pd.DataFrame({'category':['Fishing','Refrigeration','store'],'synonyms_text':['seafood','foodlocker','food']})
Both of these return an empty dataframe:
df_tmp= df.loc[df['synonyms_text'].str.match('\bfood\b')]
df_tmp= df.loc[df['synonyms_text'].str.contains(pat='\bfood\b', regex= True)]
Do you know a better way to get just the rows with the single word food, without going through all this painful process? Is there another function, other than contains, that looks for an exact match in the values of the dataframe?
Thanks
Example dataframe:
df = pd.DataFrame({'category':['Fishing','Refrigeration','store'],
'synonyms_text':['seafood','foodlocker','food']})
print(df)
        category synonyms_text
0        Fishing       seafood
1  Refrigeration    foodlocker
2          store          food  # <-- we want only the rows with exact "food"
There are three ways we can do this:
str.match
str.contains
str.extract (not very useful here)
# 1
df['synonyms_text'].str.match(r'\bfood\b')
# 2
df['synonyms_text'].str.contains(r'\bfood\b')
# 3
df['synonyms_text'].str.extract(r'(\bfood\b)').eq('food')
output
0 False
1 False
2 True
Name: synonyms_text, dtype: bool
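A note on why the attempts in the question came back empty: without the r prefix, Python reads '\b' as a backspace character (\x08) rather than as the regex word boundary. A minimal sketch, reusing the example dataframe from above:

```python
import pandas as pd

df = pd.DataFrame({'category': ['Fishing', 'Refrigeration', 'store'],
                   'synonyms_text': ['seafood', 'foodlocker', 'food']})

# In a normal string literal, '\b' is the backspace character
assert '\b' == '\x08'

# Pattern is literally <backspace>food<backspace>, so nothing matches
no_raw = df['synonyms_text'].str.contains('\bfood\b', regex=True)

# Raw string: \b reaches the regex engine as a word boundary
raw = df['synonyms_text'].str.contains(r'\bfood\b', regex=True)

print(no_raw.tolist())
print(raw.tolist())
```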
Finally, we use the boolean series to filter the dataframe with .loc:
m = df['synonyms_text'].str.match(r'\bfood\b')
df.loc[m]
output
  category synonyms_text
2    store          food
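One caveat for the comma-separated strings in the original question: str.match only tests from the start of each string, while str.contains searches anywhere in it. The sketch below uses a small hypothetical sample (df_syn, modelled on the rows shown in the question, not the asker's real data) to compare the two, plus a regex-free alternative that splits each string into tokens. It assumes the separator is always a comma followed by one space.

```python
import pandas as pd

# Hypothetical sample, modelled on the rows shown in the question
df_syn = pd.DataFrame({
    'category': ['Fishing', 'Restaurant', 'fastfood'],
    'synonyms_text': ['seafarm, seafood, shellfish, sportfish',
                      'expresso, food, galley, gastropub',
                      'carryout, fastfood, takeout'],
})

# str.match anchors at the start of the string, so "food" in the
# middle of a list is not found
m_match = df_syn['synonyms_text'].str.match(r'\bfood\b')

# str.contains searches the whole string for the whole word
m_contains = df_syn['synonyms_text'].str.contains(r'\bfood\b')

# Regex-free alternative: split on ", " and test token membership
m_tokens = df_syn['synonyms_text'].str.split(', ').apply(
    lambda tokens: 'food' in tokens)

print(df_syn.loc[m_contains])
```

For multi-word cells like these, str.contains (or the token test) is probably the safer choice; str.match works in the small example above only because each cell holds a single word.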
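Bonus:
To match case-insensitively, use the inline flag (?i). Put it at the very start of the pattern, since Python 3.11 and later raise an error for a global flag placed anywhere else.
For example:
df['synonyms_text'].str.match(r'(?i)\bfood\b')
Which will match: food, Food, FOOD, fOoD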
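As an alternative to an inline flag, both str.match and str.contains accept case and flags arguments. A small sketch on a made-up series s (the values are only for illustration):

```python
import re
import pandas as pd

# Made-up values for illustration
s = pd.Series(['food', 'Food', 'FOOD', 'fOoD', 'seafood'])

# case=False makes the regex match case-insensitively
m_case = s.str.contains(r'\bfood\b', case=False)

# Equivalent: pass the re module flag explicitly
m_flags = s.str.contains(r'\bfood\b', flags=re.IGNORECASE)

print(m_case.tolist())
print(m_flags.tolist())
```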