Query Pandas dataframe for EXACT word in a column expanding contains
I have a dataframe df with the following columns:
Index(['category', 'synonyms_text', 'enabled', 'stems_text'], dtype='object')
I'm interested in getting only the rows where synonyms_text contains the word food, and not seafood. For example:
df_text= df_syn.loc[df_syn['synonyms_text'].str.contains('food')]
gives the following result (which includes seafood, foodlocker and other unwanted matches):
category synonyms_text \
130 Fishing seafarm, seafood, shellfish, sportfish
141 Refrigeration coldstorage, foodlocker, freeze, fridge, ice, refrigeration
183 Food Service cook, fastfood, foodserve, foodservice, foodtruck, mealprep
200 Restaurant expresso, food, galley, gastropub, grill, java, kitchen
377 fastfood carryout, fastfood, takeout
379 Animal Supplies feed, fodder, grain, hay, petfood
613 store convenience, food, grocer, grocery, market
Then I send the results to a list, to get food as a standalone word:
food_l=df_text['synonyms_text'].str.split().tolist()
However, the list values I get look like this:
['carryout,', 'fastfood,', 'takeout']
So I strip the commas:
food_l= [[x.replace(",","") for x in l]for l in food_l]
Then, finally, I keep the sublists that contain the word food:
food_l= [[l for x in l if "food"==x]for l in food_l]
After that, I get rid of the empty lists:
food_l= [x for x in food_l if x != []]
Finally, I flatten the list of lists to get the final result:
food_l = [item for sublist in food_l for item in sublist]
The final result is as follows:
[['bar', 'bistro', 'breakfast', 'buffet', 'cabaret', 'cafe', 'cantina', 'cappuccino', 'chai', 'coffee', 'commissary', 'cuisine', 'deli', 'dhaba', 'dine', 'diner', 'dining', 'eat', 'eater', 'eats', 'edible', 'espresso', 'expresso', 'food', 'galley', 'gastropub', 'grill', 'java', 'kitchen', 'latte', 'lounge', 'pizza', 'pizzeria', 'pub', 'publichouse', 'restaurant', 'roast', 'sandwich', 'snack', 'snax', 'socialhouse', 'steak', 'sub', 'sushi', 'takeout', 'taphouse', 'taverna', 'tea', 'tiffin', 'trattoria', 'treat', 'treatery'], ['convenience', 'food', 'grocer', 'grocery', 'market', 'mart', 'shop', 'store', 'variety']]
@Erfan This dataframe can be used as a test:
df= pd.DataFrame({'category':['Fishing','Refrigeration','store'],'synonyms_text':['seafood','foodlocker','food']})
Both of these return an empty result:
df_tmp= df.loc[df['synonyms_text'].str.match('\bfood\b')]
df_tmp= df.loc[df['synonyms_text'].str.contains(pat='\bfood\b', regex= True)]
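Do you know a better way to get the rows that contain the single word food without going through this whole painful process? Is there a function other than contains that finds an exact match for a value in a dataframe?
Thanks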
Example dataframe:
df = pd.DataFrame({'category':['Fishing','Refrigeration','store'],
'synonyms_text':['seafood','foodlocker','food']})
print(df)
category synonyms_text
0 Fishing seafood
1 Refrigeration foodlocker
2 store food # <-- we want only the rows with exact "food"
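We can do this in three ways:
str.match
str.contains
str.extract (not very useful here)
# 1
df['synonyms_text'].str.match(r'\bfood\b')
# 2
df['synonyms_text'].str.contains(r'\bfood\b')
# 3 (expand=False returns a Series instead of a one-column DataFrame)
df['synonyms_text'].str.extract(r'(\bfood\b)', expand=False).eq('food')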
output
0 False
1 False
2 True
Name: synonyms_text, dtype: bool
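The r prefix on the patterns above matters, and it is the likely reason the two attempts in the question came back empty. In an ordinary Python string literal, '\b' is the backspace character, so the regex engine never sees a word boundary. A minimal sketch using only the standard re module (the variable names are just for illustration):

```python
import re

non_raw = '\bfood\b'   # '\b' is interpreted by Python as backspace (\x08)
raw = r'\bfood\b'      # raw string: the regex engine receives \b, a word boundary

# The non-raw pattern looks for backspace + "food" + backspace, which never occurs.
non_raw_hit = re.search(non_raw, 'food') is not None
# The raw pattern matches "food" as a whole word, but not inside "seafood".
raw_hit = re.search(raw, 'food') is not None
raw_hit_seafood = re.search(raw, 'seafood') is not None

print(repr(non_raw), repr(raw))
print(non_raw_hit, raw_hit, raw_hit_seafood)
```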
Finally, we use the boolean Series to filter the dataframe with .loc:
m = df['synonyms_text'].str.match(r'\bfood\b')
df.loc[m]
output
category synonyms_text
2 store food
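One caveat when moving from the test frame to the real data: str.match only matches at the start of each string, so it works here because each test cell holds a single word. For comma-separated cells like those in the question's output, str.contains with the same word-boundary pattern is probably what you want. A small sketch, using a trimmed copy of four rows from the question (the frame below is an assumption built from that output, not the full df_syn):

```python
import pandas as pd

# Trimmed, illustrative copy of a few rows from the question's output.
df_syn = pd.DataFrame({
    'category': ['Fishing', 'Restaurant', 'Animal Supplies', 'store'],
    'synonyms_text': [
        'seafarm, seafood, shellfish, sportfish',
        'expresso, food, galley, gastropub, grill, java, kitchen',
        'feed, fodder, grain, hay, petfood',
        'convenience, food, grocer, grocery, market',
    ],
})

pattern = r'\bfood\b'

# str.match anchors at the beginning of the string, so it misses "food" mid-cell.
starts_with_food = df_syn['synonyms_text'].str.match(pattern)

# str.contains searches the whole string, so the word is found anywhere in the cell.
has_food_word = df_syn['synonyms_text'].str.contains(pattern)

result = df_syn.loc[has_food_word]
print(result)
```

On this sample, only the Restaurant and store rows are kept; seafood and petfood are rejected because there is no word boundary before "food" inside them.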
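Bonus:
To match case-insensitively, use the inline flag (?i). It has to come at the very start of the pattern; recent Python versions raise an error when a global flag appears anywhere else. For example:
df['synonyms_text'].str.match(r'(?i)\bfood\b')
This will match: food, Food, FOOD, fOoD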
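If you would rather not embed the flag in the pattern, pandas exposes the same behaviour through keyword arguments: str.match and str.contains both accept flags= and case=. A short sketch on a throwaway Series (the sample values are made up for illustration):

```python
import re
import pandas as pd

s = pd.Series(['food', 'Food', 'FOOD', 'fOoD', 'seafood'])

# Pass the regex flag explicitly instead of writing (?i) in the pattern.
via_flags = s.str.match(r'\bfood\b', flags=re.IGNORECASE)

# case=False is the pandas shorthand for a case-insensitive search.
via_case = s.str.contains(r'\bfood\b', case=False)

print(via_flags.tolist())
print(via_case.tolist())
```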