Query a Pandas dataframe for an EXACT word in a column (extended contains)

Question

I have a dataframe df with the following columns:

Index(['category', 'synonyms_text', 'enabled', 'stems_text'], dtype='object')

I am interested in getting only the rows where synonyms_text contains the word food, but not seafood. For example:

df_text= df_syn.loc[df_syn['synonyms_text'].str.contains('food')]

This gives the following result (which includes seafood, foodlocker and other unwanted matches):

           category   synonyms_text  \
130          Fishing  seafarm, seafood, shellfish, sportfish   
141   Refrigeration   coldstorage, foodlocker, freeze, fridge, ice, refrigeration   
183     Food Service  cook, fastfood, foodserve, foodservice, foodtruck, mealprep   
200       Restaurant  expresso, food, galley, gastropub, grill, java, kitchen
377         fastfood  carryout, fastfood, takeout
379  Animal Supplies  feed, fodder, grain, hay, petfood   
613            store  convenience, food, grocer, grocery, market

Then I send the result to a list, to get food as a standalone word:

food_l=df_text['synonyms_text'].str.split().tolist()

However, the values in the list look like this:

['carryout,', 'fastfood,', 'takeout']

So I remove the commas:

food_l= [[x.replace(",","") for x in l]for l in food_l]

Then I keep the lists that contain the word food:

food_l= [[l for x in l if "food"==x]for l in food_l]

After that, I get rid of the empty lists:

food_l= [x for x in food_l if x != []]

Finally, I flatten the list of lists to get the final result:

food_l = [item for sublist in food_l for item in sublist]

The final result is as follows:

[['bar', 'bistro', 'breakfast', 'buffet', 'cabaret', 'cafe', 'cantina', 'cappuccino', 'chai', 'coffee', 'commissary', 'cuisine', 'deli', 'dhaba', 'dine', 'diner', 'dining', 'eat', 'eater', 'eats', 'edible', 'espresso', 'expresso', 'food', 'galley', 'gastropub', 'grill', 'java', 'kitchen', 'latte', 'lounge', 'pizza', 'pizzeria', 'pub', 'publichouse', 'restaurant', 'roast', 'sandwich', 'snack', 'snax', 'socialhouse', 'steak', 'sub', 'sushi', 'takeout', 'taphouse', 'taverna', 'tea', 'tiffin', 'trattoria', 'treat', 'treatery'], ['convenience', 'food', 'grocer', 'grocery', 'market', 'mart', 'shop', 'store', 'variety']]

@Erfan This dataframe can be used for testing:

df= pd.DataFrame({'category':['Fishing','Refrigeration','store'],'synonyms_text':['seafood','foodlocker','food']})

Both of these return an empty result:

df_tmp=  df.loc[df['synonyms_text'].str.match('\bfood\b')]
df_tmp= df.loc[df['synonyms_text'].str.contains(pat='\bfood\b', regex= True)]

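Do you know a better way to get the rows with the single word food without going through this whole painful process? Is there a function other than contains that looks for an exact match of a value in the dataframe?

Thanks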

Answer 1

Example dataframe:

import pandas as pd

df = pd.DataFrame({'category':['Fishing','Refrigeration','store'],
                   'synonyms_text':['seafood','foodlocker','food']})

print(df)
        category synonyms_text
0        Fishing       seafood
1  Refrigeration    foodlocker
2          store          food # <-- we want only the rows with exact "food"

We can do this in three ways:

str.match
str.contains
str.extract (not very useful here)

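# 1: matches only at the start of each string
df['synonyms_text'].str.match(r'\bfood\b')

# 2: searches anywhere in each string
df['synonyms_text'].str.contains(r'\bfood\b')

# 3: expand=False returns a Series instead of a one-column DataFrame
df['synonyms_text'].str.extract(r'(\bfood\b)', expand=False).eq('food')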

output

0    False
1    False
2     True
Name: synonyms_text, dtype: bool
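
The two attempts in the question come back empty most likely because the pattern was written as an ordinary string literal: in Python, '\b' in a non-raw string is the backspace character (\x08), not the regex word boundary. A minimal sketch using only the standard re module (the variable names are illustrative, not from the question):

```python
import re

plain = '\bfood\b'   # '\b' becomes a backspace character (\x08)
raw = r'\bfood\b'    # raw string: the regex engine sees a word boundary

words = ['seafood', 'foodlocker', 'food']

# Search each word with both versions of the pattern
plain_hits = [w for w in words if re.search(plain, w)]
raw_hits = [w for w in words if re.search(raw, w)]

print(plain_hits)  # []
print(raw_hits)    # ['food']
```

The same applies to the patterns passed to str.match and str.contains, which is why the r prefix is used above.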

Finally, we use the boolean series to filter the dataframe with .loc

m = df['synonyms_text'].str.match(r'\bfood\b')
df.loc[m]

output

  category synonyms_text
2    store          food
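
One caveat when applying this to the comma-separated rows from the question: str.match only matches at the beginning of each string, whereas str.contains searches anywhere in it. The sketch below uses a few rows abbreviated from the question's output as assumed sample data. It also shows a regex-free alternative that splits on ', ' and tests membership, which assumes the separator is always a comma followed by one space.

```python
import pandas as pd

# Sample rows abbreviated from the question's output (assumed data)
df_syn = pd.DataFrame({
    'category': ['Fishing', 'Restaurant', 'fastfood', 'store'],
    'synonyms_text': [
        'seafarm, seafood, shellfish, sportfish',
        'expresso, food, galley, gastropub',
        'carryout, fastfood, takeout',
        'convenience, food, grocer, grocery, market',
    ],
})

pattern = r'\bfood\b'

# str.match is anchored at the start, so no row matches here
anchored = df_syn['synonyms_text'].str.match(pattern)

# str.contains finds the whole word anywhere in the string
anywhere = df_syn['synonyms_text'].str.contains(pattern)

# Regex-free alternative: split into words and test exact membership
tokens = df_syn['synonyms_text'].str.split(', ')
exact = tokens.apply(lambda words: 'food' in words)

print(df_syn.loc[anywhere, 'category'].tolist())  # ['Restaurant', 'store']
```

For multi-word cells like these, str.contains with the word-boundary pattern is probably the closest fit to what the question asks for.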

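Bonus:

To match case-insensitively, use the inline flag (?i). It has to come at the very start of the pattern; Python 3.11 and later raise an error if it appears anywhere else.

For example:

df['synonyms_text'].str.match(r'(?i)\bfood\b')

This will match: food, Food, FOOD, fOoD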
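
As an alternative to the inline flag, the pandas string methods also accept case and flags arguments. A small sketch, assuming a throwaway Series s made up for illustration:

```python
import re
import pandas as pd

# Illustrative data, not from the question
s = pd.Series(['food', 'Food', 'FOOD', 'fOoD', 'seafood'])

# case=False makes the regex match case-insensitively
by_case = s.str.contains(r'\bfood\b', case=False)

# Equivalent: pass the re module flag explicitly
by_flag = s.str.contains(r'\bfood\b', flags=re.IGNORECASE)

print(by_case.tolist())  # [True, True, True, True, False]
```

Either form keeps the pattern itself free of flags, which can be easier to read when the pattern is reused elsewhere.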

Solution 1
1 accepted 2019-10-31 00:00:39
