
How to filter a dataframe based on the values present in the list in the rows of a column in Python?

I have a dataframe which looks like:

   business_id  stars  categories
0  abcd         4.0    ['Nightlife']
1  abcd1        3.5    ['Pizza', 'Restaurants']
2  abcd2        4.5    ['Groceries', 'Food']

I want to filter the dataframe based on the values present in the categories column. My dataframe has approximately 400 000 rows, and I only want the rows whose categories include 'Food' or 'Restaurants'.

I tried a lot of methods, including:

def foodie(x):
    for row in x.itertuples():
        if 'Food' in row[3] or 'Restaurant' in row[3]:
            return x

df = df.apply(foodie, axis=1)

But this is obviously a very bad method, since I am using itertuples on 400 000 rows and my system just keeps processing indefinitely.

I also tried using a list comprehension inside df[df['categories']], but couldn't get it to work, since all the examples I found filter on a single value, like df[df['stars']==4.0]. Likewise, all the apply() examples I saw were applied to columns holding a single value per row.

So, how can I subset my dataframe with a reasonably fast implementation that selects only those rows which have 'Food' or 'Restaurants' in their categories?

You can use the apply method on the categories column to check whether each list contains 'Food' or 'Restaurants', and then use the resulting boolean Series to subset the dataframe:

df.loc[df.categories.apply(lambda cat: 'Food' in cat or 'Restaurants' in cat)]

#     business_id             categories      stars
# 1         abcd1   [Pizza, Restaurants]        3.5
# 2         abcd2      [Groceries, Food]        4.5
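A variation on the same idea, in case it helps: build the boolean mask with a plain list comprehension and a set, so adding more categories later only means editing one set. This is just a sketch on the three sample rows from the question; the names wanted, mask and filtered are made up for illustration, and I have not timed it on 400 000 rows.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'business_id': ['abcd', 'abcd1', 'abcd2'],
    'stars': [4.0, 3.5, 4.5],
    'categories': [['Nightlife'], ['Pizza', 'Restaurants'], ['Groceries', 'Food']],
})

# Categories to keep (hypothetical name, not from the original post)
wanted = {'Food', 'Restaurants'}

# True where a row's category list shares at least one entry with `wanted`
mask = [not wanted.isdisjoint(cats) for cats in df['categories']]

# Keep only the rows where the mask is True
filtered = df.loc[mask]
print(filtered)
```

Both versions still run one Python-level check per row, so neither is truly vectorized, but each row is visited only once.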

Just another idea: keep comma-joined strings instead of list objects.

In [2]: import pandas as pd

In [3]: data = {'business_id': ['abcd', 'abcd1', 'abcd2'], 'stars': [4.0, 3.5, 4.5], 'categories': [['Nightlife'], ['Pizza', 'Restaurants'], ['Groceries', 'Food']]}

In [4]: df = pd.DataFrame(data)

# convert list to string with join() method
In [15]: df.categories = df.categories.apply(",".join)

In [16]: df 
Out[16]: 
  business_id         categories  stars
0        abcd          Nightlife    4.0
1       abcd1  Pizza,Restaurants    3.5
2       abcd2     Groceries,Food    4.5

In [26]: df.categories.str.contains('Food')
Out[26]: 
0    False
1    False
2     True
Name: categories, dtype: bool
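To cover both categories from the question and actually subset the frame, the same str.contains call can take a regular expression. A minimal sketch, assuming the categories were already joined with commas as above; the names mask, exact_pattern and filtered_exact are only for illustration. Note that a plain substring match would also hit a category such as 'Fast Food', so a stricter pattern anchored on the commas is shown as well.

```python
import pandas as pd

# Sample data from the question, with categories already joined by ","
df = pd.DataFrame({
    'business_id': ['abcd', 'abcd1', 'abcd2'],
    'stars': [4.0, 3.5, 4.5],
    'categories': ['Nightlife', 'Pizza,Restaurants', 'Groceries,Food'],
})

# 'Food|Restaurants' is a regular expression meaning "Food or Restaurants"
mask = df['categories'].str.contains('Food|Restaurants')
filtered = df[mask]

# Stricter variant: match only whole comma-separated items, so that a
# category like 'Fast Food' is not picked up by accident
exact_pattern = r'(?:^|,)(?:Food|Restaurants)(?:,|$)'
filtered_exact = df[df['categories'].str.contains(exact_pattern)]
print(filtered_exact)
```

If the column can contain missing values, passing na=False to str.contains keeps the mask strictly boolean.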
