How to filter a dataframe based on the values present in the list in the rows of a column in Python?

Question

I have a dataframe which looks like:

   business_id  stars  categories
0  abcd         4.0    ['Nightlife']
1  abcd1        3.5    ['Pizza', 'Restaurants']
2  abcd2        4.5    ['Groceries', 'Food']

I want to filter the dataframe based on the values present in the categories column. My dataframe has approximately 400 000 rows and I only want the rows having categories 'Food' or 'Restaurants' in them.

I tried a lot of methods, including:

def foodie(x):
    for row in x.itertuples():
        if 'Food' in row[3] or 'Restaurant' in row[3]:
            return x

df = df.apply(foodie, axis=1)

But this is obviously very very bad method since, I am using itertuples on 400 000 rows and my system goes on processing for infinite amount of time.

I also tried using list comprehension in df[df['categories']] . But couldn't, since they all are filtering like df[df['stars']==4.0] . And even all the apply() methods I saw, were being implemented for columns having single value in their columns.

So, how can I subset my dataframe using a fairly fast implementation of iterating over my rows and at the same time, select only those rows which have 'Food' or 'Restaurants' in their category?

Answer 1

You can use the apply method on the categories column and check if each element contains the Food or Restaurants based on which create a logic index array for subsetting:

df.loc[df.categories.apply(lambda cat: 'Food' in cat or 'Restaurants' in cat)]

#     business_id             categories      stars
# 1         abcd1   [Pizza, Restaurants]        3.5
# 2         abcd2      [Groceries, Food]        4.5

Answer 2

Just another idea. Keep strings instead of list objects.

In [2]: import pandas as pd

In [3]: data = {'business_id':['abcd','abcd1','abcd2'],'stars':    [4.0,3.5,4.5],'categories':[['Nightlife'],['Pizza', 'Restaurants'],['Groceries', 'Food']]}
# convert list to string with join() method
In [15]: df.categories = df.categories.apply(",".join)

In [16]: df 
Out[16]: 
  business_id         categories  stars
0        abcd          Nightlife    4.0
1       abcd1  Pizza,Restaurants    3.5
2       abcd2     Groceries,Food    4.5

In [26]: df.categories.str.contains('Food')
Out[26]: 
0    False
1    False
2     True
Name: categories, dtype: bool

How to filter a dataframe based on the values present in the list in the rows of a column in Python?

Question

2 answers

solution1
3 ACCPTED 2016-06-25 20:29:55

solution2
0 2016-06-27 07:55:13

How to filter a dataframe based on the values present in the list in the rows of a column in Python?

Question

2 answers

solution1 3 ACCPTED 2016-06-25 20:29:55

solution2 0 2016-06-27 07:55:13

solution1
3 ACCPTED 2016-06-25 20:29:55

solution2
0 2016-06-27 07:55:13