I have a dataframe which looks like:
  business_id  stars                categories
0         abcd    4.0             ['Nightlife']
1        abcd1    3.5  ['Pizza', 'Restaurants']
2        abcd2    4.5     ['Groceries', 'Food']
I want to filter the dataframe based on the values present in the categories column. My dataframe has approximately 400 000 rows and I only want the rows having categories 'Food' or 'Restaurants' in them.
I tried a lot of methods, including:
def foodie(x):
    for row in x.itertuples():
        if 'Food' in row[3] or 'Restaurant' in row[3]:
            return x
df = df.apply(foodie, axis=1)
But this is obviously a very bad method, since I am using itertuples on 400 000 rows and my system keeps processing indefinitely.
I also tried using a list comprehension inside df[df['categories']], but couldn't make it work, since all the examples I found filter on a scalar comparison like df[df['stars']==4.0]. Even the apply() examples I saw were written for columns holding a single value per row.
So, how can I subset my dataframe with a reasonably fast implementation that selects only those rows which have 'Food' or 'Restaurants' in their categories?
You can use the apply method on the categories column to check whether each element contains Food or Restaurants, and then use the resulting boolean array to subset the dataframe:
df.loc[df.categories.apply(lambda cat: 'Food' in cat or 'Restaurants' in cat)]
# business_id categories stars
# 1 abcd1 [Pizza, Restaurants] 3.5
# 2 abcd2 [Groceries, Food] 4.5
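For anyone who wants to run this end to end, here is a minimal sketch of the same idea. It assumes the categories column holds real Python lists (not their string representation); the names `wanted`, `mask` and `filtered` are only illustrative. Checking against a set keeps the condition short when the list of wanted categories grows.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "business_id": ["abcd", "abcd1", "abcd2"],
    "stars": [4.0, 3.5, 4.5],
    "categories": [["Nightlife"], ["Pizza", "Restaurants"], ["Groceries", "Food"]],
})

# Hypothetical name: the categories we want to keep
wanted = {"Food", "Restaurants"}

# True for rows whose category list shares at least one entry with `wanted`
mask = df["categories"].apply(lambda cats: not wanted.isdisjoint(cats))

filtered = df.loc[mask]
print(filtered)
```

This still calls a Python function once per row, but it works on a single column rather than iterating over whole rows, which is usually much cheaper than itertuples inside apply(axis=1).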
Just another idea: keep strings instead of list objects.
In [2]: import pandas as pd
In [3]: data = {'business_id':['abcd','abcd1','abcd2'],'stars': [4.0,3.5,4.5],'categories':[['Nightlife'],['Pizza', 'Restaurants'],['Groceries', 'Food']]}
In [4]: df = pd.DataFrame(data)
# convert list to string with join() method
In [15]: df.categories = df.categories.apply(",".join)
In [16]: df
Out[16]:
  business_id         categories  stars
0         abcd          Nightlife    4.0
1        abcd1  Pizza,Restaurants    3.5
2        abcd2     Groceries,Food    4.5
In [26]: df.categories.str.contains('Food')
Out[26]:
0 False
1 False
2 True
Name: categories, dtype: bool
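To cover both 'Food' and 'Restaurants' with the string approach, one option is a regular expression with alternation. The sketch below assumes the categories have already been joined with commas as above; the extra 'Food Trucks' row is hypothetical and is only there to show that a plain substring match can pick up more than intended.

```python
import pandas as pd

df = pd.DataFrame({
    "business_id": ["abcd", "abcd1", "abcd2", "abcd3"],
    "stars": [4.0, 3.5, 4.5, 3.0],
    # Already comma-joined; the last row is a made-up edge case
    "categories": ["Nightlife", "Pizza,Restaurants", "Groceries,Food", "Food Trucks"],
})

# Substring match: also catches "Food Trucks"
loose = df["categories"].str.contains("Food|Restaurants")

# Whole-category match: the name must be bounded by a comma or the string edge
strict = df["categories"].str.contains(r"(?:^|,)(?:Food|Restaurants)(?:,|$)")

print(df[loose])
print(df[strict])
```

Which of the two masks is appropriate depends on whether categories such as 'Food Trucks' should count as 'Food'.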