How to filter a dataframe based on the values present in the list in the rows of a column in Python?
I have a dataframe which looks like:
  business_id  stars               categories
0        abcd    4.0            ['Nightlife']
1       abcd1    3.5  ['Pizza', 'Restaurants']
2       abcd2    4.5    ['Groceries', 'Food']
I want to filter the dataframe based on the values present in the categories column. My dataframe has approximately 400,000 rows and I only want the rows whose categories include 'Food' or 'Restaurants'.
I tried a lot of methods, including:
def foodie(x):
    for row in x.itertuples():
        if 'Food' in row[3] or 'Restaurant' in row[3]:
            return x

df = df.apply(foodie, axis=1)
But this is obviously a very bad method, since I am using itertuples on 400,000 rows and my system keeps processing indefinitely.
I also tried using a list comprehension inside df[df['categories']], but couldn't get it to work, since all the examples I found filter on a single value, like df[df['stars']==4.0]. Even the apply() examples I saw were written for columns that hold a single value per row.
So, how can I subset my dataframe with a reasonably fast way of going over the rows, selecting only those rows that have 'Food' or 'Restaurants' in their categories?
You can use the apply method on the categories column to check whether each element contains Food or Restaurants, and then use the resulting boolean array to subset the dataframe:
df.loc[df.categories.apply(lambda cat: 'Food' in cat or 'Restaurants' in cat)]
#   business_id            categories  stars
# 1       abcd1  [Pizza, Restaurants]    3.5
# 2       abcd2     [Groceries, Food]    4.5
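The same boolean mask can also be built from a set of wanted categories, which avoids adding another "or" clause for every new category. The following is only a sketch under a few assumptions: the names wanted, mask and mask_exploded are made up for illustration, the dataframe has a unique index, and Series.explode is available (pandas 0.25 or later). Whether the explode variant is actually faster than apply on 400,000 rows is something to measure rather than assume.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "business_id": ["abcd", "abcd1", "abcd2"],
    "stars": [4.0, 3.5, 4.5],
    "categories": [["Nightlife"], ["Pizza", "Restaurants"], ["Groceries", "Food"]],
})

# Hypothetical name: the set of categories we want to keep.
wanted = {"Food", "Restaurants"}

# Variant 1: apply with a set test. A row is kept when its list
# shares at least one element with `wanted`.
mask = df["categories"].apply(lambda cats: not wanted.isdisjoint(cats))
filtered = df[mask]

# Variant 2: explode each list into one row per category, test membership
# with isin, then collapse back to one boolean per original row.
# This relies on the original index being unique.
exploded = df["categories"].explode()
mask_exploded = exploded.isin(wanted).groupby(level=0).any()
filtered_exploded = df[mask_exploded]

print(filtered)
```

Both variants select the rows with business_id abcd1 and abcd2 on the sample data.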
Just another idea: keep strings instead of list objects.
In [2]: import pandas as pd
In [3]: data = {'business_id':['abcd','abcd1','abcd2'],'stars': [4.0,3.5,4.5],'categories':[['Nightlife'],['Pizza', 'Restaurants'],['Groceries', 'Food']]}
In [4]: df = pd.DataFrame(data)
# convert each list to a comma-separated string with the join() method
In [15]: df.categories = df.categories.apply(",".join)
In [16]: df
Out[16]:
  business_id         categories  stars
0        abcd          Nightlife    4.0
1       abcd1  Pizza,Restaurants    3.5
2       abcd2     Groceries,Food    4.5
In [26]: df.categories.str.contains('Food')
Out[26]:
0 False
1 False
2 True
Name: categories, dtype: bool
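The check above only looks for 'Food', while the question asks for 'Food' or 'Restaurants'. A possible extension, sketched here rather than taken from the original answer, is to pass a regular expression with alternation to str.contains and use the result as a mask. Note that str.contains matches substrings, so a category such as 'Fast Food' would also match; the fourth row below (abcd3) is a hypothetical addition to show that, and the exact variant is one assumed way to avoid it.

```python
import pandas as pd

# Categories already joined into comma-separated strings.
# The last row is hypothetical, added to show substring matching.
df = pd.DataFrame({
    "business_id": ["abcd", "abcd1", "abcd2", "abcd3"],
    "stars": [4.0, 3.5, 4.5, 3.0],
    "categories": ["Nightlife", "Pizza,Restaurants", "Groceries,Food", "Fast Food"],
})

# Regex alternation: True where the string contains either word.
mask = df["categories"].str.contains("Food|Restaurants")
filtered = df[mask]

# Exact matching: split the string back into tokens and compare
# whole category names, so 'Fast Food' does not count as 'Food'.
wanted = {"Food", "Restaurants"}
exact = df["categories"].str.split(",").apply(lambda cats: bool(wanted & set(cats)))
filtered_exact = df[exact]

print(filtered)
print(filtered_exact)
```

If substring matches are not acceptable for the real data, the exact variant (or keeping the lists and testing membership directly) is the safer choice.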