How to group by the values of a list in a DataFrame column in Python
I have a pandas DataFrame of movies like this:
id, name, genre, release_year
1 A [a,b,c] 2017
2 B [b,c] 2017
3 C [a,c] 2010
4 D [d,c] 2010
....
I want to group the movies by the values in the genre list. My expected output is:
year, genre, number_of_movies
2017 a 1
2017 b 2
2017 c 2
2010 a 1
2010 c 2
...
Can someone help me achieve this?
You can create a new DataFrame with the constructor, reshape it with stack, and count with groupby and size:
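# Passing the Series itself (not .values) keeps 'release_year' as the index name,
# so that column exists after reset_index and can be used in groupby
df1 = (pd.DataFrame(df['genre'].tolist(), index=df['release_year'])
         .stack()
         .reset_index(name='genre')
         .groupby(['release_year', 'genre'])
         .size()
         .reset_index(name='number_of_movies'))
print(df1)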
release_year genre number_of_movies
0 2010 a 1
1 2010 c 2
2 2010 d 1
3 2017 a 1
4 2017 b 2
5 2017 c 2
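On pandas 0.25 or newer, DataFrame.explode offers a shorter route to the same counts. This is a sketch of an alternative technique, not part of the answer above; the sample frame is rebuilt from the question's data and the name counts is illustrative.

```python
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['A', 'B', 'C', 'D'],
    'genre': [['a', 'b', 'c'], ['b', 'c'], ['a', 'c'], ['d', 'c']],
    'release_year': [2017, 2017, 2010, 2010],
})

counts = (df.explode('genre')                      # one row per (movie, genre)
            .groupby(['release_year', 'genre'])    # groups are sorted by key
            .size()                                # number of rows in each group
            .reset_index(name='number_of_movies'))
print(counts)
```

This should give the same six rows as the stack-based version, sorted by release_year and then genre.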
For better performance, use itertools.chain to flatten the genre column:
from itertools import chain
df = pd.DataFrame({
    'genre': list(chain.from_iterable(df.genre.tolist())),
    'release_year': df.release_year.repeat(df.genre.str.len())
})
df
genre release_year
0 a 2017
0 b 2017
0 c 2017
1 b 2017
1 c 2017
2 a 2010
2 c 2010
3 d 2010
3 c 2010
Now group by genre and release_year and take the size of each group:
(df.groupby(['genre', 'release_year'], sort=False)
   .size()
   .reset_index(name='number_of_movies'))
genre release_year number_of_movies
0 a 2017 1
1 b 2017 2
2 c 2017 2
3 a 2010 1
4 c 2010 2
5 d 2010 1
Another neat approach is to use Counter, i.e.:
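from collections import Counter
import numpy as np
# Each group's Counter is expanded into (release_year, genre) rows; if your pandas version
# emits NaN for pairs that never occur, append .dropna() before .reset_index()
ndf = df.groupby('release_year')['genre'].apply(lambda x: Counter(np.concatenate(x.values))).reset_index()
ndf = ndf.set_axis(['release_year', 'genre', 'number_of_movies'], axis=1)  # inplace= was removed in pandas 2.0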
Output:
release_year genre number_of_movies
0 2010 a 1.0
1 2010 c 2.0
2 2010 d 1.0
3 2017 a 1.0
4 2017 b 2.0
5 2017 c 2.0
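If you would rather not depend on how apply expands a Counter, a plain loop over the groups gives integer counts directly. A minimal sketch, assuming the question's sample data; the names rows and ndf are illustrative and not taken from the answer above.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['A', 'B', 'C', 'D'],
    'genre': [['a', 'b', 'c'], ['b', 'c'], ['a', 'c'], ['d', 'c']],
    'release_year': [2017, 2017, 2010, 2010],
})

rows = []
# Iterating a SeriesGroupBy yields (key, Series of genre lists) per year
for year, genres in df.groupby('release_year')['genre']:
    counts = Counter(chain.from_iterable(genres))   # genre -> count within this year
    rows.extend((year, genre, n) for genre, n in sorted(counts.items()))

ndf = pd.DataFrame(rows, columns=['release_year', 'genre', 'number_of_movies'])
print(ndf)
```

Because the counts never pass through an intermediate frame with missing cells, number_of_movies stays an integer column.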
Here is a collections.Counter approach. It has O(n) complexity and does not need df.groupby / df.apply:
from collections import Counter
from itertools import product, chain
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'name': ['A', 'B', 'C', 'D'],
                   'genre': [['a', 'b', 'c'], ['b', 'c'], ['a', 'c'], ['d', 'c']],
                   'year': [2017, 2017, 2010, 2010]})
c = Counter(chain.from_iterable([list(product([x['year']], x['genre']))
                                 for idx, x in df.iterrows()]))
# Counter({(2010, 'a'): 1,
# (2010, 'c'): 2,
# (2010, 'd'): 1,
# (2017, 'a'): 1,
# (2017, 'b'): 2,
# (2017, 'c'): 2})
df = pd.DataFrame.from_dict(c, orient='index')
# 0
# (2017, a) 1
# (2017, b) 2
# (2017, c) 2
# (2010, a) 1
# (2010, c) 2
# (2010, d) 1
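The frame built from the Counter is keyed by (year, genre) pairs and has a single unnamed count column. One way to reach the three-column layout the question asks for is sketched here, using zip instead of iterrows. The output column names are chosen to match the question, the 'year' column follows this answer's sample frame, and the names pairs and out are illustrative.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'name': ['A', 'B', 'C', 'D'],
                   'genre': [['a', 'b', 'c'], ['b', 'c'], ['a', 'c'], ['d', 'c']],
                   'year': [2017, 2017, 2010, 2010]})

# One (year, genre) pair per genre listed on each row
pairs = ((year, genre)
         for year, genres in zip(df['year'], df['genre'])
         for genre in genres)
c = Counter(pairs)

# Unpack the Counter into explicit columns, sorted by year and then genre
out = pd.DataFrame([(year, genre, n) for (year, genre), n in sorted(c.items())],
                   columns=['year', 'genre', 'number_of_movies'])
print(out)
```

Counting is still a single pass over the pairs; the final sort is only there to make the row order predictable.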