Frequency counts for a pandas column of lists

Question

I have a pandas DataFrame in which one column contains pipe-delimited strings. They are movie genres and look like this:

Genre
Adventure|Animation|Children|Comedy|Fantasy
Comedy|Romance
...

I used str.split to turn them into lists inside the cells, like this:

Genre 
[Adventure, Animation, Children, Comedy, Fantasy]
[Adventure, Children, Fantasy]
[Comedy, Romance]
[Comedy, Drama, Romance]
[Comedy]

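I want a count for every genre. For example, how many times does Comedy appear? How many times does Adventure appear, and so on? I can't seem to figure it out.

The result would look like this: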

Comedy    4
Adventure 2
Animation 1
(...and so on...)

Answer 1

I'm also in favor of chain + for.

Just to document it, another possible approach is get_dummies:

df['Genre'].str.get_dummies(sep='|').sum()
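To make the one-liner above easier to try, here is a minimal self-contained sketch. The DataFrame is rebuilt from the five sample rows in the question, and the name `genre_counts` and the final sort are only illustrative choices, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question (pipe-delimited genre strings).
df = pd.DataFrame({
    "Genre": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
        "Comedy|Drama|Romance",
        "Comedy",
    ]
})

# get_dummies builds one 0/1 indicator column per genre;
# summing each column gives the number of rows containing that genre.
genre_counts = (
    df["Genre"].str.get_dummies(sep="|").sum().sort_values(ascending=False)
)
print(genre_counts)
```

Because the indicator columns are 0/1, a genre that appears twice in the same row would be counted only once with this approach.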

Answer 2

As a member of the for-loop club, I recommend Python's C-accelerated routines (itertools.chain and collections.Counter) for performance.

from itertools import chain
from collections import Counter
import pandas as pd

pd.Series(
    Counter(chain.from_iterable(x.split('|') for x in df.Genre)))

Adventure    1
Animation    1
Children     1
Comedy       2
Fantasy      1
Romance      1
dtype: int64

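Why do I think CPython functions are better than pandas' "vectorized" string functions? String operations are inherently hard to vectorize. You can read more in "For loops with pandas - When should I care?".

If you have to deal with NaNs, you can call a function that handles the exception gracefully: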

def try_split(x):
    try:
        return x.split('|')
    except AttributeError:
        return []

pd.Series(
    Counter(chain.from_iterable(try_split(x) for x in df.Genre)))
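The question mentions that the column had already been converted to lists with str.split. In that case the same Counter + chain idea applies without splitting again. The sketch below assumes a small made-up DataFrame, and the helper `as_list` is a hypothetical name introduced here to skip missing values.

```python
from collections import Counter
from itertools import chain

import numpy as np
import pandas as pd

# Illustrative data: cells already hold lists, plus one missing value.
df = pd.DataFrame({
    "Genre": [
        ["Comedy", "Romance"],
        ["Comedy"],
        np.nan,
        ["Adventure", "Comedy"],
    ]
})

def as_list(x):
    # Hypothetical helper: treat anything that is not a list (e.g. NaN) as empty.
    return x if isinstance(x, list) else []

# Flatten all lists into one stream of genres and count them.
counts = pd.Series(Counter(chain.from_iterable(as_list(x) for x in df["Genre"])))
print(counts)
```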

Typically, you would do this with split, stack, and value_counts.

df['Genre'].str.split('|', expand=True).stack().value_counts()

Comedy       2
Romance      1
Children     1
Animation    1
Fantasy      1
Adventure    1
dtype: int64
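A related variant that is not in the original answers: on pandas 0.25 or later, Series.explode can replace the expand=True + stack step. The sketch below assumes the sample rows from the question plus one missing value added for illustration. It is not covered by the timings in this answer, so no performance claim is made for it.

```python
import pandas as pd

# Sample rows from the question, plus a missing value for illustration.
df = pd.DataFrame({
    "Genre": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
        "Comedy|Drama|Romance",
        "Comedy",
        None,
    ]
})

# Split each string into a list, give every list element its own row,
# then count. value_counts drops NaN by default, so the missing row is ignored.
counts = df["Genre"].str.split("|").explode().value_counts()
print(counts)
```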

Even for tiny DataFrames, the timing difference is obvious.

%timeit df['Genre'].str.get_dummies(sep='|').sum()
%timeit df['Genre'].str.split('|', expand=True).stack().value_counts()
%%timeit
pd.Series(
    Counter(chain.from_iterable(try_split(x) for x in df.Genre)))

2.8 ms ± 68.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.4 ms ± 210 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
320 µs ± 9.71 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

2 answers

Answer 1
3 2019-01-20 21:29:31

Answer 2
2 (accepted) 2019-01-20 21:24:41
