Pandas count frequencies within str series
Given a Pandas Series of type str, I want to get the frequencies of the result returned by str.split.
For example, given the Series
s = pd.Series(['abc,def,ghi','ghi,abc'])
I would like to get
abc: 2
def: 1
ghi: 2
as a result. How can I get this?
Edit: The solution should work efficiently with a large Series of 50 million rows.
Is this what you want?
In [29]: from collections import Counter
In [30]: Counter(s.str.split(',').sum())
Out[30]: Counter({'abc': 2, 'def': 1, 'ghi': 2})
or
In [34]: a = pd.Series(s.str.split(',').sum())
In [35]: a
Out[35]:
0 abc
1 def
2 ghi
3 ghi
4 abc
dtype: object
In [36]: a.groupby(a).size()
Out[36]:
abc 2
def 1
ghi 2
dtype: int64
Another pandas solution with str.split, sum and value_counts:
print(pd.Series(s.str.split(',').sum()).value_counts())
abc 2
ghi 2
def 1
dtype: int64
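As a side note, the sum step concatenates the per-row lists one after another, which gets slow as the Series grows. A minimal sketch of an alternative, assuming pandas 0.25 or newer (where Series.explode is available), flattens the split lists directly. The variable name counts is only for illustration:

```python
import pandas as pd

s = pd.Series(['abc,def,ghi', 'ghi,abc'])

# Split each row into a list, then give every list element its own row
# (Series.explode requires pandas >= 0.25).
tokens = s.str.split(',').explode()

# Count how often each token occurs.
counts = tokens.value_counts()
print(counts)
```

The order of tied counts in the printed result may differ between pandas versions, so look values up by label rather than by position.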
EDIT:
More efficient methods:
import pandas as pd
s = pd.Series(['abc,def,ghi','ghi,abc'])
s = pd.concat([s]*10000).reset_index(drop=True)
In [17]: %timeit pd.Series(s.str.split(',').sum()).value_counts()
1 loops, best of 3: 3.1 s per loop
In [18]: %timeit s.str.split(',', expand=True).stack().value_counts()
10 loops, best of 3: 46.2 ms per loop
In [19]: %timeit pd.DataFrame([ x.split(',') for x in s.tolist() ]).stack().value_counts()
10 loops, best of 3: 22.2 ms per loop
In [20]: %timeit pd.Series([item for sublist in [ x.split(',') for x in s.tolist() ] for item in sublist]).value_counts()
100 loops, best of 3: 16.6 ms per loop
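For a Series with tens of millions of rows, memory can matter as much as speed. The following is a sketch, not something benchmarked in this thread: it combines the Counter idea from the first answer with itertools.chain so that tokens are streamed into the counter without building one large intermediate list or DataFrame. The names counts and result are illustrative:

```python
from collections import Counter
from itertools import chain

import pandas as pd

s = pd.Series(['abc,def,ghi', 'ghi,abc'])

# Lazily split each row and feed the tokens straight into a Counter,
# so no flattened list of all tokens is materialized.
counts = Counter(chain.from_iterable(x.split(',') for x in s))

# Convert back to a pandas Series if a pandas result is needed.
result = pd.Series(dict(counts)).sort_index()
print(result)
```

This assumes every row is a plain str. Missing values (NaN) would need to be dropped first, for example with s.dropna().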