Extract and count unique hashtags per row from a pandas dataframe
I have a pandas dataframe df with a string column Posts, something like this:
df['Posts']
0 "This is an example #tag1"
1 "This too is an example #tag1 #tag2"
2 "Yup, still an example #tag1 #tag1 #tag3"
When I tried using the following code to count the number of hashtags,
count_hashtags = df['Posts'].str.extractall(r'(\#\w+)')[0].value_counts()
I get:
#tag1 4
#tag2 1
#tag3 1
But I need the result to count each hashtag at most once per row, something like this:
#tag1 3
#tag2 1
#tag3 1
Use drop_duplicates to get rid of duplicate tags per post, and then you can use value_counts:
df.Posts.str.extractall(
r'(\#\w+)'
).reset_index().drop_duplicates(['level_0', 0])[0].value_counts()
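To see why ['level_0', 0] is the subset to deduplicate on, it can help to look at the intermediate frame. The following is only a sketch: it rebuilds the example data from the question, and the names extracted, flat and unique_per_post are illustrative, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'Posts': ['This is an example #tag1',
                             'This too is an example #tag1 #tag2',
                             'Yup, still an example #tag1 #tag1 #tag3']})

# One row per match; the index is (original row label, match number).
extracted = df['Posts'].str.extractall(r'(#\w+)')

# reset_index turns both index levels into columns: 'level_0' (the post)
# and 'match' (position of the tag within the post). Column 0 holds the tag.
flat = extracted.reset_index()

# Deduplicating on (post, tag) keeps each tag at most once per post.
unique_per_post = flat.drop_duplicates(['level_0', 0])

counts = unique_per_post[0].value_counts()
print(counts)
```

The 'match' column has to be left out of the subset, because it differs for every occurrence of a tag within the same post and would keep the duplicates alive.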
A shorter alternative passes level=0 to reset_index, so only the row level becomes a column and drop_duplicates needs no subset:
df.Posts.str.extractall(
r'(\#\w+)'
).reset_index(level=0).drop_duplicates()[0].value_counts()
Both will output:
#tag1 3
#tag3 1
#tag2 1
Name: 0, dtype: int64
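Another way to get the same counts, not part of the answer above, is to collect the tags of each post into a deduplicated list and then flatten. This sketch assumes a pandas version that provides Series.explode (0.25 or later); the name tags_per_post is made up for illustration.

```python
import pandas as pd

posts = pd.Series(['This is an example #tag1',
                   'This too is an example #tag1 #tag2',
                   'Yup, still an example #tag1 #tag1 #tag3'])

# findall returns a list of tags per post; set() removes repeats
# within a post, and sorted() turns it back into a list.
tags_per_post = posts.str.findall(r'#\w+').map(lambda tags: sorted(set(tags)))

# explode gives one row per (post, tag). Posts without tags become NaN,
# which value_counts ignores by default.
counts = tags_per_post.explode().value_counts()
print(counts)
```

Whether this is faster or slower than extractall is not measured here.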
This is one solution using itertools.chain and collections.Counter:
import pandas as pd
from collections import Counter
from itertools import chain
s = pd.Series(['This is an example #tag1',
'This too is an example #tag1 #tag2',
'Yup, still an example #tag1 #tag1 #tag3'])
tags = s.map(lambda x: {i[1:] for i in x.split() if i.startswith('#')})
res = Counter(chain.from_iterable(tags))
print(res)
Counter({'tag1': 3, 'tag2': 1, 'tag3': 1})
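One caveat with splitting on whitespace: a tag followed directly by punctuation is kept with that punctuation, whereas the \w+ pattern used with extractall stops at it. The sketch below shows the difference on two invented posts (they are not from the question), using re.findall as a stand-in for the regex approach.

```python
import re
from collections import Counter
from itertools import chain

# Invented example posts; the first has a comma right after the tag.
posts = ['Love this #tag1, really', 'Another post with #tag1']

# Whitespace splitting keeps the trailing comma attached to the tag.
split_tags = [{w[1:] for w in p.split() if w.startswith('#')} for p in posts]
split_counts = Counter(chain.from_iterable(split_tags))

# A pattern restricted to word characters stops before the comma.
regex_tags = [set(re.findall(r'#(\w+)', p)) for p in posts]
regex_counts = Counter(chain.from_iterable(regex_tags))

print(split_counts)  # 'tag1,' and 'tag1' are counted separately
print(regex_counts)  # both posts contribute to 'tag1'
```

If the posts can contain punctuation next to tags, the set comprehension can be built from re.findall instead of str.split.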
Performance benchmarking
collections.Counter is ~2x as fast as pd.Series.str.extractall for a large series:
import pandas as pd
from collections import Counter
from itertools import chain
s = pd.Series(['This is an example #tag1',
'This too is an example #tag1 #tag2',
'Yup, still an example #tag1 #tag1 #tag3'])
def hal(s):
return s.str.extractall(r'(\#\w+)')\
.reset_index(level=0)\
.drop_duplicates()[0]\
.value_counts()
def jp(s):
tags = s.map(lambda x: {i[1:] for i in x.split() if i.startswith('#')})
return Counter(chain.from_iterable(tags))
s = pd.concat([s]*100000, ignore_index=True)
%timeit hal(s) # 2.76 s per loop
%timeit jp(s) # 1.25 s per loop
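%timeit is an IPython magic, so the snippet above only runs in IPython or Jupyter. A rough equivalent for a plain Python script, using the standard timeit module, might look like the sketch below. The series is repeated only 1,000 times here to keep the run short (an assumption of this sketch, the benchmark above used 100,000), and no timings are claimed because they depend on the machine and pandas version.

```python
import timeit
from collections import Counter
from itertools import chain

import pandas as pd

base = pd.Series(['This is an example #tag1',
                  'This too is an example #tag1 #tag2',
                  'Yup, still an example #tag1 #tag1 #tag3'])
s = pd.concat([base] * 1000, ignore_index=True)

def hal(s):
    # pandas approach: extract, deduplicate per row, count
    return (s.str.extractall(r'(#\w+)')
             .reset_index(level=0)
             .drop_duplicates()[0]
             .value_counts())

def jp(s):
    # pure-Python approach: one set of tags per row, then a Counter
    tags = s.map(lambda x: {i[1:] for i in x.split() if i.startswith('#')})
    return Counter(chain.from_iterable(tags))

# Sanity check that both approaches agree before timing them.
hal_counts, jp_counts = hal(s), jp(s)
assert all(hal_counts['#' + tag] == n for tag, n in jp_counts.items())

for func in (hal, jp):
    best = min(timeit.repeat(lambda: func(s), number=3, repeat=3)) / 3
    print(f'{func.__name__}: {best:.4f} s per call')
```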