Extract and count unique hashtags per row from a pandas DataFrame

Question

I have a pandas DataFrame df with a string column Posts, something like this:

df['Posts']
0       "This is an example #tag1"
1       "This too is an example #tag1 #tag2"
2       "Yup, still an example #tag1 #tag1 #tag3"

When I use the following code to count the hashtags,

count_hashtags = df['Posts'].str.extractall(r'(\#\w+)')[0].value_counts()

I get:

#tag1             4
#tag2             1
#tag3             1

But I need each hashtag counted at most once per row, so the result should look like this:

#tag1             3
#tag2             1
#tag3             1

Answer 1

Use drop_duplicates to get rid of duplicate tags within each post, and then use value_counts:

df.Posts.str.extractall(
    r'(\#\w+)'
).reset_index().drop_duplicates(['level_0', 0])[0].value_counts()

A shorter alternative passes level=0 to reset_index, so that only the original row number becomes a column and drop_duplicates needs no subset argument:

df.Posts.str.extractall(
    r'(\#\w+)'
).reset_index(level=0).drop_duplicates()[0].value_counts()

Both will output:

#tag1    3
#tag3    1
#tag2    1
Name: 0, dtype: int64
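To see why the deduplication has to take the row number into account, it can help to look at the intermediate frame. The following is a minimal sketch using the example data from the question; the variable names extracted, per_post and counts are illustrative only and not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({'Posts': [
    'This is an example #tag1',
    'This too is an example #tag1 #tag2',
    'Yup, still an example #tag1 #tag1 #tag3',
]})

# extractall returns one row per match, indexed by
# (original row number, match number within that row)
extracted = df['Posts'].str.extractall(r'(#\w+)')

# Move the original row number out of the index into a column, so that
# drop_duplicates compares (row, tag) pairs rather than tags alone
per_post = extracted.reset_index(level=0).drop_duplicates()

# Each remaining row is one distinct tag in one post
counts = per_post[0].value_counts()
print(counts)
```

Dropping duplicates on the tag column alone would collapse every tag to a single row and give a count of 1 for each, which is why the row number has to be part of the comparison.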

Answer 2

This is one solution using itertools.chain and collections.Counter:

import pandas as pd
from collections import Counter
from itertools import chain

s = pd.Series(['This is an example #tag1',
               'This too is an example #tag1 #tag2',
               'Yup, still an example #tag1 #tag1 #tag3'])

tags = s.map(lambda x: {i[1:] for i in x.split() if i.startswith('#')})

res = Counter(chain.from_iterable(tags))

print(res)

Counter({'tag1': 3, 'tag2': 1, 'tag3': 1})
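One caveat: splitting on whitespace keeps trailing punctuation, so '#tag1,' and '#tag1' would be counted as different tags, whereas the regular expression in Answer 1 stops at the first non-word character. A possible variant, sketched here under the assumption that tags consist of word characters only, keeps the Counter approach but extracts tags with re.findall. The helper name count_unique_tags and the comma in the second sample post are made up for illustration.

```python
import re
from collections import Counter
from itertools import chain

# '#' followed by one or more word characters; the group drops the '#'
TAG_RE = re.compile(r'#(\w+)')

def count_unique_tags(posts):
    """Count, for each tag, the number of posts that contain it at least once."""
    # A set per post removes repeats within that post
    tags_per_post = (set(TAG_RE.findall(post)) for post in posts)
    return Counter(chain.from_iterable(tags_per_post))

posts = ['This is an example #tag1',
         'This too is an example #tag1, #tag2',   # note the comma after #tag1
         'Yup, still an example #tag1 #tag1 #tag3']

res = count_unique_tags(posts)
print(res)
```

Unlike the split-based version, this pattern also matches a '#' in the middle of a word (for example 'foo#bar' yields 'bar'), so which behaviour is preferable depends on the data.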

Performance benchmarking

collections.Counter is ~2x as fast as pd.Series.str.extractall for a large series:

import pandas as pd
from collections import Counter
from itertools import chain

s = pd.Series(['This is an example #tag1',
               'This too is an example #tag1 #tag2',
               'Yup, still an example #tag1 #tag1 #tag3'])

def hal(s):
    return s.str.extractall(r'(\#\w+)')\
            .reset_index(level=0)\
            .drop_duplicates()[0]\
            .value_counts()

def jp(s):
    tags = s.map(lambda x: {i[1:] for i in x.split() if i.startswith('#')})
    return Counter(chain.from_iterable(tags))

s = pd.concat([s]*100000, ignore_index=True)

%timeit hal(s)  # 2.76 s per loop
%timeit jp(s)   # 1.25 s per loop
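%timeit is an IPython magic, so the two lines above only run in IPython or Jupyter. In a plain Python script, a rough equivalent with the standard timeit module might look like the sketch below. The value of n_copies is an arbitrary choice, smaller than the 100000 used above to keep the run short, and the timings will vary with the machine and the pandas version.

```python
import timeit
from collections import Counter
from itertools import chain

import pandas as pd

def hal(s):
    # Answer 1: regex extraction, then de-duplicate (row, tag) pairs
    return (s.str.extractall(r'(#\w+)')
             .reset_index(level=0)
             .drop_duplicates()[0]
             .value_counts())

def jp(s):
    # Answer 2: one set of tags per row, then count across rows
    tags = s.map(lambda x: {i[1:] for i in x.split() if i.startswith('#')})
    return Counter(chain.from_iterable(tags))

base = pd.Series(['This is an example #tag1',
                  'This too is an example #tag1 #tag2',
                  'Yup, still an example #tag1 #tag1 #tag3'])

n_copies = 1000  # assumption: kept small so the script finishes quickly
s = pd.concat([base] * n_copies, ignore_index=True)

for func in (hal, jp):
    runs = 3
    seconds = timeit.timeit(lambda: func(s), number=runs) / runs
    print(f'{func.__name__}: {seconds:.4f} s per call')
```

The two functions are not drop-in replacements for each other: hal returns a pandas Series keyed by '#tag1', while jp returns a Counter keyed by 'tag1' without the leading '#'.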

Answer 1 (accepted, score 2), posted 2018-05-29 10:52:22

Answer 2 (score 1), posted 2018-05-29 10:46:15
