
Extract and count unique hashtags per row from a pandas dataframe

I have a pandas dataframe df with a string column Posts that looks like this:

df['Posts']
0       "This is an example #tag1"
1       "This too is an example #tag1 #tag2"
2       "Yup, still an example #tag1 #tag1 #tag3"

When I try to count the hashtags with the following code,

count_hashtags = df['Posts'].str.extractall(r'(\#\w+)')[0].value_counts()

I get

#tag1             4
#tag2             1
#tag3             1

But I need each hashtag to be counted at most once per row, like this:

#tag1             3
#tag2             1
#tag3             1

Use drop_duplicates to remove the repeated hashtags within each post, then call value_counts:

df.Posts.str.extractall(
    r'(\#\w+)'
).reset_index().drop_duplicates(['level_0', 0])[0].value_counts()

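A shorter alternative is to pass level=0 to reset_index, so that only the row number and the tag remain as columns and drop_duplicates needs no arguments: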

df.Posts.str.extractall(
    r'(\#\w+)'
).reset_index(level=0).drop_duplicates()[0].value_counts()

Both will output:

#tag1    3
#tag3    1
#tag2    1
Name: 0, dtype: int64
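
To make the intermediate steps visible, here is a minimal self-contained sketch of the same drop_duplicates approach. It assumes the dataframe has the default unnamed index (so the row level comes out of reset_index as a column called level_0); the variable names matches, per_row_unique and counts are only illustrative. The trailing Name/dtype line of the printed result may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'Posts': [
    'This is an example #tag1',
    'This too is an example #tag1 #tag2',
    'Yup, still an example #tag1 #tag1 #tag3',
]})

# One row per regex match; the index is (original row, match number)
# and the single capture group lands in column 0.
matches = df['Posts'].str.extractall(r'(#\w+)')

# Move the original row number into a column ('level_0' for an unnamed
# index), then drop rows that repeat the same (row, tag) pair.
per_row_unique = matches.reset_index(level=0).drop_duplicates()

# Count how many distinct posts each tag appears in.
counts = per_row_unique[0].value_counts()
print(counts)
```

The duplicate #tag1 in the third post is removed before counting, which is why #tag1 ends up at 3 instead of 4.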

Here is a solution using itertools.chain and collections.Counter:

import pandas as pd
from collections import Counter
from itertools import chain

s = pd.Series(['This is an example #tag1',
               'This too is an example #tag1 #tag2',
               'Yup, still an example #tag1 #tag1 #tag3'])

tags = s.map(lambda x: {i[1:] for i in x.split() if i.startswith('#')})

res = Counter(chain.from_iterable(tags))

print(res)

Counter({'tag1': 3, 'tag2': 1, 'tag3': 1})
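
One caveat: splitting on whitespace keeps any punctuation attached to a tag, so a token such as "#tag1," would be counted as "tag1," rather than "tag1". If that matters for your data, a possible variant (a sketch, not part of the answer above) is to keep the Counter approach but extract the tags with the same \w+ pattern used in the regex solution. The names pattern, unique_tags and the sample string below are hypothetical.

```python
import re
from collections import Counter
from itertools import chain

import pandas as pd

s = pd.Series(['This is an example #tag1',
               'This too is an example #tag1 #tag2',
               'Yup, still an example #tag1 #tag1 #tag3'])

# Same pattern as the extractall solution, without the capture group.
pattern = re.compile(r'#\w+')

# A set per post removes repeats within that post.
unique_tags = s.map(lambda text: set(pattern.findall(text)))

# Flatten the per-post sets and count each tag once per post.
res = Counter(chain.from_iterable(unique_tags))
print(res)

# Hypothetical post showing the difference in punctuation handling.
sample = 'Nice #tag1, right?'
split_tags = {w[1:] for w in sample.split() if w.startswith('#')}
regex_tags = set(pattern.findall(sample))
```

Here split_tags contains 'tag1,' (with the comma), while regex_tags contains '#tag1'. Note that this variant keeps the leading '#' in the keys, unlike the split-based version.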

Performance benchmarking

collections.Counter is roughly 2x faster than pd.Series.str.extractall on a large series (300,000 rows here):

import pandas as pd
from collections import Counter
from itertools import chain

s = pd.Series(['This is an example #tag1',
               'This too is an example #tag1 #tag2',
               'Yup, still an example #tag1 #tag1 #tag3'])

def hal(s):
    return s.str.extractall(r'(\#\w+)')\
            .reset_index(level=0)\
            .drop_duplicates()[0]\
            .value_counts()

def jp(s):
    tags = s.map(lambda x: {i[1:] for i in x.split() if i.startswith('#')})
    return Counter(chain.from_iterable(tags))

s = pd.concat([s]*100000, ignore_index=True)

%timeit hal(s)  # 2.76 s per loop
%timeit jp(s)   # 1.25 s per loop
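
%timeit is an IPython magic, so the two lines above only work in IPython or Jupyter. In a plain Python script, a rough equivalent can be sketched with the standard timeit module. The repetition factor (1000) and number=3 below are arbitrary choices to keep the run short, and the timings will depend on your machine and pandas version.

```python
import timeit
from collections import Counter
from itertools import chain

import pandas as pd

base = pd.Series(['This is an example #tag1',
                  'This too is an example #tag1 #tag2',
                  'Yup, still an example #tag1 #tag1 #tag3'])

def hal(s):
    # pandas-only approach: extract, deduplicate per row, count
    return (s.str.extractall(r'(\#\w+)')
             .reset_index(level=0)
             .drop_duplicates()[0]
             .value_counts())

def jp(s):
    # set per post, then a Counter over the flattened sets
    tags = s.map(lambda x: {i[1:] for i in x.split() if i.startswith('#')})
    return Counter(chain.from_iterable(tags))

# Smaller than the 100000 repetitions above, to keep the script quick.
big = pd.concat([base] * 1000, ignore_index=True)

for func in (hal, jp):
    seconds = timeit.timeit(lambda: func(big), number=3) / 3
    print(f'{func.__name__}: {seconds:.4f} s per call')
```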
