
Frequency of each word in pandas DataFrame

I have a pandas DataFrame like the one below:

   key  message                                          Final Category
0  1    I have not received my gifts which I ordered ok  voucher
1  2    hth her wells idyll McGill kooky bbc.co          noclass
2  3    test test test 1 test                            noclass
3  4    test                                             noclass
4  5    hello where is my reward points                  other
5  6    hi, can you get koovs coupons or vouchers here   options
6  7    Hi Hey when you people will include amazon an    options

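I want to get a data structure of the form {key: {key: value}, ..}, where I first group by Final Category and then, for each category, have a dictionary with the frequency of each word. For example, grouping all the noclass rows would give: {'noclass': {'test': 5, '1': 1, 'hth': 1, 'her': 1 ....},}

I am new to SOF, so sorry for not writing this well. Thanks.

There may be a more elegant way to do this, but here is a set of nested for loops: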

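final_cat_list = df['Final Category'].unique()

word_count = {}
for f in final_cat_list:
    word_count[f] = {}
    # all messages in category f (assumes the text column is named 'message', as in the table above)
    message_list = list(df.loc[df['Final Category'] == f, 'message'])
    for m in message_list:
        word_list = m.split(" ")
        for w in word_list:
            # increment the count for a known word, or start a new word at 1
            if w in word_count[f]:
                word_count[f][w] += 1
            else:
                word_count[f][w] = 1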

This modifies the original df, so you may want to copy it first:

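from collections import Counter

# append a space so words from adjacent messages do not fuse when concatenated
df["message"] = df["message"].apply(lambda message: message + " ")
# select only the text column: summing strings concatenates them per group
df.groupby("Final Category")["message"].sum().apply(lambda message: Counter(message.split()))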

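What this code does: first, it appends a space to the end of every message. This matters in the next step. It then groups by "Final Category" and sums the messages within each group. This is where the trailing space is important; without it, the last word of one message would be glued to the first word of the next. (Summing strings concatenates them.)

Finally, the concatenated string is split on whitespace to get the words, which are then counted.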
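If mutating df is undesirable, the same group, concatenate, and count idea can be sketched without the trailing-space step. This is only an illustration, not the answer's own code: it assumes the text column is named "message", swaps the string sum for `agg(" ".join)` so the separator is added during concatenation, and uses a few rows copied from the question's table.

```python
from collections import Counter

import pandas as pd

# Sample rows copied from the question's table.
df = pd.DataFrame({
    "key": [2, 3, 4, 5],
    "message": [
        "hth her wells idyll McGill kooky bbc.co",
        "test test test 1 test",
        "test",
        "hello where is my reward points",
    ],
    "Final Category": ["noclass", "noclass", "noclass", "other"],
})

# Join each group's messages with a space, so df itself is never modified.
joined = df.groupby("Final Category")["message"].agg(" ".join)

# Count whitespace-separated words per category and build the nested dict.
word_count = {
    category: dict(Counter(text.split()))
    for category, text in joined.items()
}

print(word_count["noclass"]["test"])  # 5
```

Wrapping each Counter in `dict` gives the plain {key: {key: value}} structure the question asks for; the Counter objects could also be kept as they are.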

import pandas as pd

# copy/paste data (you can skip this since you already have a DataFrame)
data = {0 : {'key': 1 , 'message': "I have not received my gifts which I ordered ok",     'Final Category': 'voucher'},
        1 : {'key': 2 , 'message': "hth her wells idyll McGill kooky bbc.co",             'Final Category': 'noclass'},
        2 : {'key': 3 , 'message': "test test test 1 test",                               'Final Category': 'noclass'},
        3 : {'key': 4 , 'message': "test",                                                'Final Category': 'noclass'},
        4 : {'key': 5 , 'message': "hello where is my reward points",                     'Final Category': 'other'},
        5 : {'key': 6 , 'message': "hi, can you get koovs coupons or vouchers here",      'Final Category': 'options'},
        6 : {'key': 7 , 'message': "Hi Hey when you people will include amazon an",       'Final Category': 'options'}
        }

# make DataFrame (you already have one)
df = pd.DataFrame(data).T

# break up text into words, then concatenate the word lists per 'Final Category'
df['message'] = df['message'].str.split(' ')
final_df = df.groupby('Final Category')[['message']].agg('sum')

# make final dictionary: for each category, count every word in its combined list
final_dict = {}
for label, text in zip(final_df.index, final_df['message']):
    final_dict[label] = {w: text.count(w) for w in text}
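The dictionary comprehension above calls `text.count(w)` once per word, so each category's list is rescanned many times. As a hedged alternative for that last step only, `collections.Counter` tallies each list in a single pass. In this sketch `words_by_category` is a hypothetical stand-in for the per-category word lists in `final_df`, filled with two categories from the question's data.

```python
from collections import Counter

# Hypothetical stand-in for the combined word lists, one list per category.
words_by_category = {
    "noclass": "test test test 1 test".split(" ") + ["test"],
    "other": "hello where is my reward points".split(" "),
}

# Counter tallies each list in one pass instead of calling list.count per word.
final_dict = {
    label: dict(Counter(words))
    for label, words in words_by_category.items()
}

print(final_dict["noclass"])  # {'test': 5, '1': 1}
```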
