
Counting str frequencies on Pandas

Here is a sample of my data, created with Pandas:

word = pd.Series([['a', 'b', 'c', 'd'], ['b', 'c'], ['c', 'd'], ['a', 'b', 'c']])

I want to get the frequencies (1) and the corpus data (2).

(1) Frequencies (sorted)

c : 4
b : 3
d : 2
a : 2

(2) Corpus data (unsorted)

corpus = ['a b c d', 'b c', 'c d', 'a b c']

How can I get these? Any help would be appreciated.

I'm using Python for Korean NLP. Here is my code:

import numpy as np
import pandas as pd

import itertools as it
from khaiii import KhaiiiApi # Korean NLP

df = pd.read_csv('https://drive.google.com/u/0/uc?id=1IZ1NYJmbabv6Xo7WJeqRcDFl1Z5pumni&export=download', encoding = 'utf-8')

api = KhaiiiApi()

def parse(sentence):
    # Keep only nouns (NNG, NNP), verbs (VV) and adjectives (VA).
    pos = ((morph.lex, morph.tag) for word in api.analyze(sentence) for morph in word.morphs
           if morph.tag in ('NNG', 'NNP', 'VV', 'VA'))
    # Nouns are kept as-is; verbs and adjectives get the '다' suffix.
    words = [lex if tag in ('NNG', 'NNP') else f'{lex}다' for lex, tag in pos]
    return words

df['내용'] = df["내용"].str.replace(",", "") 

split = df.내용.str.split(".")
split = split.apply(lambda x: pd.Series(x))
split = split.stack().reset_index(level=1,drop=True).to_frame('sentences')
df = df.merge(split, left_index=True, right_index=True, how='left')
df = df.drop(['내용'], axis = 1)
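# Assign the result back instead of using inplace=True on a selected column,
# which is not guaranteed to modify df in recent pandas versions.
df['sentences'] = df['sentences'].replace(['', ' '], np.nan)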
df.dropna(subset=['sentences'], inplace=True)

df['reconstruct'] = df['sentences'].apply(parse)

You can get the frequencies by calling value_counts after explode (requires pandas 0.25+):

word.explode().value_counts()
c    4
b    3
d    2
a    2
dtype: int64
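
For reference, here is a self-contained sketch of the counting step, assuming the token lists are stored in a Series as in the question. The `Counter` variant is not part of the answer above; it is only one possible fallback for pandas versions older than 0.25, where `explode` is not available.

```python
from collections import Counter
from itertools import chain

import pandas as pd

word = pd.Series([['a', 'b', 'c', 'd'], ['b', 'c'], ['c', 'd'], ['a', 'b', 'c']])

# explode() puts every list element on its own row; value_counts() then
# counts the rows and sorts the counts from highest to lowest.
freq = word.explode().value_counts()

# Alternative without explode(): flatten the lists and count with Counter.
counter = Counter(chain.from_iterable(word))

print(freq.to_dict())
print(counter.most_common())
```

Both approaches should give the same counts; the order of tokens with equal counts (here `a` and `d`) may differ between them.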

You can get the corpus by joining the values of each list:

corpus = [' '.join(v) for k, v in word.to_dict().items()]
print(corpus)
['a b c d', 'b c', 'c d', 'a b c']
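
The same two steps should carry over to the `reconstruct` column from the question. This is a minimal sketch, assuming each cell of that column holds a list of strings. The small DataFrame below is a hypothetical stand-in for the output of `parse`, and `Series.str.join` is used as a shorter alternative to the list comprehension above.

```python
import pandas as pd

# Hypothetical stand-in for df['reconstruct'] (one list of tokens per sentence).
df = pd.DataFrame({
    'reconstruct': [['a', 'b', 'c', 'd'], ['b', 'c'], ['c', 'd'], ['a', 'b', 'c']]
})

# (1) Frequencies, sorted from most to least common.
freq = df['reconstruct'].explode().value_counts()

# (2) Corpus: join each token list with spaces, keeping the original row order.
corpus = df['reconstruct'].str.join(' ').tolist()

print(freq)
print(corpus)
```

Rows whose list is empty become NaN after `explode`, and `value_counts` skips NaN by default, so sentences that produced no tokens do not affect the counts.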

