Counting str frequencies on Pandas
This is my data sample made with pandas:

word = pd.Series([['a', 'b', 'c', 'd'], ['b', 'c'], ['c', 'd'], ['a', 'b', 'c']])

I would like to get (1) frequencies and (2) corpus data.
(1) frequencies (sorted)
b : 3
c : 3
d : 2
a : 2
(2) corpus data (not sorted)
corpus = ['a b c d', 'b c', 'c d', 'a b c']
How can I get these? I need help.

I use Python for Korean NLP; this is my code:
import numpy as np
import pandas as pd
import itertools as it
from khaiii import KhaiiiApi  # Korean morphological analyzer

df = pd.read_csv('https://drive.google.com/u/0/uc?id=1IZ1NYJmbabv6Xo7WJeqRcDFl1Z5pumni&export=download', encoding='utf-8')
api = KhaiiiApi()

def parse(sentence):
    # keep only common nouns (NNG), proper nouns (NNP), verbs (VV) and adjectives (VA)
    pos = ((morph.lex, morph.tag) for word in api.analyze(sentence) for morph in word.morphs if morph.tag in ['NNG', 'VV', 'VA', 'NNP'])
    # verb/adjective stems get the suffix '다' appended
    words = [item[0] if item[1] in ('NNG', 'NNP') else f'{item[0]}다' for item in pos]
    return words

df['내용'] = df['내용'].str.replace(',', '')
split = df.내용.str.split('.')
split = split.apply(lambda x: pd.Series(x))
split = split.stack().reset_index(level=1, drop=True).to_frame('sentences')
df = df.merge(split, left_index=True, right_index=True, how='left')
df = df.drop(['내용'], axis=1)
df['sentences'].replace('', np.nan, inplace=True)
df['sentences'].replace(' ', np.nan, inplace=True)
df.dropna(subset=['sentences'], inplace=True)
df['reconstruct'] = df['sentences'].apply(parse)
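The sentence-splitting part of the pipeline above (split on '.', stack into one row per sentence, merge back, drop blanks) can be reproduced without the CSV or khaiii. A minimal sketch on a toy frame, using a made-up two-row '내용' column as a stand-in for the real data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real CSV; column name '내용' as in the question
df = pd.DataFrame({'내용': ['a b. c d', 'e f. g']})

# Same split/stack pattern: one row per sentence, original index preserved
split = df['내용'].str.split('.')
split = split.apply(lambda x: pd.Series(x))
split = split.stack().reset_index(level=1, drop=True).to_frame('sentences')
df = df.merge(split, left_index=True, right_index=True, how='left')

# Clean up: trim whitespace, drop empty sentences
df['sentences'] = df['sentences'].str.strip()
df['sentences'] = df['sentences'].replace('', np.nan)
df.dropna(subset=['sentences'], inplace=True)

print(df['sentences'].tolist())  # ['a b', 'c d', 'e f', 'g']
```

Note that each original row index is repeated once per sentence, which is what lets the later `apply(parse)` run per sentence while keeping the link back to the source row.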
You can get the frequencies with value_counts after explode (pandas 0.25+):
word.explode().value_counts()
c 4
b 3
d 2
a 2
dtype: int64
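If you are stuck on a pandas version before 0.25 (no Series.explode), the same counts can be had with collections.Counter. A small sketch:

```python
from collections import Counter
import pandas as pd

word = pd.Series([['a', 'b', 'c', 'd'], ['b', 'c'], ['c', 'd'], ['a', 'b', 'c']])

# Flatten the lists of words and count; most_common() sorts by frequency
counts = Counter(w for row in word for w in row)
print(counts.most_common())  # [('c', 4), ('b', 3), ('a', 2), ('d', 2)]
```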
You can get the corpus values with:
corpus = [' '.join(v) for k, v in word.to_dict().items()]
print(corpus)
['a b c d', 'b c', 'c d', 'a b c']
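As a slightly shorter alternative, Series.str.join works directly on a Series of lists, so the to_dict round-trip isn't needed:

```python
import pandas as pd

word = pd.Series([['a', 'b', 'c', 'd'], ['b', 'c'], ['c', 'd'], ['a', 'b', 'c']])

# str.join concatenates each list's elements with the given separator
corpus = word.str.join(' ').tolist()
print(corpus)  # ['a b c d', 'b c', 'c d', 'a b c']
```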