Counting tokenized words in a data frame with pandas (Python)
I have created tokenized data (text) in a pandas DataFrame in Python.
I just want to count the tokens and produce output showing how often each element in the tokenized data occurs.
Here is the code I used to create the tokenized data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
def tokenize(txt):
    tokens = re.split(r'\W+', txt)  # raw string avoids the invalid escape sequence '\W'
    return tokens
Complains['clean_text_tokenized'] = Complains['clean text'].apply(lambda x: tokenize(x.lower()))
# Complains['clean text'] is the original text column of the data
Complains['clean_text_tokenized'].head(10)
Here is the output of the tokenized data:
0 [comcast, cable, internet, speeds]
1 [payment, disappear, service, got, disconnected]
2 [speed, and, service]
3 [comcast, imposed, a, new, usage, cap, of, 300...
4 [comcast, not, working, and, no, service, to, ...
5 [isp, charging, for, arbitrary, data, limits, ...
6 [throttling, service, and, unreasonable, data,...
7 [comcast, refuses, to, help, troubleshoot, and...
8 [comcast, extended, outages]
9 [comcast, raising, prices, and, not, being, av...
Name: clean_text_tokenized, dtype: object
Any advice would be helpful.
You can use Counter:
from collections import Counter
# ... and then
def tokenize(txt):
    return Counter(re.split(r'\W+', txt))  # count each token while splitting
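Applied per row, that version gives you a Counter (a dict-like word-to-frequency mapping) for each text. A minimal sketch of that usage, reusing the Complains frame from the question (the 'token_counts' column name is just illustrative):

# each cell becomes e.g. Counter({'comcast': 1, 'cable': 1, ...})
Complains['token_counts'] = Complains['clean text'].apply(lambda x: tokenize(x.lower()))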
Here is a quick test in Python:
from collections import Counter
import pandas as pd
import re
Complains = pd.DataFrame({'clean text':['comcast, cable, internet, speeds', 'payment, disappear, service, got, disconnected']})
Complains['clean_text_tokenized'] = Complains['clean text'].str.findall(r'\w+')
# flatten the per-row token lists, then count every token across all rows
freq = Counter([item for sublist in Complains['clean_text_tokenized'].to_list() for item in sublist])
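Since freq is an ordinary Counter, you can list the most frequent tokens directly or convert the counts to a pandas Series for sorting and plotting; a short follow-up sketch (freq_series is an illustrative name):

print(freq.most_common(5))  # top five (token, count) pairs
# as a Series sorted by frequency, ready for e.g. freq_series.plot(kind='bar')
freq_series = pd.Series(freq).sort_values(ascending=False)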