Columns of words appearing in text from a data frame column, with their frequencies in R

Question

I have a question relating to this old post: R Text mining - how to change texts in R data frame column into several columns with word frequencies?

I am trying to do the same thing as in the post linked above, using R, but with strings that contain numeric characters.

Suppose res is my data frame, defined by:

library(qdap)
x1 <- as.factor(c( "7317 test1 fool 4258 6287" , "thi1s is 6287 test funny text1 test1", "this is test1 6287 text1 funny fool"))
y1 <- as.factor(c("test2 6287", "this is test text2", "test2 6287"))
z1 <- as.factor(c( "test2 6287" , "this is test 4258 text2 fool", "test2 6287"))
res <- data.frame(x1, y1, z1)

When I calculate the word frequencies using these commands,

freqs <- t(wfm(as.factor(res$x1), 1:nrow(res), char.keep=TRUE))
abcd <- data.frame(res, freqs, check.names = FALSE)

abcd ignores 7317, 4258 and 6287, and even drops the digit 1 from test1, before counting the frequencies.

In the first row of column x1, the 1 is stripped from test1 and the remainder is counted as a word. Similarly, is is stripped from thi1s and counted as a word. However, what I want counted is test1 itself. Likewise, 7317, 4258 and so on, which are stored as strings, must be counted as words and appear in the data table with their frequencies. What do I need to add to the code?

Answer 1

You need to add the following to the freqs statement: removeNumbers = FALSE. The wfm function calls several other functions, and one of them is tm::TermDocumentMatrix. The default that wfm supplies to this function is removeNumbers = TRUE, so it needs to be set to FALSE.

Code:

freqs <- t(wfm(as.factor(res$x1), 1:nrow(res), char.keep=TRUE, removeNumbers = FALSE))
abcd <- data.frame(res, freqs, check.names = FALSE)
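
As a cross-check that does not depend on qdap or tm, the same kind of table can be sketched in base R. This is not part of the answer above, only an illustration under one assumption: that a "word" is any run of characters separated by whitespace, so digits are never removed. The names tokens, vocab, freqs_base and abcd_base are made up for this sketch.

```r
# Base R only; an illustrative alternative, not the qdap/tm approach above.
x1 <- c("7317 test1 fool 4258 6287",
        "thi1s is 6287 test funny text1 test1",
        "this is test1 6287 text1 funny fool")

# Split each text on runs of whitespace; digits stay inside their tokens.
tokens <- strsplit(trimws(tolower(x1)), "\\s+")

# Vocabulary: every distinct token across all texts, sorted.
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary entry per text (result: vocab x texts), then transpose.
counts <- vapply(
  tokens,
  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
  integer(length(vocab))
)
freqs_base <- t(counts)
colnames(freqs_base) <- vocab
rownames(freqs_base) <- seq_along(x1)

# check.names = FALSE keeps column names such as "4258" unchanged.
abcd_base <- data.frame(x1, freqs_base, check.names = FALSE)
print(abcd_base)
```

Comparing the column names of abcd_base with those of abcd is a quick way to see whether tokens such as test1 and 7317 survived in the qdap result.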
