Word columns appearing in text from a data frame column, with their frequencies, in R

I have a question relating to this old post: R Text mining - how to change texts in R data frame column into several columns with word frequencies?

I am trying to reproduce what is described in the post linked above, using R, but with strings that contain numeric characters.

Suppose res is my data frame, defined by:

library(qdap)
x1 <- as.factor(c("7317 test1 fool 4258 6287", "thi1s is 6287 test funny text1 test1", "this is test1 6287 text1 funny fool"))
y1 <- as.factor(c("test2 6287", "this is test text2", "test2 6287"))
z1 <- as.factor(c("test2 6287", "this is test 4258 text2 fool", "test2 6287"))
res <- data.frame(x1, y1, z1)

When I calculate the word frequencies with these commands,

freqs <- t(wfm(as.factor(res$x1), 1:nrow(res), char.keep=TRUE))
abcd <- data.frame(res, freqs, check.names = FALSE)

abcd ignores 7317, 4258 and 6287, and even drops the digit 1 from test1, before counting the frequencies.

In the first row of column x1, the 1 is stripped from test1 and the remainder is counted as a word. Similarly, is is stripped from thi1s and counted as a word. What I want instead is for test1 to be counted as it stands. Likewise, strings such as 7317 and 4258 must be counted as words and appear in the data table with their frequencies. What needs to be added to the code?

You need to add the following argument to the freqs statement: removeNumbers = FALSE. The wfm function calls several other functions, one of which is tm::TermDocumentMatrix. By default, wfm passes removeNumbers = TRUE to that function, so it has to be set to FALSE explicitly.

Code:

freqs <- t(wfm(as.factor(res$x1), 1:nrow(res), char.keep=TRUE, removeNumbers = FALSE))
abcd <- data.frame(res, freqs, check.names = FALSE)
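
As a way to cross-check the result, here is a minimal base R sketch that builds the same kind of frequency table without qdap. It is not the wfm approach from the answer: it simply lower-cases the text and splits it on whitespace (an assumption about what counts as a word), so digits stay inside tokens such as test1 and standalone numbers such as 7317 are kept. The names tokens and vocab are illustrative only.

```r
# Base R alternative (not qdap::wfm): split on whitespace so that digits are
# never stripped from tokens.
x1 <- c("7317 test1 fool 4258 6287",
        "thi1s is 6287 test funny text1 test1",
        "this is test1 6287 text1 funny fool")

# Tokenize each row: lower-case, then split on runs of whitespace.
tokens <- strsplit(tolower(x1), "\\s+")

# Vocabulary: every distinct token, sorted for a stable column order.
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary token per row; result has one row per text entry
# and one column per token.
freqs <- t(vapply(tokens,
                  function(tok) as.integer(table(factor(tok, levels = vocab))),
                  integer(length(vocab))))
colnames(freqs) <- vocab

# check.names = FALSE keeps column names such as "7317" unchanged.
abcd <- data.frame(x1 = x1, freqs, check.names = FALSE)
```

If the columns produced by wfm with removeNumbers = FALSE differ from these, the difference most likely comes from how each approach treats punctuation and case, which this sketch handles only in the simplest way.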
