Creating a document frequency matrix in R

Question

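I am trying to create a document frequency matrix (dfm) in R.

I currently have a data frame (df_2) that consists of 2 columns:

doc_num: identifies which document each term comes from
text_token: contains each tokenized word belonging to each document

The df has dimensions 79,447 x 2.

However, those 79,447 rows cover only 400 actual documents.

I have been trying to create this dfm using the tm package.

I tried creating a corpus (VectorSource) and then coercing it into a dfm with the aptly named "dfm" command.

However, this fails with the message "dfm() only works on character, corpus, dfm, tokens objects." I understand that my data is not currently in the right format for the dfm command to work. My problem is that I don't know how to get from where I am now to a matrix like the one shown below.

An example of what I would like the matrix to look like when complete:

where 2 is the number of times cat appears in doc_2.

Any help with this would be greatly appreciated.

Is mise le meas.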

Answer 1

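It would help you and others if you included all the relevant details with your code, such as the fact that dfm() comes from the quanteda package. If the underlying text is set up correctly, dfm() will give you exactly what you are looking for; that is precisely what it is designed to do. Here is a simulation: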

library(quanteda)
# install.packages("readtext")
library(readtext)

doc1 <- "COVID-19 can be beaten if all ensure social distance, social distance is critical"
doc2 <- "COVID-19 can be defeated through early self isolation, self isolation is your responsibility"
doc3 <- "Corona Virus can be beaten through early detection & slowing of spread, Corona Virus can be beaten, Yes, Corona Virus can be beaten"
doc4 <- "Corona Virus can be defeated through maximization of social distance"

# Save each document as its own file in a "docs" folder under the working directory.
# writeLines() writes the raw text, without the quotes that write.table() would add.
docs_dir <- file.path(getwd(), "docs")
dir.create(docs_dir, showWarnings = FALSE)
writeLines(doc1, file.path(docs_dir, "doc1.txt"))
writeLines(doc2, file.path(docs_dir, "doc2.txt"))
writeLines(doc3, file.path(docs_dir, "doc3.txt"))
writeLines(doc4, file.path(docs_dir, "doc4.txt"))

# Read every file in that folder; readtext() returns one row per document
txt <- readtext(file.path(docs_dir, "*"))
txt

corp <- corpus(txt)
# Tokenize first: calling dfm() directly on a corpus is deprecated as of quanteda 3.0
x <- dfm(tokens(corp))
x

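If the issue is formatting or cleaning your data so that you can run dfm(), then you need to post a new question that provides the necessary details about your data.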
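For the narrower case the question describes, a long data frame with one token per row, a rough sketch of the reshaping step might look like the following. The toy df_2 here is a hypothetical stand-in for the real 79,447-row data, and the token values are made up for illustration. The base R part cross-tabulates documents against tokens with table(); the optional quanteda part assumes the package is installed and relies on as.tokens() accepting a named list of character vectors.

```r
# Hypothetical stand-in for the df_2 described in the question:
# one row per token, tagged with the document it came from.
df_2 <- data.frame(
  doc_num    = c("doc_1", "doc_1", "doc_2", "doc_2", "doc_2"),
  text_token = c("dog", "cat", "cat", "cat", "fish"),
  stringsAsFactors = FALSE
)

# Base R: cross-tabulate documents (rows) against tokens (columns).
# unclass() turns the table into a plain integer matrix with dimnames.
counts <- unclass(table(doc = df_2$doc_num, term = df_2$text_token))
counts["doc_2", "cat"]  # 2: "cat" appears twice in doc_2

# quanteda (only runs if installed): group the tokens by document,
# convert the named list to a tokens object, then build the dfm.
if (requireNamespace("quanteda", quietly = TRUE)) {
  toks <- quanteda::as.tokens(split(df_2$text_token, df_2$doc_num))
  x <- quanteda::dfm(toks)
  print(x)
}
```

The base R matrix is dense, which is fine for about 400 documents but can grow large with a big vocabulary; a quanteda dfm is stored sparsely. Note also that dfm() lowercases tokens by default.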
