Problem with multi-word dictionary entries in quanteda when using dfm_lookup

Question

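I am a beginner with R and quanteda, and I have not been able to solve the following problem even after reading similar threads.

I have a dataset imported from Stata in which a text column (tweet_message in the example below) contains tweets from different groups of people, identified by the variable "group". I want to count, at the group level, how often the words in my dictionary occur.

Here is a reproducible example: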

dput(tweets[1:4, ])
structure(list(tweet_id = c("174457180812_10156824364270813", 
"174457180812_10156824136360813", "174457180812_10156823535820813", 
"174457180812_10156823868565813"), tweet_message = c("Climate change is a big issue", 
"We should care about the environment", "Let's rethink environmental policies", 
"#Davos WEF"
), date = c("2019-03-25T23:03:56+0000", "2019-03-25T21:10:36+0000", 
"2019-03-25T21:00:03+0000", "2019-03-25T20:00:03+0000"), group = c("1", 
"2", "3", "4")), row.names = c(NA, -4L), class = c("tbl_df", 
"tbl", "data.frame"))

First I create my dictionary:

climatechange_dict <- dictionary(list(
  climate = c(
    "environment*",
    "climate change")))

Then I specify the corpus:

climate_corpus <- corpus(tweets$tweet_message)

I create a dfm for each group:

group1_dfm <- dfm(corpus_subset(climate_corpus, tweets$group == "1"))

Then I try to compute, for each group, how often the dictionary words occur:

group1_climate <- dfm_lookup(group1_dfm, dictionary = climatechange_dict)
group1 <- subset(tweets, tweets$group == "1")
group1$climatescore <- as.numeric(group1_climate[,1])

group1$climate <- "normal"
group1$climate[group1$climatescore > 0] <- "climate"
table(group1$climate)

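My problem is that, done this way, multi-word dictionary entries such as "climate change" are not counted. I have read online that I need to apply tokens_lookup() to the tokens and then build the dfm, but I do not know how to do that in this case. I would be very grateful for any help with this. Thank you very much!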

Answer 1

Since you have not provided a reproducible example, it is hard to be sure this will work, but try the following:

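library(quanteda)

# build the corpus from the data.frame; the remaining columns become docvars
climate_corpus <- corpus(tweets, text_field = "tweet_message")

climatechange_dict <- 
    dictionary(list(climate = c("environment*", "climate change")))

# look up the dictionary on the tokens, then build a dfm grouped by "group"
groupeddfm <- tokens(climate_corpus) %>%
    tokens_lookup(dictionary = climatechange_dict) %>%
    dfm(groups = "group")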

This does the following:

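1. Creates a corpus from your tweets data.frame, adding the other variables as docvars. (If you know which column is the unique document identifier, you can also specify it with docid_field = "<yourdocidentifier>".)
2. Performs the dictionary "lookup" on the tokens, which means you will pick up phrases such as "climate change". This cannot happen with dfm_lookup(), because dfm() turns the tokens into "features" whose order is no longer recorded, so phrases cannot be recovered.
3. Combines the documents into groups according to the group column of tweets. This removes the need for any manual grouping with subsets. (I think this is what you want, right?)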
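
To make the point about phrases above concrete, here is a minimal sketch that runs both lookups on the four sample tweets from the question. The object names (txt, dict, dfm_based, tok_based) and the document names d1 to d4 are made up for illustration, and the calls are nested rather than piped so that nothing depends on which pipe operator is available.

```r
library(quanteda)

# the four sample tweets from the question (document names are illustrative)
txt <- c(d1 = "Climate change is a big issue",
         d2 = "We should care about the environment",
         d3 = "Let's rethink environmental policies",
         d4 = "#Davos WEF")

dict <- dictionary(list(climate = c("environment*", "climate change")))
toks <- tokens(txt)

# Lookup on the dfm: each feature is a single token with no word order,
# so the two-word entry "climate change" has nothing to match against.
dfm_based <- dfm_lookup(dfm(toks), dictionary = dict)

# Lookup on the tokens: word order is still available, so the phrase is found.
tok_based <- dfm(tokens_lookup(toks, dictionary = dict))

# compare the per-document counts for the "climate" key
cbind(dfm_lookup    = as.matrix(dfm_based)[, "climate"],
      tokens_lookup = as.matrix(tok_based)[, "climate"])
```

The two columns should differ only for d1, the tweet whose sole match is the phrase "climate change".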

The resulting dfm will be ngroups x 1, where the 1 is the single key of the dictionary. You can easily coerce it to a data.frame or another format using convert().
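
As a sketch of that last step, assuming a recent quanteda release in which grouping is done with dfm_group() rather than through a groups argument to dfm(), the grouped counts can be converted to a data.frame and labelled the way the question does. The tweets data.frame below is a cut-down stand-in for the question's data, and the names scores and label are hypothetical.

```r
library(quanteda)

# cut-down stand-in for the question's data (only the columns used here)
tweets <- data.frame(
  tweet_message = c("Climate change is a big issue",
                    "We should care about the environment",
                    "Let's rethink environmental policies",
                    "#Davos WEF"),
  group = c("1", "2", "3", "4"),
  stringsAsFactors = FALSE
)

corp <- corpus(tweets, text_field = "tweet_message")
dict <- dictionary(list(climate = c("environment*", "climate change")))

# dictionary lookup on the tokens, then a document-level dfm
doc_dfm <- dfm(tokens_lookup(tokens(corp), dictionary = dict))

# combine documents by group; the grouping vector is passed explicitly
grouped <- dfm_group(doc_dfm, groups = factor(docvars(doc_dfm, "group")))

# one row per group, one column per dictionary key
scores <- convert(grouped, to = "data.frame")

# same labelling as in the question: any match marks the group as "climate"
scores$label <- ifelse(scores$climate > 0, "climate", "normal")
table(scores$label)
```

Passing the grouping vector explicitly through docvars() is a deliberate choice here: it avoids relying on how a given quanteda version resolves a bare or quoted column name in groups.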

Answer 1: 2020-03-13 00:30:33
