How to compute collocations in quanteda by a grouping variable?

Question

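I have been working on identifying and classifying collocations with the quanteda package in R.

For example:

I create a tokens object from a list of documents and apply collocation analysis.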

toks <- tokens(text$abstracts)
collocations <- textstat_collocations(toks)

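However, as far as I can tell, there is no clear way to see which collocations are frequent or present in which document. Even if I apply kwic(toks, pattern = phrase(collocations), selection = 'keep'), the result only contains the row id as text1, text2, etc.

I would like to group the collocation analysis results based on docvars. Is this possible with quanteda?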

Answer 1

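It sounds like you want to count collocations by document. The output of textstat_collocations() already provides a count for each collocation, but these counts are for the whole corpus.

So the solution for grouping by document (or by any other variable) is to:

1. Get the collocations using textstat_collocations(). Below, I have done this after removing stopwords and punctuation.
2. Compound the tokens from which the collocations were formed, using tokens_compound(). This converts each collocation sequence into a single token.
3. Form a dfm from the compounded tokens, and use textstat_frequency() to count the compounds by document. This part is a bit trickier.

Implementation using the built-in inaugural corpus: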

library("quanteda")
## Package version: 3.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
library("quanteda.textstats")

toks <- data_corpus_inaugural %>%
  tail(10) %>%
  tokens(remove_punct = TRUE, padding = TRUE) %>%
  tokens_remove(stopwords("en"), padding = TRUE)

colls <- textstat_collocations(toks)
head(colls)
##        collocation count count_nested length   lambda        z
## 1           let us    34            0      2 6.257000 17.80637
## 2  fellow citizens    14            0      2 6.451738 16.18314
## 3 fellow americans    15            0      2 6.221678 16.16410
## 4      one another    14            0      2 6.592755 14.56082
## 5        god bless    15            0      2 8.628894 13.57027
## 6    united states    12            0      2 9.192044 13.22077

Now we compound them and keep only the collocations, then get the frequencies by document:

dfmat <- tokens_compound(toks, colls, concatenator = " ") %>%
  dfm() %>%
  dfm_keep("* *")
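
Because the compounded result is an ordinary document-feature matrix, its per-document counts can be inspected directly. The following is only a minimal sketch of that idea: it rebuilds the objects above without the pipe so that it runs on its own, and the names corp, toks_comp and counts_df are introduced here purely for illustration.

```r
library("quanteda")
library("quanteda.textstats")

# Rebuild the objects from the answer, step by step (no pipe needed)
corp <- tail(data_corpus_inaugural, 10)
toks <- tokens(corp, remove_punct = TRUE, padding = TRUE)
toks <- tokens_remove(toks, stopwords("en"), padding = TRUE)
colls <- textstat_collocations(toks)

# Compound the collocations and keep only features that contain a space
toks_comp <- tokens_compound(toks, colls, concatenator = " ")
dfmat <- dfm_keep(dfm(toks_comp), "* *")

# Rows are documents, columns are compounded collocations
dim(dfmat)

# Counts of one collocation across documents, guarded in case it is absent
if ("let us" %in% featnames(dfmat)) {
  print(dfmat[, "let us"])
}

# The whole matrix as a data.frame, with the document names in doc_id
counts_df <- convert(dfmat, to = "data.frame")
head(counts_df[, 1:4])
```

This wide layout has one column per collocation, which may be less convenient than a long table when there are many features.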

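This dfm already contains the document-level counts of each collocation, but if you want the counts in data.frame format, with a grouping option, use textstat_frequency(). Here I have output only the top two collocations per document, but if you remove n = 2, it will give you the frequencies of all collocations by document.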

textstat_frequency(dfmat, groups = docnames(dfmat), n = 2) %>%
  head(10)
##             feature frequency rank docfreq        group
## 1   nuclear weapons         4    1       1  1985-Reagan
## 2     human freedom         3    2       1  1985-Reagan
## 3        new breeze         4    1       1    1989-Bush
## 4    new engagement         3    2       1    1989-Bush
## 5            let us         7    1       1 1993-Clinton
## 6  fellow americans         4    2       1 1993-Clinton
## 7            let us         6    1       1 1997-Clinton
## 8       new century         6    1       1 1997-Clinton
## 9  nation's promise         2    1       1    2001-Bush
## 10      common good         2    1       1    2001-Bush
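
The question asked about grouping by a docvar rather than by document. The call above passes docnames(dfmat) as groups, that is, a vector with one value per document, so a document variable can be passed in the same way. The sketch below assumes the Party docvar that ships with data_corpus_inaugural; freq_party and the intermediate names are illustrative, and the objects are rebuilt so that the block runs on its own.

```r
library("quanteda")
library("quanteda.textstats")

# Rebuild the compounded dfm from the answer, step by step
corp <- tail(data_corpus_inaugural, 10)
toks <- tokens(corp, remove_punct = TRUE, padding = TRUE)
toks <- tokens_remove(toks, stopwords("en"), padding = TRUE)
colls <- textstat_collocations(toks)
toks_comp <- tokens_compound(toks, colls, concatenator = " ")
dfmat <- dfm_keep(dfm(toks_comp), "* *")

# docvars are carried from the corpus through tokens to the dfm
table(docvars(dfmat, "Party"))

# Group by the docvar instead of by document name
freq_party <- textstat_frequency(dfmat, n = 5,
                                 groups = docvars(dfmat, "Party"))
freq_party
```

For data shaped like the asker's, the grouping column would first need to be attached as a docvar, for example by building the corpus from the data frame with corpus(text, text_field = "abstracts") so that the remaining columns are carried along. That the asker's text object is a data frame with such a column is an assumption based on the text$abstracts call in the question.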

Answer 2

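Construct a DFM · Select features · Look up dictionary · Group documents... Collocation analysis allows us to identify contiguous collocations of words. ... are proper names, which can be identified simply by capitalization in English texts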

Answer 1
1 Accepted 2021-05-25 09:27:13

Answer 2
0 2021-06-21 14:04:44
