Quanteda dfm_lookup using dictionaries with multi-word patterns/expressions

Question

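I am using a dictionary to identify uses of a specific set of words in a corpus. I have included multi-word patterns in the dictionary, but I don't think dfm_lookup (from the quanteda package) matches multi-word expressions. Does anyone know how to do the same thing as dfm_lookup with a dictionary that contains multi-word expressions?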

library(quanteda)

BritainEN <- 
  dictionary(list(identity=c("British", "Great Britain")))


British <- dfm_lookup(debate_dfm,
                      BritainEN, case_insensitive = TRUE)

Answer 1

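Yes. You need to use tokens_lookup() on the tokens before forming the dfm. Once the words have been tabulated into a dfm as individual features, they no longer exist as the ordered sequences needed to match the multi-word values in a dictionary. So: 1) form the tokens object, 2) apply the dictionary to the tokens with tokens_lookup(), and then 3) form the dfm.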
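To see why the lookup has to happen on the tokens, it can help to compare what the two objects store. The following is only a small sketch, using the same two example sentences as below: the tokens object keeps the words in order, while the dfm keeps only per-document counts of single features (lowercased by default), so no "great britain" feature is left to match.

```r
library("quanteda")

txt <- c(doc1 = "Great Britain is a country.",
         doc2 = "British citizens live in Great Britain.")

# Tokens keep word order, so "Great" is still followed by "Britain".
toks <- tokens(txt)
as.list(toks)$doc1

# A dfm only stores counts of individual features, so the two words
# are separate columns and the sequence information is gone.
feats <- featnames(dfm(toks))
"great britain" %in% feats
c("great", "britain") %in% feats
```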

library("quanteda")
#> Package version: 1.5.2

BritainEN <- 
    dictionary(list(identity = c("British", "Great Britain")))

txt <- c(doc1 = "Great Britain is a country.",
         doc2 = "British citizens live in Great Britain.")

tokens(txt) %>%
    tokens_lookup(dictionary = BritainEN, exclusive = FALSE)
#> tokens from 2 documents.
#> doc1 :
#> [1] "IDENTITY" "is"       "a"        "country"  "."       
#> 
#> doc2 :
#> [1] "IDENTITY" "citizens" "live"     "in"       "IDENTITY" "."

tokens(txt) %>%
    tokens_lookup(dictionary = BritainEN) %>%
    dfm()
#> Document-feature matrix of: 2 documents, 1 feature (0.0% sparse).
#> 2 x 1 sparse Matrix of class "dfm"
#>       features
#> docs   identity
#>   doc1        1
#>   doc2        2

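Added

To answer the additional question in the comments, and to expand on @phiver's very helpful answer: there is also a nested_scope argument, designed for matches that may occur inside another multi-word expression (MWE) value in the dictionary.

Example: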

library("quanteda")
## Package version: 1.5.2

Ireland_nested <- dictionary(list(
  ie_alone = "Ireland",
  ie_nested = "Northern Ireland"
))

txt <- c(
  doc1 = "Northern Ireland is a country.",
  doc2 = "Some citizens of Ireland live in Northern Ireland."
)

toks <- tokens(txt)

tokens_lookup(toks, dictionary = Ireland_nested, exclusive = FALSE)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "IE_NESTED" "IE_ALONE"  "is"        "a"         "country"   "."        
## 
## doc2 :
## [1] "Some"      "citizens"  "of"        "IE_ALONE"  "live"      "in"       
## [7] "IE_NESTED" "IE_ALONE"  "."
tokens_lookup(toks,
  dictionary = Ireland_nested, nested_scope = "dictionary",
  exclusive = FALSE
)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "IE_NESTED" "is"        "a"         "country"   "."        
## 
## doc2 :
## [1] "Some"      "citizens"  "of"        "IE_ALONE"  "live"      "in"       
## [7] "IE_NESTED" "."

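The first call matches both keys, because by default nesting is only checked within a key, whereas here the nested patterns sit in two different keys. (In @phiver's example the patterns are nested within one key; in mine they are not.) With nested_scope = "dictionary", nested pattern matches are looked for across the entire dictionary, not just within a key, so the match is not duplicated in my example.

Which one you choose depends on your purpose. We designed quanteda to have the defaults that most users want and expect, but added options like this one for users with specific needs. (Usually those needs are first voiced by Kohei or me while working on our own use cases!)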
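If the end goal is a dfm rather than a tokens object, the same choice shows up in the counts. This is a minimal sketch that reuses the dictionary and texts from the example above and assumes a quanteda version in which tokens_lookup() supports nested_scope; the object names counts_key and counts_dict are only illustrative.

```r
library("quanteda")

Ireland_nested <- dictionary(list(
  ie_alone = "Ireland",
  ie_nested = "Northern Ireland"
))

txt <- c(
  doc1 = "Northern Ireland is a country.",
  doc2 = "Some citizens of Ireland live in Northern Ireland."
)
toks <- tokens(txt)

# Default scope ("key"): the "Ireland" inside "Northern Ireland"
# is also counted under ie_alone.
counts_key <- as.matrix(dfm(tokens_lookup(toks, dictionary = Ireland_nested)))

# Dictionary-wide scope: the nested "Ireland" is not counted again.
counts_dict <- as.matrix(dfm(tokens_lookup(toks,
  dictionary = Ireland_nested, nested_scope = "dictionary"
)))

counts_key
counts_dict
```

Comparing the ie_alone column of the two matrices shows the effect: only the free-standing "Ireland" in doc2 is counted once the scope is the whole dictionary.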

Answer 2

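To answer your question in the comments:

How does this work if the dictionary contains a single word that also appears in a multi-word expression in the dictionary?

If the text contains "Northern Ireland" and the dictionary contains both "Northern Ireland" and "Ireland", it is counted only once, but only if both values are in the same dictionary key, as in the Britain example in Ken's answer.

See the examples below for the difference.

Example with a combined dictionary: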

library("quanteda")

Ireland_combined <- 
  dictionary(list(identity = c("Ireland", "Northern Ireland")))

txt <- c(doc1 = "Northern Ireland is a country.",
         doc2 = "Some citizens of Ireland live in Northern Ireland.")

tokens(txt) %>%
  tokens_lookup(dictionary = Ireland_combined , exclusive = FALSE)

# tokens from 2 documents.
# doc1 :
# [1] "IDENTITY" "is"       "a"        "country"  "."       
#
# doc2 :
# [1] "Some"     "citizens" "of"       "IDENTITY" "live"     "in"       "IDENTITY" "."


tokens(txt) %>%
  tokens_lookup(dictionary = Ireland_combined ) %>%
  dfm()

# Document-feature matrix of: 2 documents, 1 feature (0.0% sparse).
# 2 x 1 sparse Matrix of class "dfm"
#       features
# docs   identity
#   doc1        1
#   doc2        2

Example with separate dictionary entries:

Ireland_seperated <- 
  dictionary(list(identity1 = c("Ireland"),
                  identity2 = "Northern Ireland"))

tokens(txt) %>%
  tokens_lookup(dictionary = Ireland_seperated , exclusive = FALSE)

# tokens from 2 documents.
# doc1 :
# [1] "IDENTITY2" "IDENTITY1" "is"        "a"         "country"   "."        
# 
# doc2 :
# [1] "Some"      "citizens"  "of"        "IDENTITY1" "live"      "in"        "IDENTITY2" "IDENTITY1" "."

tokens(txt) %>%
  tokens_lookup(dictionary = Ireland_seperated ) %>%
  dfm()

# Document-feature matrix of: 2 documents, 2 features (0.0% sparse).
# 2 x 2 sparse Matrix of class "dfm"
#       features
# docs   identity1 identity2
#   doc1         1         1
#   doc2         2         1

Answer 1: score 4, accepted, 2020-01-23 17:08:23

Answer 2: score 4, 2020-01-24 09:56:42