Quanteda dfm_lookup using dictionaries with multi-word patterns/expressions

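I am using a dictionary to identify how a particular set of words is used in a corpus. I have included multi-word patterns in the dictionary, but I don't think dfm_lookup (from the quanteda package) matches multi-word expressions. Does anyone know how to do the same thing as dfm_lookup with a dictionary that contains multi-word expressions?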

library(quanteda)

BritainEN <- 
  dictionary(list(identity=c("British", "Great Britain")))


# debate_dfm is the dfm built from my corpus (not shown)
British <- dfm_lookup(debate_dfm,
                       BritainEN, case_insensitive = TRUE)

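Yes, you need to use tokens_lookup() on the tokens before forming the dfm. Once the text has been reduced to counts of individual words, those words no longer exist as the ordered sequences needed to match the multi-word values in the dictionary. So: 1) form the tokens object, 2) use tokens_lookup() to apply the dictionary to the tokens, and then 3) form the dfm.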

library("quanteda")
#> Package version: 1.5.2

BritainEN <- 
    dictionary(list(identity = c("British", "Great Britain")))

txt <- c(doc1 = "Great Britain is a country.",
         doc2 = "British citizens live in Great Britain.")

tokens(txt) %>%
    tokens_lookup(dictionary = BritainEN, exclusive = FALSE)
#> tokens from 2 documents.
#> doc1 :
#> [1] "IDENTITY" "is"       "a"        "country"  "."       
#> 
#> doc2 :
#> [1] "IDENTITY" "citizens" "live"     "in"       "IDENTITY" "."

tokens(txt) %>%
    tokens_lookup(dictionary = BritainEN) %>%
    dfm()
#> Document-feature matrix of: 2 documents, 1 feature (0.0% sparse).
#> 2 x 1 sparse Matrix of class "dfm"
#>       features
#> docs   identity
#>   doc1        1
#>   doc2        2
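
As a rough side-by-side of why the order of the three steps matters, the sketch below applies the same dictionary once after the dfm is built and once on the tokens. The object names `late` and `early` are illustrative only, and the code assumes a quanteda version in which dfm() accepts a tokens object.

```r
library("quanteda")

dict <- dictionary(list(identity = c("British", "Great Britain")))

txt <- c(doc1 = "Great Britain is a country.",
         doc2 = "British citizens live in Great Britain.")

toks <- tokens(txt)

# Lookup after forming the dfm: the features are single words,
# so only the one-word value "British" has anything to match
late <- dfm_lookup(dfm(toks), dictionary = dict)

# Lookup on the tokens, then form the dfm: word order is still
# available, so "Great Britain" can match as a sequence
early <- dfm(tokens_lookup(toks, dictionary = dict))

as.matrix(late)
as.matrix(early)
```

Comparing the two matrices should show the "Great Britain" matches missing from `late` and present in `early`.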

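Added:

To answer the additional question in the comments, and to expand on @phiver's very useful answer, there is also a nested_scope argument, designed specifically for matches that may occur inside another multi-word expression (MWE) value under a different dictionary key.

Example: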

library("quanteda")
## Package version: 1.5.2

Ireland_nested <- dictionary(list(
  ie_alone = "Ireland",
  ie_nested = "Northern Ireland"
))

txt <- c(
  doc1 = "Northern Ireland is a country.",
  doc2 = "Some citizens of Ireland live in Northern Ireland."
)

toks <- tokens(txt)

tokens_lookup(toks, dictionary = Ireland_nested, exclusive = FALSE)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "IE_NESTED" "IE_ALONE"  "is"        "a"         "country"   "."        
## 
## doc2 :
## [1] "Some"      "citizens"  "of"        "IE_ALONE"  "live"      "in"       
## [7] "IE_NESTED" "IE_ALONE"  "."
tokens_lookup(toks,
  dictionary = Ireland_nested, nested_scope = "dictionary",
  exclusive = FALSE
)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "IE_NESTED" "is"        "a"         "country"   "."        
## 
## doc2 :
## [1] "Some"      "citizens"  "of"        "IE_ALONE"  "live"      "in"       
## [7] "IE_NESTED" "."

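The first call matches both keys because, by default, nesting is only resolved within a key, whereas here the nested pattern occurs across two different keys. (In @phiver's answer the patterns are nested within one key; in my example they are not.) With nested_scope = "dictionary", it looks for nested pattern matches across the entire dictionary, not just within a key, so the match is not doubled up in my example.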
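
To see the effect on the counts themselves, here is a minimal sketch that builds a dfm under each scope. The names `by_key` and `by_dict` are illustrative, and `nested_scope = "key"` is written out explicitly on the assumption that it is the default.

```r
library("quanteda")

dict <- dictionary(list(
  ie_alone = "Ireland",
  ie_nested = "Northern Ireland"
))

toks <- tokens(c(
  doc1 = "Northern Ireland is a country.",
  doc2 = "Some citizens of Ireland live in Northern Ireland."
))

# Nesting resolved within each key only: "Ireland" inside
# "Northern Ireland" is also counted under ie_alone
by_key <- dfm(tokens_lookup(toks, dictionary = dict, nested_scope = "key"))

# Nesting resolved across the whole dictionary: the longer
# match takes precedence and "Ireland" is not counted again
by_dict <- dfm(tokens_lookup(toks, dictionary = dict,
                             nested_scope = "dictionary"))

as.matrix(by_key)
as.matrix(by_dict)
```

The ie_nested column should be the same in both, with only ie_alone differing.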

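Which one you choose depends on your purpose. We designed quanteda to have the defaults that most users would want and expect, but added other options like this one for users with specific needs. (Usually those needs are first voiced by Kohei or me while working on our own particular use cases!)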

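To answer your question in the comments:

How does this work if the dictionary contains a single word that also appears in a multi-word expression in the dictionary?

If the text contains "Northern Ireland" and the dictionary contains both "Northern Ireland" and "Ireland", it will be counted only once, but only if both values are in the same dictionary group (key), as in the Britain example in Ken's answer.

See the examples below for the difference.

Example with a combined dictionary: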

library("quanteda")

Ireland_combined <- 
  dictionary(list(identity = c("Ireland", "Northern Ireland")))

txt <- c(doc1 = "Northern Ireland is a country.",
         doc2 = "Some citizens of Ireland live in Northern Ireland.")

tokens(txt) %>%
  tokens_lookup(dictionary = Ireland_combined , exclusive = FALSE)

# tokens from 2 documents.
# doc1 :
# [1] "IDENTITY" "is"       "a"        "country"  "."       
#
# doc2 :
# [1] "Some"     "citizens" "of"       "IDENTITY" "live"     "in"       "IDENTITY"
# [8] "."


tokens(txt) %>%
  tokens_lookup(dictionary = Ireland_combined ) %>%
  dfm()

# Document-feature matrix of: 2 documents, 1 feature (0.0% sparse).
# 2 x 1 sparse Matrix of class "dfm"
#       features
# docs   identity
#   doc1        1
#   doc2        2

Example with separate dictionary entries:

Ireland_seperated <- 
  dictionary(list(identity1 = c("Ireland"),
                  identity2 = "Northern Ireland"))

tokens(txt) %>%
  tokens_lookup(dictionary = Ireland_seperated , exclusive = FALSE)

# tokens from 2 documents.
# doc1 :
# [1] "IDENTITY2" "IDENTITY1" "is"        "a"         "country"   "."        
# 
# doc2 :
# [1] "Some"      "citizens"  "of"        "IDENTITY1" "live"      "in"       
# [7] "IDENTITY2" "IDENTITY1" "."

tokens(txt) %>%
  tokens_lookup(dictionary = Ireland_seperated ) %>%
  dfm()

# Document-feature matrix of: 2 documents, 2 features (0.0% sparse).
# 2 x 2 sparse Matrix of class "dfm"
#       features
# docs   identity1 identity2
#   doc1         1         1
#   doc2         2         1
