Quanteda dfm_lookup using dictionaries with multi-word patterns/expressions
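I am using a dictionary to identify uses of a specific set of words in a corpus. I have included multi-word patterns in the dictionary, but I don't think dfm_lookup (from the quanteda package) matches multi-word expressions. Does anyone know how to do the same thing as dfm_lookup with a dictionary that contains multi-word expressions?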
library(quanteda)
BritainEN <-
  dictionary(list(identity = c("British", "Great Britain")))
# debate_dfm is the dfm built from the corpus (construction not shown)
British <- dfm_lookup(debate_dfm,
                      BritainEN, case_insensitive = TRUE)
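Yes: you need to use tokens_lookup() on the tokens, before forming the dfm. Once the individual words have been tokenized and tabulated in a dfm, they no longer exist as the ordered sequences needed to match the multi-word values in the dictionary. So 1) form the tokens object, 2) use tokens_lookup() to apply the dictionary to the tokens, and then 3) form the dfm.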
library("quanteda")
#> Package version: 1.5.2
BritainEN <-
  dictionary(list(identity = c("British", "Great Britain")))
txt <- c(doc1 = "Great Britain is a country.",
         doc2 = "British citizens live in Great Britain.")
tokens(txt) %>%
  tokens_lookup(dictionary = BritainEN, exclusive = FALSE)
#> tokens from 2 documents.
#> doc1 :
#> [1] "IDENTITY" "is" "a" "country" "."
#>
#> doc2 :
#> [1] "IDENTITY" "citizens" "live" "in" "IDENTITY" "."
tokens(txt) %>%
  tokens_lookup(dictionary = BritainEN) %>%
  dfm()
#> Document-feature matrix of: 2 documents, 1 feature (0.0% sparse).
#> 2 x 1 sparse Matrix of class "dfm"
#>       features
#> docs   identity
#>   doc1        1
#>   doc2        2
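To make the difference concrete, here is a minimal sketch that runs both routes on the same toy texts and puts the per-document counts side by side. The object names (via_dfm, via_tokens, counts) are only illustrative, and it assumes a quanteda version in which dfm_lookup() accepts a dictionary with multi-word values without raising an error. I would expect the dfm_lookup column to miss "Great Britain", because the dfm only contains single-word features.

```r
library("quanteda")

# Same toy dictionary and texts as in the example above
BritainEN <- dictionary(list(identity = c("British", "Great Britain")))
txt <- c(doc1 = "Great Britain is a country.",
         doc2 = "British citizens live in Great Britain.")

toks <- tokens(txt)

# Route 1: look up on the dfm; word order is already gone at this point
via_dfm <- dfm_lookup(dfm(toks), dictionary = BritainEN)

# Route 2: look up on the tokens first, then tabulate into a dfm
via_tokens <- dfm(tokens_lookup(toks, dictionary = BritainEN))

# Per-document counts for the "identity" key under each route
counts <- cbind(
  dfm_lookup    = as.matrix(via_dfm)[, "identity"],
  tokens_lookup = as.matrix(via_tokens)[, "identity"]
)
print(counts)
```

If the two columns differ for a document, that document contains a multi-word dictionary value that only the tokens-based route can see.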
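Added:
To answer the additional question in the comments, and to expand on @phiver's very useful answer to it, there is also a nested_scope argument designed for matches that may occur inside a value of another multi-word expression (MWE) dictionary key.
Example: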
library("quanteda")
## Package version: 1.5.2
Ireland_nested <- dictionary(list(
  ie_alone = "Ireland",
  ie_nested = "Northern Ireland"
))
txt <- c(
  doc1 = "Northern Ireland is a country.",
  doc2 = "Some citizens of Ireland live in Northern Ireland."
)
toks <- tokens(txt)
tokens_lookup(toks, dictionary = Ireland_nested, exclusive = FALSE)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "IE_NESTED" "IE_ALONE" "is" "a" "country" "."
##
## doc2 :
## [1] "Some" "citizens" "of" "IE_ALONE" "live" "in"
## [7] "IE_NESTED" "IE_ALONE" "."
tokens_lookup(toks,
  dictionary = Ireland_nested, nested_scope = "dictionary",
  exclusive = FALSE
)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "IE_NESTED" "is" "a" "country" "."
##
## doc2 :
## [1] "Some" "citizens" "of" "IE_ALONE" "live" "in"
## [7] "IE_NESTED" "."
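The first call matches both keys because, by default, nesting is only resolved within a key, whereas here the nested pattern occurs across two different keys. (In @phiver's example the patterns are nested within a single key; in my example they are not.) With nested_scope = "dictionary", it looks for nested pattern matches across the entire dictionary, not just within a key, so "Ireland" is not matched a second time in my example.
Which one you choose depends on your purpose. We designed quanteda to have the defaults that most users would want and expect, but we added other options like this one for users with specific needs. (Usually those needs are first expressed by Kohei or me while working on our own specific use cases!)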
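As a rough illustration of how that choice affects the final counts, the sketch below tabulates the same texts into a dfm under each nested_scope setting. count_keys() is a hypothetical helper written for this example, not part of quanteda, and it assumes a quanteda version whose tokens_lookup() has the nested_scope argument with the values "key" and "dictionary".

```r
library("quanteda")

Ireland_nested <- dictionary(list(
  ie_alone  = "Ireland",
  ie_nested = "Northern Ireland"
))
txt <- c(
  doc1 = "Northern Ireland is a country.",
  doc2 = "Some citizens of Ireland live in Northern Ireland."
)
toks <- tokens(txt)

# Hypothetical helper: apply the dictionary with a given nested_scope and
# return the per-document key counts as a plain matrix
count_keys <- function(scope) {
  looked_up <- tokens_lookup(toks, dictionary = Ireland_nested,
                             nested_scope = scope)
  as.matrix(dfm(looked_up))[, c("ie_alone", "ie_nested")]
}

by_key  <- count_keys("key")         # the default: nesting resolved per key
by_dict <- count_keys("dictionary")  # nesting resolved across all keys

print(by_key)
print(by_dict)
```

Under the default scope, the "Ireland" inside "Northern Ireland" should also be counted under ie_alone; with nested_scope = "dictionary" it should only count toward ie_nested.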
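To answer your question in the comments:
"How does this work if the dictionary contains a single word that also appears in a multi-word expression in the dictionary?"
If the text contains "Northern Ireland" and the dictionary contains both "Northern Ireland" and "Ireland", it will be counted only once, but only if both values are in the same dictionary key, as in the British example in Ken's answer.
See the examples below for the difference.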
Example with a combined dictionary entry:
library("quanteda")
Ireland_combined <-
  dictionary(list(identity = c("Ireland", "Northern Ireland")))
txt <- c(doc1 = "Northern Ireland is a country.",
         doc2 = "Some citizens of Ireland live in Northern Ireland.")
tokens(txt) %>%
  tokens_lookup(dictionary = Ireland_combined, exclusive = FALSE)
# tokens from 2 documents.
# doc1 :
# [1] "IDENTITY" "is" "a" "country" "."
#
# doc2 :
# [1] "Some" "citizens" "of" "IDENTITY" "live" "in" "IDENTITY" "."
tokens(txt) %>%
  tokens_lookup(dictionary = Ireland_combined) %>%
  dfm()
# Document-feature matrix of: 2 documents, 1 feature (0.0% sparse).
# 2 x 1 sparse Matrix of class "dfm"
#       features
# docs   identity
#   doc1        1
#   doc2        2
Example with separate dictionary entries:
Ireland_separated <-
  dictionary(list(identity1 = c("Ireland"),
                  identity2 = "Northern Ireland"))
tokens(txt) %>%
  tokens_lookup(dictionary = Ireland_separated, exclusive = FALSE)
# tokens from 2 documents.
# doc1 :
# [1] "IDENTITY2" "IDENTITY1" "is" "a" "country" "."
#
# doc2 :
# [1] "Some" "citizens" "of" "IDENTITY1" "live" "in" "IDENTITY2" "IDENTITY1" "."
tokens(txt) %>%
  tokens_lookup(dictionary = Ireland_separated) %>%
  dfm()
# Document-feature matrix of: 2 documents, 2 features (0.0% sparse).
# 2 x 2 sparse Matrix of class "dfm"
#       features
# docs   identity1 identity2
#   doc1         1         1
#   doc2         2         1