How to replace tokens (words) with stemmed versions of words from my own table?
I got data like this (simplified):
library(quanteda)
# sample data
myText <- c("ala ma kotka", "kasia ma pieska")
myDF <- data.frame(myText)
myDF$myText <- as.character(myDF$myText)
# tokenization
tokens <- tokens(myDF$myText, what = "word",
                 remove_numbers = TRUE, remove_punct = TRUE,
                 remove_symbols = TRUE, remove_hyphens = TRUE)
# stemming with my own data sample dictionary
Origin <- c("kot", "pies")
Word <- c("kotek","piesek")
myDict <- data.frame(Origin, Word)
myDict$Origin <- as.character(myDict$Origin)
myDict$Word <- as.character(myDict$Word)
# what I got
tokens[1]
[1] "Ala" "ma" "kotka"
# what I would like to get
tokens[1]
[1] "Ala" "ma" "kot"
tokens[2]
[1] "Kasia" "ma" "pies"
A similar question has been answered here, but since that question's title (and accepted answer) do not make the obvious link, I will show you how this applies to your question specifically. I'll also provide additional detail below to implement your own basic stemmer using wildcards for the suffixes.
The simplest way to do this is by using a custom dictionary where the keys are your stems, and the values are the inflected forms. You can then use tokens_lookup() with the exclusive = FALSE, capkeys = FALSE options to convert the inflected terms into their stems.
Note that I have modified your example a little to simplify it, and to correct what I think were mistakes.
library("quanteda")
packageVersion("quanteda")
## [1] '0.99.9'
# no need for the data.frame() call
myText <- c("ala ma kotka", "kasia ma pieska")
toks <- tokens(myText,
               remove_numbers = TRUE, remove_punct = TRUE,
               remove_symbols = TRUE, remove_hyphens = TRUE)
Origin <- c("kot", "kot", "pies", "pies")
Word <- c("kotek", "kotka", "piesek", "pieska")
Then we create the dictionary, as follows. As of quanteda v0.99.9, values with the same keys are merged, so you could have a list mapping multiple, different inflected forms to the same keys. Here, I had to add new values, since the inflected forms in your original Word vector were not found in the myText example.
temp_list <- as.list(Word)
names(temp_list) <- Origin
(stem_dict <- dictionary(temp_list))
## Dictionary object with 2 key entries.
## - [kot]:
## - kotek, kotka
## - [pies]:
## - piesek, pieska
Then tokens_lookup() does its magic.
tokens_lookup(toks, dictionary = stem_dict, exclusive = FALSE, capkeys = FALSE)
## tokens from 2 documents.
## text1 :
## [1] "ala" "ma" "kot"
##
## text2 :
## [1] "kasia" "ma" "pies"
An alternative is to implement your own stemmer using the "glob" wildcarding to represent all suffixes for your Origin vector, which (here, at least) produces the same results:
temp_list <- lapply(unique(Origin), paste0, "*")
names(temp_list) <- unique(Origin)
(stem_dict2 <- dictionary(temp_list))
# Dictionary object with 2 key entries.
# - [kot]:
# - kot*
# - [pies]:
# - pies*
tokens_lookup(toks, dictionary = stem_dict2, exclusive = FALSE, capkeys = FALSE)
## tokens from 2 documents.
## text1 :
## [1] "ala" "ma" "kot"
##
## text2 :
## [1] "kasia" "ma" "pies"