
How to replace tokens (words) with stemmed versions of words from my own table?

I got data like this (simplified):

library(quanteda)

sample data

myText <- c("ala ma kotka", "kasia ma pieska")  
myDF <- data.frame(myText)
myDF$myText <- as.character(myDF$myText)

tokenization

tokens <- tokens(myDF$myText, what = "word",  
             remove_numbers = TRUE, remove_punct = TRUE,
             remove_symbols = TRUE, remove_hyphens = TRUE)

stemming with my own sample dictionary

Origin <- c("kot", "pies")
Word <- c("kotek","piesek")

myDict <- data.frame(Origin, Word)

myDict$Origin <- as.character(myDict$Origin)
myDict$Word <- as.character(myDict$Word)

what I got

tokens[1]
[1] "Ala"   "ma"    "kotka"

what I would like to get

tokens[1]
[1] "Ala"   "ma"    "kot"
tokens[2]
[1] "Kasia"   "ma"    "pies"

A similar question has been answered here, but since that question's title (and accepted answer) do not make the obvious link, I will show you how this applies to your question specifically. I'll also provide additional detail below to implement your own basic stemmer using wildcards for the suffixes.

Manually mapping stems to inflected forms

The simplest way to do this is by using a custom dictionary where the keys are your stems, and the values are the inflected forms. You can then use tokens_lookup() with the exclusive = FALSE, capkeys = FALSE options to convert the inflected terms into their stems.

Note that I have modified your example a little to simplify it, and to correct what I think were mistakes.

library("quanteda")
packageVersion("quanteda")
[1] ‘0.99.9’

# no need for the data.frame() call
myText <- c("ala ma kotka", "kasia ma pieska")  
toks <- tokens(myText, 
               remove_numbers = TRUE, remove_punct = TRUE,
               remove_symbols = TRUE, remove_hyphens = TRUE)

Origin <- c("kot", "kot", "pies", "pies")
Word <- c("kotek", "kotka", "piesek", "pieska")

Then we create the dictionary, as follows. As of quanteda v0.99.9, values with the same keys are merged, so you could have a list mapping multiple, different inflected forms to the same keys. Here, I had to add new values since the inflected forms in your original Word vector were not found in the myText example.

temp_list <- as.list(Word) 
names(temp_list) <- Origin
(stem_dict <- dictionary(temp_list))
## Dictionary object with 2 key entries.
## - [kot]:
##   - kotek, kotka
## - [pies]:
##   - piesek, pieska    
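To illustrate the merging of values under the same keys, the same dictionary can also be built directly from a named list, without the intermediate temp_list; a minimal sketch (the name stem_dict3 is just for illustration):

# equivalent construction: each key (stem) maps to its inflected forms
stem_dict3 <- dictionary(list(kot  = c("kotek", "kotka"),
                              pies = c("piesek", "pieska")))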

Then tokens_lookup() does its magic.

tokens_lookup(toks, dictionary = stem_dict, exclusive = FALSE, capkeys = FALSE)
## tokens from 2 documents.
## text1 :
## [1] "ala" "ma"  "kot"
## 
## text2 :
## [1] "kasia" "ma"    "pies" 

Wildcarding all stems from common roots

An alternative is to implement your own stemmer using the "glob" wildcarding to represent all suffixes for your Origin vector, which (here, at least) produces the same results:

temp_list <- lapply(unique(Origin), paste0, "*")
names(temp_list) <- unique(Origin)
(stem_dict2 <- dictionary(temp_list))
# Dictionary object with 2 key entries.
# - [kot]:
#   - kot*
# - [pies]:
#   - pies*

tokens_lookup(toks, dictionary = stem_dict2, exclusive = FALSE, capkeys = FALSE)
## tokens from 2 documents.
## text1 :
## [1] "ala" "ma"  "kot"
## 
## text2 :
## [1] "kasia" "ma"    "pies" 
