How can I make large additions to textstem's lexicon in R?
I have a large body of free-text survey comments that I'm attempting to analyze. I used the textstem package to perform lemmatization, but after looking at the unique tokens it identified I'd like to make further adjustments. For example, it identified "abuses", "abused", and "abusing" as the lemma "abuse", but it left "abusive" untouched... I'd like to change that to "abuse" as well.
I found this post, which described how to add to the lexicon on a piecemeal basis, such as
lemmas <- lexicon::hash_lemmas[token=="abusive",lemma:="abuse"]
lemmatize_strings(words, dictionary = lemmas)
but in my case I'll have a data frame with several hundred token/lemma pairs. How can I quickly add them all to lexicon::hash_lemmas?
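For reference, lexicon::hash_lemmas is just a two-column data.table (token and lemma), and my replacement pairs follow the same layout, along these lines (the values here are only illustrative):

my_pairs <- data.frame(
  token = c("abusive", "abusively"),
  lemma = c("abuse", "abuse")
)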
duh...
library(readr)

new_lemmas <- read_csv("newLemmas.csv")  # custom pairs, columns: token, lemma
big_lemmas <- rbind(lexicon::hash_lemmas, new_lemmas)
big_lemmas <- big_lemmas[!duplicated(big_lemmas$token)]  # keep first row per token (hash_lemmas entries win ties)
then use big_lemmas as the dictionary in lemmatize_strings()
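For example, a minimal sketch, assuming words holds the survey comments (the sample string below is just illustrative):

library(textstem)

words <- "The comments described abusive behavior and repeated abuses"
lemmatize_strings(words, dictionary = big_lemmas)
# with the custom entry, "abusive" should now come back as "abuse" as well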