
How to apply regex in the Quanteda package in R to remove consecutively repeated tokens (words)

I am currently working on a text mining project, and after running my ngrams model I realize I have sequences of repeated words. I would like to remove the repeated words while keeping their first occurrence. An illustration of what I intend to do is demonstrated with the code below. Thanks!


textfun <- "This this this  this analysis analysis analysis should should remove remove remove all all all all all of of the the the the duplicated duplicated or or or repeated repeated repeated words words words and and return return return return return only their their first first first occurrence"

textfun <- corpus(textfun)

textfuntoks <- tokens(textfun)

textfunRef <- tokens_replace(textfuntoks, pattern = **?**, replacement = **?**, valuetype ="regex")

The desired result is "This analysis should remove all of the duplicated or repeated words and return only their first occurrence". I am only interested in consecutive repetitions.

My main problem is in coming up with values for the "pattern" and "replacement" arguments within the tokens_replace() function. I have tried different patterns, some of which were adapted from sources on here, but none seems to work. An image of the problem is included: [5-gram frequency distribution showing repeated-word sequences for words such as "swag", "pleas", "gas", "books", "chicago", and "happi"]

You can split the data at each word, use rle to find runs of consecutive occurrences, and paste the first value of each run back together.

textfun <- "This this this this analysis analysis analysis should should remove remove remove all all all all all of of the the the the duplicated duplicated or or or repeated repeated repeated words words words and and return return return return return only their their first first first occurrence"

paste0(rle(tolower(strsplit(textfun, '\\s+')[[1]]))$values, collapse = ' ')

#[1] "this analysis should remove all of the duplicated or repeated words and return only their first occurrence"

Interesting challenge. To do this within quanteda, you can create a dictionary mapping each repeated sequence to its single occurrence.

library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

corp <- corpus("This this this  this will analysis analysis analysis should should remove remove remove all all all all all of of the the the the duplicated duplicated or or or repeated repeated repeated words words words and and return return return return return only their their first first first occurrence")
toks <- tokens(corp)

ngrams <- tokens_tolower(toks) %>%
  tokens_ngrams(n = 5:2, concatenator = " ") %>%
  as.character()
# choose only the ngrams that are all the same word
ngrams <- ngrams[lengths(sapply(strsplit(ngrams, split = " "), unique, simplify = TRUE)) == 1]
# remove duplicates
ngrams <- unique(ngrams)

head(ngrams, n = 3)
## [1] "all all all all all"                "return return return return return"
## [3] "this this this this"

So this provides a vector of all (lowercased) repeated values. (To avoid lowercasing, remove the tokens_tolower() line.)

Now we create a dictionary where each sequence is a "value", and each unique token is the "key". Multiple identical keys will exist in the list from which dict is built, but the dictionary() constructor automatically combines them. Once this is created, the sequences can be converted to the single token using tokens_lookup().

dict <- dictionary(
  structure(
    # this causes each ngram to be treated as a single "value"
    as.list(ngrams),
    # each dictionary key will be the unique token
    names = sapply(ngrams, function(x) strsplit(x, split = " ")[[1]][1], simplify = TRUE, USE.NAMES = FALSE)
  )
)

# convert the sequence to their keys
toks2 <- tokens_lookup(toks, dict, exclusive = FALSE, nested_scope = "dictionary", capkeys = FALSE)

print(toks2, max_ntoken = -1)
## Tokens consisting of 1 document.
## text1 :
##  [1] "this"       "will"       "analysis"   "should"     "remove"    
##  [6] "all"        "of"         "the"        "duplicated" "or"        
## [11] "repeated"   "words"      "and"        "return"     "only"      
## [16] "their"      "first"      "occurrence"
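As a quick sanity check, re-running an ngram step on the cleaned tokens should no longer show any repeated-word sequences (a minimal verification sketch, using the same pipe style as above):

toks2 %>%
  tokens_ngrams(n = 2, concatenator = " ") %>%
  as.character() %>%
  head(n = 3)
## expected: "this will"       "will analysis"   "analysis should"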

Created on 2021-04-08 by the reprex package (v1.0.0)
