
How to remove words that start with digits from tokens?

How can I remove words that start with digits from tokens in quanteda? Sample words are 21st, 80s, 8th, and 5k, but they can be completely different and I don't know them in advance.

I have a data frame with English sentences, which I converted to a corpus using quanteda. Next I converted the corpus to tokens and did some cleaning with remove_punct, remove_symbols, remove_numbers, etc. However, remove_numbers does not delete words that start with digits. I would like to delete such words, but I don't know their exact form; they can be, e.g., 21st, 22nd, etc.

library("quanteda")

data = data.frame(
  text = c("R is free software and 2k comes with ABSOLUTELY NO WARRANTY.",
           "You are welcome to redistribute it under 80s certain conditions.",
           "Type 'license()' or 21st 'licence()' for distribution details.",
           "R is a collaborative 6th project with many contributors.",
           "Type 'contributors()' for more information and",
           "'citation()' on how to cite R or R packages in publications."),
  stringsAsFactors = FALSE
)

corp = corpus(data, text_field = "text")
toks = tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE,
              remove_separators = TRUE, split_hyphens = TRUE)
dfmat = dfm(toks, tolower = TRUE, stem = TRUE, remove = stopwords("english"))

This type of problem requires finding the pattern. Here is a solution using gsub:

text = c("R is free software and 2k comes with ABSOLUTELY NO WARRANTY.",
           "You are welcome to redistribute it under 80s certain conditions.",
           "Type 'license()' or 21st 'licence()' for distribution details.",
           "R is a collaborative 6th project with many contributors.",
           "Type 'contributors()' for more information and",
           "'citation()' on how to cite R or R packages in publications.")

# Remove any word made of digits followed by one or more letters (21st, 80s, 2k, 6th).
# The original quantifier {2} required exactly two letters, so 2k and 80s were left in.
text1 <- gsub("\\b[0-9]+[A-Za-z]+\\b", "", text, perl = TRUE)
text1
# [1] "R is free software and  comes with ABSOLUTELY NO WARRANTY."    "You are welcome to redistribute it under  certain conditions."
# [3] "Type 'license()' or  'licence()' for distribution details."      "R is a collaborative  project with many contributors."
# [5] "Type 'contributors()' for more information and"                "'citation()' on how to cite R or R packages in publications."
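
Substituting inside the string leaves a double space where each word was removed. As an alternative, here is a minimal base R sketch that splits each sentence on whitespace, drops the words whose first character is a digit, and joins the rest back together. The helper name drop_digit_words is hypothetical (it is not part of base R or quanteda), and note that it also drops plain numbers such as 2020 but not words that begin with a quote or other punctuation.

```r
# Hypothetical helper, for illustration only: drop whitespace-separated words
# whose first character is a digit, then rebuild each sentence.
drop_digit_words <- function(x) {
  vapply(strsplit(x, "[[:space:]]+"), function(words) {
    keep <- !grepl("^[0-9]", words)      # TRUE for words not starting with a digit
    paste(words[keep], collapse = " ")   # rejoin the kept words with single spaces
  }, character(1))
}

cleaned <- drop_digit_words(c("R is a collaborative 6th project",
                              "free software and 2k comes with"))
cleaned
```

Because the words are rejoined with single spaces, no extra whitespace cleanup is needed afterwards.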

You can refer to the question below, and to the regex cheat sheet, for details:

How do I deal with special characters like \^$.?*|+()[{ in my regex?

https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf

You just need to delete them explicitly, since remove_numbers = TRUE does not handle them. Use a simple regular expression that looks for digits immediately before word characters. In the example below, the lookbehind (?<=\\d{1,5}) matches a run of one to five digits. You can adjust the two bounds to fine-tune the regular expression.

Here is the example, which uses only quanteda but adds tokens_remove() explicitly.

library("quanteda")
#> Package version: 2.0.0
#> Parallel computing: 2 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
#> 
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#> 
#>     View

data = data.frame(
  text = c("R is free software and 2k comes with ABSOLUTELY NO WARRANTY.",
           "You are welcome to redistribute it under 80s certain conditions.",
           "Type 'license()' or 21st 'licence()' for distribution details.",
           "R is a collaborative 6th project with many contributors.",
           "Type 'contributors()' for more information and",
           "'citation()' on how to cite R or R packages in publications."),
  stringsAsFactors = FALSE
)

corp = corpus(data, text_field = "text")
toks = tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE,
              remove_separators = TRUE, split_hyphens = TRUE)
toks = tokens_remove(toks, pattern = "(?<=\\d{1,5})\\w+", valuetype = "regex" )
dfmat = dfm(toks, tolower = TRUE, stem = TRUE, remove = stopwords("english"))

Created on 2020-05-03 by the reprex package (v0.3.0)
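
One caveat: the lookbehind pattern is not anchored to the start of the token, so it appears that it would also match a token with a digit in the middle, such as mp3s. If only tokens that begin with digits should go, an anchored pattern may be safer. The sketch below uses a plain character vector (toks_chr, a made-up stand-in for the token types) and base R's grepl. Passing the same pattern to tokens_remove() with valuetype = "regex" should behave the same way, but that is an assumption not checked here.

```r
# Plain character vector standing in for the types of a quanteda tokens object
# (hypothetical sample data, not taken from the corpus above).
toks_chr <- c("21st", "80s", "2k", "6th", "R", "mp3", "project")

# Anchored pattern: one or more digits at the very start of the token,
# immediately followed by a letter.
starts_with_digit <- grepl("^[0-9]+[[:alpha:]]", toks_chr)

kept <- toks_chr[!starts_with_digit]
kept
```

Here mp3 is kept because its digit is not at the start of the token.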
