How to remove words that start with digits from tokens?

Question

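How can I remove words that start with digits from tokens in quanteda? Example words: 21st, 80s, 8th, 5k, but they can be completely different and I don't know them in advance.

I have a data frame with English sentences. I converted it to a corpus with quanteda. Next, I converted the corpus to tokens and did some cleaning, such as remove_punct, remove_symbols, remove_numbers, and so on. However, the remove_numbers option does not remove words that start with digits. I would like to remove such words, but I don't know their exact form; it could be 21st, 22nd, etc.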

library("quanteda")

data = data.frame(
  text = c("R is free software and 2k comes with ABSOLUTELY NO WARRANTY.",
           "You are welcome to redistribute it under 80s certain conditions.",
           "Type 'license()' or 21st 'licence()' for distribution details.",
           "R is a collaborative 6th project with many contributors.",
           "Type 'contributors()' for more information and",
           "'citation()' on how to cite R or R packages in publications."),
  stringsAsFactors = FALSE
)

corp = corpus(data, text_field = "text")
toks = tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE,
              remove_separators = TRUE, split_hyphens = TRUE)
dfmat = dfm(toks, tolower = TRUE, stem = TRUE, remove = stopwords("english"))

Answer 1

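This kind of problem comes down to finding a pattern. Here is a solution using gsub. Note that the pattern below matches one or more digits followed by two lowercase letters, so it removes 21st and 6th but leaves 2k and 80s in place, as the output shows: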

text = c("R is free software and 2k comes with ABSOLUTELY NO WARRANTY.",
           "You are welcome to redistribute it under 80s certain conditions.",
           "Type 'license()' or 21st 'licence()' for distribution details.",
           "R is a collaborative 6th project with many contributors.",
           "Type 'contributors()' for more information and",
           "'citation()' on how to cite R or R packages in publications.")

text1 <- gsub("[0-9]+[a-z]{2}", "", text)
text1
# [1] "R is free software and 2k comes with ABSOLUTELY NO WARRANTY."
# [2] "You are welcome to redistribute it under 80s certain conditions."
# [3] "Type 'license()' or  'licence()' for distribution details."
# [4] "R is a collaborative  project with many contributors."
# [5] "Type 'contributors()' for more information and"
# [6] "'citation()' on how to cite R or R packages in publications."
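Since the question also lists words such as 2k and 80s, a looser pattern may be closer to what is wanted. The following is only a sketch in base R, not part of the original answer: the name `digit_word` and the shortened `text` vector are made up for illustration, and the pattern assumes the unwanted words consist of digits followed by ASCII letters.

```r
text <- c(
  "R is free software and 2k comes with ABSOLUTELY NO WARRANTY.",
  "You are welcome to redistribute it under 80s certain conditions.",
  "Type 'license()' or 21st 'licence()' for distribution details.",
  "R is a collaborative 6th project with many contributors."
)

# \\b        word boundary, so the digits must start the word
# [0-9]+     one or more digits
# [A-Za-z]+  one or more letters (any count, not exactly two)
# \\s*       also consume trailing whitespace to avoid double spaces
digit_word <- "\\b[0-9]+[A-Za-z]+\\b\\s*"

cleaned <- gsub(digit_word, "", text, perl = TRUE)
cleaned
```

Because the replacement happens on the raw strings, this would run before the text is passed to corpus() and tokens().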

For details, you can refer to the following question:

How do I deal with special characters like \^$.?*|+()[{ in my regex?

https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf

Answer 2

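You just need to remove them explicitly, because they are not handled by remove_numbers = TRUE. Use a simple regular expression that looks for some digits in front of word characters. In the example below, I look for a sequence of 1 to 5 digits (i.e. (?<=\\d{1,5})). You can adjust these two numbers to fine-tune your regex.

Here is an example that uses only quanteda but explicitly adds tokens_remove().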

library("quanteda")
#> Package version: 2.0.0
#> Parallel computing: 2 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
#> 
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#> 
#>     View

data = data.frame(
  text = c("R is free software and 2k comes with ABSOLUTELY NO WARRANTY.",
           "You are welcome to redistribute it under 80s certain conditions.",
           "Type 'license()' or 21st 'licence()' for distribution details.",
           "R is a collaborative 6th project with many contributors.",
           "Type 'contributors()' for more information and",
           "'citation()' on how to cite R or R packages in publications."),
  stringsAsFactors = FALSE
)

corp = corpus(data, text_field = "text")
toks = tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE,
              remove_separators = TRUE, split_hyphens = TRUE)
toks = tokens_remove(toks, pattern = "(?<=\\d{1,5})\\w+", valuetype = "regex" )
dfmat = dfm(toks, tolower = TRUE, stem = TRUE, remove = stopwords("english"))

Created on 2020-05-03 by the reprex package (v0.3.0)
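One thing to keep in mind is that the lookbehind pattern above is not anchored to the start of the token, so as far as I can tell it also matches tokens that merely contain a digit followed by another word character (for instance mp3s). If only tokens that begin with a digit should go, an anchored pattern is an alternative. The sketch below is an illustration rather than the answerer's code; the two sentences in `txt` are invented for the example.

```r
library("quanteda")

# Hypothetical sample text, not from the question
txt <- c("R is a collaborative 6th project from the 80s.",
         "Save the mp3s in 2k chunks.")

toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE)

# "^\\d" is anchored at the start of each token, so only tokens whose
# first character is a digit are dropped; "mp3s" is kept.
toks_clean <- tokens_remove(toks, pattern = "^\\d", valuetype = "regex")

as.list(toks_clean)
```

Which variant is preferable depends on whether tokens with digits in the middle should survive the cleaning step.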

Answer 1
2 2020-05-03 18:48:51

Answer 2
2 Accepted 2020-05-03 19:09:32
