
Remove digits glued to words for quanteda objects of class tokens

A related question can be found here, but it does not directly tackle the issue I discuss below.

My goal is to remove any digits glued to a token. For instance, I want to be able to get rid of the numbers in cases like 13f, 408-k, 10-k, etc. I am using quanteda as the main tool. I have a classic corpus object which I tokenized using the function tokens(). The argument remove_numbers = TRUE does not seem to work in such cases, since it simply ignores these tokens and leaves them where they are. If I use tokens_remove() with a specific regex, the tokens are removed altogether, which I want to avoid since I am interested in the remaining textual content.

Here is a minimal example in which I solve the issue with the function str_remove_all() from stringr. It works, but can be very slow for big objects.

My question is: is there a way to achieve the same result without leaving quanteda (e.g., operating on an object of class tokens)?

library(quanteda)
#> Package version: 2.1.2
#> Parallel computing: 2 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
#> 
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#> 
#>     View
library(stringr)

mytext = c( "This is a sentence with correctly spaced digits like K 16.",
            "This is a sentence with uncorrectly spaced digits like 123asd and well101.")

# Tokenizing
mytokens = tokens(mytext, 
                  remove_punct = TRUE,
                  remove_numbers = TRUE )
mytokens
#> Tokens consisting of 2 documents.
#> text1 :
#>  [1] "This"      "is"        "a"         "sentence"  "with"      "correctly"
#>  [7] "spaced"    "digits"    "like"      "K"        
#> 
#> text2 :
#>  [1] "This"        "is"          "a"           "sentence"    "with"       
#>  [6] "uncorrectly" "spaced"      "digits"      "like"        "123asd"     
#> [11] "and"         "well101"

# the tokens "123asd" and "well101" are still there.
# I can be more specific using a regex but this removes the tokens altogether
# 
mytokens_wrong = tokens_remove( mytokens, pattern = "[[:digit:]]", valuetype = "regex")
mytokens_wrong
#> Tokens consisting of 2 documents.
#> text1 :
#>  [1] "This"      "is"        "a"         "sentence"  "with"      "correctly"
#>  [7] "spaced"    "digits"    "like"      "K"        
#> 
#> text2 :
#>  [1] "This"        "is"          "a"           "sentence"    "with"       
#>  [6] "uncorrectly" "spaced"      "digits"      "like"        "and"

# This is the workaround which seems to be working but can be very slow.
# I am using stringr::str_remove_all() function
# 
mytokens_ok = lapply( mytokens, function(x) str_remove_all( x, "[[:digit:]]" ) )
mytokens_ok
#> $text1
#>  [1] "This"      "is"        "a"         "sentence"  "with"      "correctly"
#>  [7] "spaced"    "digits"    "like"      "K"        
#> 
#> $text2
#>  [1] "This"        "is"          "a"           "sentence"    "with"       
#>  [6] "uncorrectly" "spaced"      "digits"      "like"        "asd"        
#> [11] "and"         "well"

Created on 2021-02-15 by the reprex package (v0.3.0)

The other answer is a clever use of tokens_split(), but it won't always work if you want digits removed from the middle of words, since it splits the original word containing inner digits into two separate tokens.
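To make that caveat concrete, here is a minimal sketch (the word "mid1dle" is an invented example, not from the original post) showing how splitting on digits leaves two tokens behind:

```r
library(quanteda)

# A word with a digit in the middle
toks_inner <- tokens("a mid1dle example")

# Splitting on the digit removes it, but "mid1dle" becomes
# two tokens, "mid" and "dle", instead of one token "middle"
tokens_split(toks_inner, separator = "[[:digit:]]", valuetype = "regex")
```

The types-replacement approach below keeps each word as a single token instead.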

Here's an efficient way to remove the numeric characters from the types (the unique tokens/words):

library("quanteda")
## Package version: 2.1.2

mytext <- c(
  "This is a sentence with correctly spaced digits like K 16.",
  "This is a sentence with uncorrectly spaced digits like 123asd and well101."
)
toks <- tokens(mytext, remove_punct = TRUE, remove_numbers = TRUE)

# get all types with digits
typesnum <- grep("[[:digit:]]", types(toks), value = TRUE)
typesnum
## [1] "123asd"  "well101"

# replace the types with types without digits
tokens_replace(toks, typesnum, gsub("[[:digit:]]", "", typesnum))
## Tokens consisting of 2 documents.
## text1 :
##  [1] "This"      "is"        "a"         "sentence"  "with"      "correctly"
##  [7] "spaced"    "digits"    "like"      "K"        
## 
## text2 :
##  [1] "This"        "is"          "a"           "sentence"    "with"       
##  [6] "uncorrectly" "spaced"      "digits"      "like"        "asd"        
## [11] "and"         "well"

Note that I normally recommend stringi for all regex operations, but I used base R functions here for simplicity.
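For reference, a sketch of the stringi equivalent of the gsub() call above; stri_replace_all_regex() is vectorized, so it applies directly to the whole vector of types:

```r
library(stringi)

# Same types-with-digits vector as in the answer above
typesnum <- c("123asd", "well101")

# Drop every digit from each type
stri_replace_all_regex(typesnum, "[[:digit:]]", "")
# [1] "asd"  "well"
```

Either version can be passed as the replacement argument to tokens_replace().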

Created on 2021-02-15 by the reprex package (v1.0.0)

In this case you could (ab)use tokens_split(). You split the tokens on the digits, and by default tokens_split() removes the separator. This way you can do everything in quanteda.

library(quanteda)

mytext = c( "This is a sentence with correctly spaced digits like K 16.",
            "This is a sentence with uncorrectly spaced digits like 123asd and well101.")

# Tokenizing
mytokens = tokens(mytext, 
                  remove_punct = TRUE,
                  remove_numbers = TRUE)

tokens_split(mytokens, separator = "[[:digit:]]", valuetype = "regex")
Tokens consisting of 2 documents.
text1 :
 [1] "This"      "is"        "a"         "sentence"  "with"      "correctly" "spaced"    "digits"    "like"     
[10] "K"        

text2 :
 [1] "This"        "is"          "a"           "sentence"    "with"        "uncorrectly" "spaced"      "digits"     
 [9] "like"        "asd"         "and"         "well"       
