
R - grab the exact 8-digit number in a string and transform it

I have two problems extracting and transforming data using R. Here's the dataset:

messageID | msg
1111111111 | hey id 18271801, fix it asap
2222222222 | please fix it soon id12901991 and 91222911. dissapointed
3333333333 | wow $300 expensive man, come on
4444444444 | number 2837169119 test

The problems are:

  • I want to grab only the numbers that are exactly 8 digits long. In the dataset above, message IDs 3333... (300, 3 digits) and 4444... (2837169119, 10 digits) should not be included. Here's my best attempt so far:

     as.matrix(unlist(apply(df[2],1,function(x){regmatches(x,gregexpr('([0-9]){8}', x))}))) 

    However, with this line of code, message 4444... is included, because an 8-digit run also matches inside a number that has more than 8 digits.

  • Transform the data into another form, like this:

     message_id | customer_ID
     1111111111 | 18271801
     2222222222 | 12901991
     2222222222 | 91222911

    I don't know how to transform the data efficiently. The output of dput(df):

     structure(list(id = c(1111111111, 2222222222, 3333333333, 4444444444 ), msg = c("hey id 18271801, fix it asap", "please fix it soon id12901991 and 91222911. dissapointed", "wow $300 expensive man, come on", "number 2837169119 test")), .Names = c("id", "msg"), row.names = c(NA, 4L), class = "data.frame") 

    Thanks

  • Use rebus to create your regular expression, and stringr to extract the matches.

    You may need to play with the exact form of the regular expression. This code works on your examples, but you'll probably need to adapt it for your dataset.

    library(rebus)
    library(stringr)
    
    # Create regex
    rx <- negative_lookbehind(DGT) %R%
      dgt(8) %R%  
      negative_lookahead(DGT)
    rx
    ## <regex> (?<!\d)[\d]{8}(?!\d)
    
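    # Extract the IDs. stringr's default (ICU) regex engine supports
    # lookarounds, so the pattern is passed directly; the old perl()
    # wrapper is deprecated in current stringr releases.
    extracted_ids <- str_extract_all(df$msg, rx)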
    
    # Stuff the IDs into a data frame.
    data.frame(
      messageID = rep(
        df$id, 
        vapply(extracted_ids, length, integer(1))
      ),
      extractedID = unlist(extracted_ids, use.names = FALSE)
    )
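
If you would rather not add package dependencies, the same idea can be sketched in base R. This is only an illustration built on the sample data from the question: the pattern is the lookaround regex printed above, written by hand, and the output column names (message_id, customer_ID) are taken from the desired output in the question rather than from the answer's code.

```r
# Sample data reproduced from the question's dput() output.
df <- data.frame(
  id = c(1111111111, 2222222222, 3333333333, 4444444444),
  msg = c(
    "hey id 18271801, fix it asap",
    "please fix it soon id12901991 and 91222911. dissapointed",
    "wow $300 expensive man, come on",
    "number 2837169119 test"
  ),
  stringsAsFactors = FALSE
)

# Exactly 8 digits, with no digit immediately before or after.
pattern <- "(?<!\\d)\\d{8}(?!\\d)"

# perl = TRUE is needed for lookbehind/lookahead in base R regex functions.
matches <- regmatches(df$msg, gregexpr(pattern, df$msg, perl = TRUE))

# Repeat each message id once per match; messages with no match drop out.
result <- data.frame(
  message_id  = rep(df$id, lengths(matches)),
  customer_ID = unlist(matches, use.names = FALSE),
  stringsAsFactors = FALSE
)
result
```

On the sample data this should give three rows: one for message 1111111111 and two for message 2222222222. The extracted IDs stay as character strings here; wrap them in as.numeric() if you need numbers.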
    
