
R - grab the numbers that are exactly 8 digits long in a string and transform the result

I have 2 problems in extracting and transforming data using R. Here's the dataset:

messageID | msg
1111111111 | hey id 18271801, fix it asap
2222222222 | please fix it soon id12901991 and 91222911. dissapointed
3333333333 | wow $300 expensive man, come on
4444444444 | number 2837169119 test

The problem is:

  • I want to grab only the numbers that are exactly 8 digits long. In the dataset above, message 3333... (300, 3 digits) and message 4444... (2837169119, 10 digits) should not be included. Here's my best shot so far:

     as.matrix(unlist(apply(df[2],1,function(x){regmatches(x,gregexpr('([0-9]){8}', x))}))) 

    However, with this line of code, message 4444... is included, because its 10-digit number contains a run of 8 digits.

  • Transform the data to another form like this:

     message_id | customer_ID
     1111111111 | 18271801
     2222222222 | 12901991
     2222222222 | 91222911

    I don't know how to transform the data efficiently. Here is the output of dput(df):

     structure(list(id = c(1111111111, 2222222222, 3333333333, 4444444444),
         msg = c("hey id 18271801, fix it asap",
                 "please fix it soon id12901991 and 91222911. dissapointed",
                 "wow $300 expensive man, come on",
                 "number 2837169119 test")),
         .Names = c("id", "msg"), row.names = c(NA, 4L), class = "data.frame")

    Thanks

  • Use rebus to create your regular expression, and stringr to extract the matches.

    You may need to play with the exact form of the regular expression. This code works on your examples, but you'll probably need to adapt it for your dataset.

    library(rebus)
    library(stringr)
    
    # Create regex
    rx <- negative_lookbehind(DGT) %R%
      dgt(8) %R%  
      negative_lookahead(DGT)
    rx
    ## <regex> (?<!\d)[\d]{8}(?!\d)
    
    # Extract the IDs. stringr's default regex engine supports lookarounds,
    # so the deprecated perl() wrapper is not needed.
    extracted_ids <- str_extract_all(df$msg, rx)
    
    # Stuff the IDs into a data frame.
    data.frame(
      messageID = rep(
        df$id, 
        vapply(extracted_ids, length, integer(1))
      ),
      extractedID = unlist(extracted_ids, use.names = FALSE)
    )
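
If you would rather not depend on rebus and stringr, the same idea can be sketched in base R. This is only a minimal sketch under the assumptions of the example data: the pattern is hand-written instead of generated by rebus, and the output column names message_id and customer_ID are taken from the desired output in the question.

```r
# Rebuild the example data from the question's dput() output.
df <- data.frame(
  id  = c(1111111111, 2222222222, 3333333333, 4444444444),
  msg = c(
    "hey id 18271801, fix it asap",
    "please fix it soon id12901991 and 91222911. dissapointed",
    "wow $300 expensive man, come on",
    "number 2837169119 test"
  ),
  stringsAsFactors = FALSE
)

# Exactly 8 digits: no digit directly before or after the run.
# Lookarounds require perl = TRUE in base R.
pattern <- "(?<![0-9])[0-9]{8}(?![0-9])"

# One character vector of matches per message (empty when nothing matches).
ids <- regmatches(df$msg, gregexpr(pattern, df$msg, perl = TRUE))

# Repeat each message ID once per match, then flatten the matches.
result <- data.frame(
  message_id  = rep(df$id, lengths(ids)),
  customer_ID = unlist(ids, use.names = FALSE),
  stringsAsFactors = FALSE
)
print(result)
```

Messages without a match contribute zero rows, so 3333... and 4444... drop out automatically. Note that a word-boundary pattern such as \\b[0-9]{8}\\b would miss id12901991, because there is no word boundary between a letter and a digit; that is why the lookarounds test for neighbouring digits instead.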
    
