
Extract a numeric value of a specific length from string in R using regex

Looks like a repeat question, but other answers haven't helped me. I'm trying to extract any 8-digit number from a text. The number could be anywhere in the text: it could stand alone, or it could follow or be followed by other characters. Basically, I need to extract every occurrence of 8 consecutive numeric characters from a string in R, using regex only.

This is what I attempted but to no avail:

> my_text <- "the number 5849 and 5555555555 shouldn't turn up. but12345654 and 99119911 should be. let's see if 1234567H also works. It shouldn't. both 12345678JE and RG10293847 should turn up as well."

> ## this doesn't work
    > gsub('(\\d{8})', '\\1', my_text)
    [1] "the number 5849 and 5555555555 shouldn't turn up. but12345654 and 99119911 should be. let's see if 1234567H also works. It shouldn't. both 12345678JE and RG10293847 should turn up as well."

My desired output should extract the following numbers:

12345654
99119911 
12345678
10293847

While at it, I would also be grateful if the answer includes a second regex expression for extracting only the first occurrence of the 8-digit number:

12345654

EDIT: I have a very large table (about 200 million rows), and I need to run this on one column. What is the most efficient solution?

EDIT: I realised that my test case was missing some cases. The text also contains runs of digits longer than 8, but I only want to extract the numbers that are exactly 8 digits long.

We can use str_extract_all

stringr::str_extract_all(my_text, "\\d{8}")[[1]]
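#[1] "55555555" "12345654" "99119911" "12345678" "10293847"

Note that with my_text as shown in the question, the plain \\d{8} pattern also returns the first 8 digits of the longer run 5555555555. The EDIT below handles that case.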

Similarly, in base R we can use gregexpr and regmatches

regmatches(my_text, gregexpr("\\d{8}", my_text))[[1]]

To get the last 8-digit number, we can use

sub('.*(\\d{8}).*', '\\1', my_text)
#[1] "10293847"

whereas for the first one, we can use

sub('.*?(\\d{8}).*', '\\1', my_text)
#[1] "55555555"
# (the first 8 digits of the 10-digit run 5555555555 in my_text)
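If the text can contain digit runs longer than 8, as in the question's second edit, the first match should probably also be restricted to exactly 8 digits. The following is a minimal base R sketch of one way to do that, assuming the look-around pattern discussed further down; `pat` and `first_8` are names chosen here for illustration.

```r
my_text <- "the number 5849 and 5555555555 shouldn't turn up. but12345654 and 99119911 should be. let's see if 1234567H also works. It shouldn't. both 12345678JE and RG10293847 should turn up as well."

# Look-behind / look-ahead assertions are PCRE features, so perl = TRUE is needed
pat <- "(?<![0-9])[0-9]{8}(?![0-9])"

# regexpr() locates only the first match; regmatches() extracts the matched text
first_8 <- regmatches(my_text, regexpr(pat, my_text, perl = TRUE))
first_8
#[1] "12345654"
```

When there is no match at all, regmatches() returns character(0) here rather than NA, which is worth keeping in mind before using the result downstream.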

EDIT

For the updated case, where we want to match exactly 8 digits (and not more), we can use str_match_all with a negative lookbehind and a negative lookahead

stringr::str_match_all(my_text, "(?<!\\d)\\d{8}(?!\\d)")[[1]][, 1]
#[1] "12345654" "99119911" "12345678" "10293847"

Here, we get the 8-digit numbers that are neither preceded nor followed by a digit.
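The same look-around pattern should also work without stringr. As a sketch, base R's gregexpr() accepts it when perl = TRUE is set (the default regex engine does not support look-arounds); `exact_8` is just an illustrative name.

```r
my_text <- "the number 5849 and 5555555555 shouldn't turn up. but12345654 and 99119911 should be. let's see if 1234567H also works. It shouldn't. both 12345678JE and RG10293847 should turn up as well."

# (?<!\\d) : not preceded by a digit, (?!\\d) : not followed by a digit
exact_8 <- regmatches(
  my_text,
  gregexpr("(?<!\\d)\\d{8}(?!\\d)", my_text, perl = TRUE)
)[[1]]
exact_8
#[1] "12345654" "99119911" "12345678" "10293847"
```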

A simple option could also be to extract all the numbers from the string and keep only 8-digit numbers

v1 <- regmatches(my_text, gregexpr("\\d+", my_text))[[1]]
v1[nchar(v1) == 8]

We can make the pattern more specific, using look-arounds so that 8 digits inside a longer run of digits are not matched

library(stringr)
str_extract_all(my_text, "(?<![0-9])[0-9]{8}(?![0-9])")[[1]]
#[1] "12345654" "99119911" "12345678" "10293847"

To check the difference

v1 <- "hello8888882343, 888884399, 88888888, 8888888888"
str_extract_all(v1, "\\d{8}")
#[[1]]
#[1] "88888823" "88888439" "88888888" "88888888"

Here, the plain pattern also extracts 8-digit substrings from runs of more than 8 consecutive digits. According to the OP's post, those should be left out.

str_extract_all(v1,  "(?<![0-9])[0-9]{8}(?![0-9])")
#[[1]]
#[1] "88888888"
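For the large-table case mentioned in the question, one option is to apply the pattern to the whole column in a single vectorized call instead of looping over rows. The sketch below is only an illustration under assumptions: `extract_first_8`, `df` and `txt` are hypothetical names, the data is a toy stand-in, and nothing here has been benchmarked on 200 million rows.

```r
# Hypothetical helper: first exactly-8-digit number per element, NA if none
extract_first_8 <- function(x) {
  # perl = TRUE is required for the look-behind / look-ahead assertions
  m <- regexpr("(?<![0-9])[0-9]{8}(?![0-9])", x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  hit <- !is.na(m) & m > -1L      # elements that actually contain a match
  out[hit] <- regmatches(x, m)    # regmatches() returns matched elements only
  out
}

# Toy stand-in for the real table; `df` and `txt` are made-up names
df <- data.frame(
  txt = c("but12345654 and 99119911", "5555555555",
          "RG10293847", "no digits here", NA),
  stringsAsFactors = FALSE
)
df$first_8 <- extract_first_8(df$txt)
df$first_8
# gives "12345654", NA, "10293847", NA, NA
```

stringr::str_extract() and stringi::stri_extract_first_regex() are also vectorized and return NA for elements without a match, so they could be dropped in with the same pattern. Which of these is fastest on a given table would need to be measured.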
