
Extracting string value from unstructured text

I'm working with data that was structured to use a single field for multiple purposes. I have over 10 thousand records to process, and I need to extract a specific, meaningful series of characters into a different field in my data frame. There is a predictable pattern to what I need to extract; below is an example:

x = "This field has lots of text and also what I need to extract from it which is 555_AB345678"

What I need to extract is the 555_AB345678 value. The leading 3 characters (555) and the underscore are predictable; the AB345678 part is not. However, at least the last 4 characters of the value are always numeric. I cannot guarantee that the value I want is at the end of the string, but in most cases it is, so I'd be satisfied to start there.

I've explored using gregexpr() with substring(), but haven't got it to work yet. I was thinking strsplit() could work, however I don't have a predictable delimiter to split on (just a predictable pattern in the values I need). I've also found similar questions, but none that seem to meet my criteria.

extract a substring in R according to a pattern

I'd like to see if anyone here has recommendations on how this could be done.

The base R way is with this convoluted extractor:

regmatches(x, regexpr("555_.*$", x))
# "555_AB345678"

$ anchors the match to the end of the string, and .* matches any sequence of characters (including an empty one).
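Because .* runs all the way to $, the pattern above returns everything from 555_ to the end of the string. If the value can sit in the middle of the text, a tighter pattern may help. Here is a minimal sketch, assuming the code after the underscore contains only letters and digits and ends in four digits; the sample string y and the pattern are illustrative assumptions, not part of the original data:

# Hypothetical input where the code is not at the end of the string
y <- "The code 555_AB345678 appears in the middle of this text"

# Assumed shape: "555_", then letters/digits, ending with four digits
pattern <- "555_[[:alnum:]]*[0-9]{4}"

# Stops at the first character that is not a letter or digit
regmatches(y, regexpr(pattern, y))
# "555_AB345678"

With the original "555_.*$" pattern, the same call on y would also return the trailing text after the code.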


Alternately, we can replace the whole string with just the part needed:

sub("^.*(555_.*)$", "\\1", x)
# "555_AB345678"

^ is the start of the string, so we are now matching the whole string, from ^ to $ . The \\1 in the replacement refers to the part captured in parentheses. See ?regex for details. For an extractor with nicer syntax, you could try the stringr package:

library(stringr)
str_extract(x, "555_.*$")
# "555_AB345678"
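Since the goal is to fill a new column across many records, it is worth noting how each approach behaves when a record has no match: regmatches() drops non-matching elements (so the result can be shorter than the input), and sub() returns the input unchanged. The base R sketch below keeps one result per row and uses NA where nothing matches. The data frame df, its column names, and the tightened pattern are assumptions made up for illustration:

# Hypothetical data frame; the column names are made up
df <- data.frame(
  notes = c(
    "lots of text and then 555_AB345678",
    "no code in this record",
    "code 555_XY001234 in the middle"
  ),
  stringsAsFactors = FALSE
)

# Assumed shape: "555_", then letters/digits, ending with four digits
pattern <- "555_[[:alnum:]]*[0-9]{4}"

m <- regexpr(pattern, df$notes)       # -1 where there is no match
df$code <- NA_character_              # default for rows without a code
df$code[m != -1] <- regmatches(df$notes, m)

df$code
# "555_AB345678" NA "555_XY001234"

stringr's str_extract() already returns NA for non-matching elements, so its result can be assigned to a column directly.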

You have a pattern!

Three leading characters, an underscore, something, then three digits: that is enough to build this expression:

/.{3}_.*\d{3}/

https://regex101.com/r/bD0pF2/2
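To use this expression from R, the /.../ delimiters are dropped and the backslash is doubled inside the string literal. A minimal sketch with the question's example string; the second, tighter pattern is an assumption based on the question's statement that the value starts with digits and ends in at least four digits:

x <- "This field has lots of text and also what I need to extract from it which is 555_AB345678"

# The expression above, written as an R string (perl = TRUE for \d)
regmatches(x, regexpr(".{3}_.*\\d{3}", x, perl = TRUE))
# "555_AB345678"

# Assumed tighter variant: three digits, underscore, letters/digits, four digits
regmatches(x, regexpr("[0-9]{3}_[[:alnum:]]*[0-9]{4}", x, perl = TRUE))
# "555_AB345678"

Note that .{3} accepts any three characters and .* can span spaces, so the first form may over-match if the text contains another underscore or trailing digits.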
