Extracting string value from unstructured text

I'm working with data that was structured to use a single field for multiple purposes. I have over 10 thousand records to process, and I need to extract a specific, meaningful series of characters into a different field in my data frame. There is a predictable pattern to what I need to extract; below is an example:

x = "This field has lots of text and also what I need to extract from it which is 555_AB345678"

What I need to extract is the 555_AB345678 value. The leading 3 characters (555) and the underscore are predictable; the AB345678 part is not. However, at least the last 4 characters of the value are always numeric. I cannot guarantee that the value I want is at the end of the string, but in most cases it is, so I'd be satisfied to start there.

I've explored using gregexpr() with substring(), but haven't got it to work yet. I was thinking strsplit() could work; however, I don't have a predictable delimiter to split on (just a predictable pattern in the values I need). I've also found similar questions, but none that seem to meet my criteria.

extract a substring in R according to a pattern

I'd like to see if anyone here has recommendations on how this could be done.

The base R way is with this convoluted extractor:

regmatches(x, regexpr("555_.*$", x))
# "555_AB345678"

$ matches the end of the string, and .* matches any sequence of characters (including an empty one), so the pattern captures everything from 555_ to the end of the string.
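Since the question involves a data frame column, and the value is not always at the end of the string, here is a minimal sketch of the same regexpr()/regmatches() idea applied to a whole column. The data frame df, its columns notes and code, the sample records, and the tightened pattern (alphanumerics ending in four digits) are all assumptions for illustration, not part of the original question.

```r
# Hypothetical data; the column names `notes` and `code` are made up.
df <- data.frame(
  notes = c(
    "This field has lots of text and also what I need to extract from it which is 555_AB345678",
    "Here 555_XY1234 appears in the middle of the text",
    "No code in this record"
  ),
  stringsAsFactors = FALSE
)

# Assumed pattern: "555_", then any alphanumerics, ending in four digits.
pattern <- "555_[[:alnum:]]*[0-9]{4}"

# regexpr() returns -1 where there is no match, and regmatches() drops
# those elements, so fill a column of NAs by position instead.
m <- regexpr(pattern, df$notes)
df$code <- NA_character_
df$code[m != -1] <- regmatches(df$notes, m)

df$code
# "555_AB345678" "555_XY1234" NA
```

Filling by position keeps the new column aligned with the original rows, which a bare regmatches() call would not do when some records have no match.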


Alternatively, we can replace the whole string with just the part needed:

sub("^.*(555_.*)$", "\\1", x)
# "555_AB345678"

^ is the start of the string, so we are now matching the whole string, from ^ to $. The \\1 replacement refers to the part in parentheses. See ?regex for details.

For an extractor with nicer syntax, you could try the stringr package:

library(stringr)
str_extract(x, "555_.*$")
# "555_AB345678"
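One caveat with the sub() approach shown earlier: when the pattern does not match, sub() returns the input unchanged rather than an empty or missing value. The sketch below guards against that with grepl(); the second sample string and the variable names extracted and result are assumptions for illustration.

```r
# Two sample records; the second one is hypothetical and has no code in it.
x <- c(
  "which is 555_AB345678",
  "No code in this record"
)

# sub() leaves non-matching strings untouched.
extracted <- sub("^.*(555_.*)$", "\\1", x)

# Keep the extracted value only where the prefix is actually present.
result <- ifelse(grepl("555_", x, fixed = TRUE), extracted, NA_character_)
result
# "555_AB345678" NA
```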

You have a pattern!

threeLeadingValues-underscore-something-threeDigits is enough to make this expression:

/.{3}_.*\d{3}/

https://regex101.com/r/bD0pF2/2
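To use that expression from R, a rough sketch follows. The surrounding slashes are regex101 delimiters and are dropped, and the backslash has to be doubled inside an R string literal. Using regexpr() with perl = TRUE here is one choice among several, not something the answer prescribes.

```r
x <- "This field has lots of text and also what I need to extract from it which is 555_AB345678"

# Same expression as above, written as an R string (note the doubled backslash).
m <- regexpr(".{3}_.*\\d{3}", x, perl = TRUE)
regmatches(x, m)
# "555_AB345678"
```

Because .* is greedy, this pattern would run on to the last three-digit group in the string if more text followed the value, so a tighter character class may be needed for records where the value is not at the end.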
