Regex in R: efficiently extracting text before pattern and number immediately after pattern

Suppose that in R, I have many strings composed of a mixture of words and numbers, with one thing in common: every string contains the pattern zzzz followed by a space and then a number with an unknown number of digits.

For instance:

x <- "many words, some number like 908, then zzzz 145 and some other numbers like 377 and so on"

What I am trying to do is extract both the number that comes right after the recurring pattern zzzz and the text that comes before it.

Following this answer, I know how to extract the number after the pattern:

regmatches(x, gregexpr("zzzz \\K\\d+", x, perl=TRUE))

With the example x above, that returns "145". What I am trying to find is the most efficient way (since I have millions of strings to evaluate) to retrieve both the number after the pattern and the content before it, which means returning the vector:

"many words, some number like 908, then " "145"

What would be the most efficient way of achieving that in R?

We can extract the data in two groups:

  1. Everything up to the zzzz pattern
  2. The number that follows zzzz

strcapture('(.*) zzzz (\\d+)', x, list(col1 = character(), col2 = numeric()))

#                                    col1 col2
#1 many words, some number like 908, then  145       
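
Since the question mentions millions of strings, it may help to see that the same call accepts a whole character vector. The following is only a sketch: the vector xs and the column names before and number are made up for illustration, and the second string is invented sample data.

```r
# Hypothetical sample vector; the second element is invented for illustration.
xs <- c(
  "many words, some number like 908, then zzzz 145 and some other numbers like 377 and so on",
  "another 12 line zzzz 7 end"
)

# One call over the whole vector:
#   group 1 = text before " zzzz ", group 2 = digits right after it.
# The prototype list sets the column names and types of the result.
res <- strcapture("(.*) zzzz (\\d+)", xs,
                  list(before = character(), number = numeric()))
res
```

Each input string becomes one row of the resulting data frame. Strings that do not match the pattern should come back as NA in both columns, so they can be filtered out afterwards.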

Here is a base R option using strsplit and sub:

x <- "many words, some number like 908, then zzzz 145 and some other numbers like 377 and so on"
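# Split at "zzzz" plus the whitespace after it: parts[1] is the text before the pattern
parts <- strsplit(x, "\\bzzzz\\s+")[[1]]
# Keep only the leading number of the second piece by dropping everything from the first whitespace on
parts[2] <- sub("\\s+.*$", "", parts[2])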
parts

[1] "many words, some number like 908, then "
[2] "145"
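
The strsplit version above handles a single string. For a whole vector, one possible variant (a sketch, not benchmarked here) is two vectorized sub calls. The names xs, before and number are hypothetical, and the second sample string is invented.

```r
# Hypothetical sample vector; the second element is invented for illustration.
xs <- c(
  "many words, some number like 908, then zzzz 145 and some other numbers like 377 and so on",
  "another 12 line zzzz 7 end"
)

# Text before the pattern: delete everything from "zzzz" to the end of the string.
before <- sub("\\bzzzz\\s+.*$", "", xs, perl = TRUE)

# Number after the pattern: match the whole string, keep only the captured digits.
number <- sub("^.*\\bzzzz\\s+(\\d+).*$", "\\1", xs, perl = TRUE)

before
number
```

Note that sub returns a string unchanged when the pattern does not match, so this sketch assumes, as the question states, that every string contains zzzz followed by a number.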
