
Regex in R lookbehind assertion

I'm trying to do some pattern matching with the extract function from tidyr. I've tested my regex on a regex practice site, where the pattern seems to work, and I am using a lookbehind assertion.

I have the following sample text:

=[\"{ Key = source, Values = web,videoTag,assist }\",\"{ Key = type, 
Values = attack }\",\"{ Key = team, Values = 2 }\",\"{ Key = 
originalStartTimeMs, Values = 56496 }\",\"{ Key = linkId, Values = 
1551292895649 }\",\"{ Key = playerJersey, Values = 8 }\",\"{ Key = 
attackLocationStartX, Values = 3.9375 }\",\"{ Key = 
attackLocationStartY, Values = 0.739376770538243 }\",\"{ Key = 
attackLocationStartDeflected, Values = false }\",\"{ Key = 
attackLocationEndX, Values = 1.7897727272727275 }\",\"{ Key = 
attackLocationEndY, Values = -1.3002832861189795 }\",\"{ Key = 
attackLocationEndDeflected, Values = false }\",\"{ Key = lastModified, 
Values = web,videoTag,assist 

I want to grab the numbers following attackLocationX (all numbers following any text about an attack location).

Using the following code with a lookbehind assertion, however, I get no results:

df %>% 
  extract(message, "x_start",'((?<=attackLocationStartX,/sValues/s=/s)[0-9.]+)')

This function returns NA if no pattern match is found, and my target column is all NA values even though I tested the pattern on www.regexr.com. According to the documentation, R pattern matching supports lookbehind assertions, so I'm not sure what else to do here.

I'm not sure about the lookbehind part, but in R you need to escape backslashes inside string literals. This isn't obvious if you are using a regex checker that isn't R-specific.

More info here.
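As a quick check of the escaping point, here is a minimal base R sketch. It is not part of the original answers, and the sample string and variable names are made up for illustration. It shows that /s is matched literally as a slash followed by "s", while \\s in an R string literal reaches the regex engine as \s and matches whitespace.

```r
# Hypothetical one-line sample, shortened from the question's text.
txt <- "Key = attackLocationStartX, Values = 3.9375"

# "/s" is not an escape: the engine looks for a literal slash followed by "s".
slash_match <- grepl("StartX,/sValues", txt, perl = TRUE)

# "\\s" in R source is the two-character regex \s, which matches whitespace.
space_match <- grepl("StartX,\\sValues", txt, perl = TRUE)

print(c(slash = slash_match, space = space_match))

# The string literal "\\s" holds exactly two characters: a backslash and "s".
print(nchar("\\s"))
```

Online testers such as regexr take the pattern as raw text, so a single backslash is enough there; the doubling is only needed because R parses the string literal first.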

So you might want your regex to look something like:

"((?<=attackLocationStartX,\\sValues\\s=\\s)[0-9.]+)"

First of all, to match whitespace you need \\s, not /s.

You do not have to use a lookbehind here, as extract will return the captured substrings if capturing groups are used in the pattern.

Use
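For comparison, the lookbehind itself does work once the whitespace is escaped correctly. The sketch below uses base R's regexpr with perl = TRUE rather than tidyr (a choice made here to keep the example dependency-free), and the shortened msg string is an assumption standing in for the real column.

```r
# Shortened stand-in for the question's sample text (assumption for brevity).
msg <- paste0(
  "{ Key = attackLocationStartX, Values = 3.9375 }, ",
  "{ Key = attackLocationStartY, Values = 0.739376770538243 }"
)

# Lookbehind needs a PCRE-style engine, hence perl = TRUE in base R.
# Each \\s matches exactly one whitespace character, so the lookbehind
# has a fixed length, which PCRE requires.
pattern <- "(?<=attackLocationStartX,\\sValues\\s=\\s)[0-9.]+"

# regexpr finds the first match; regmatches pulls out the matched text.
x_start <- regmatches(msg, regexpr(pattern, msg, perl = TRUE))
print(x_start)
```

Because the lookbehind is zero-width, the match contains only the number. A capture group gets the same result and also allows variable whitespace such as \\s*, which a fixed-length lookbehind does not.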

df %>% 
  extract(message, "x_start", "attackLocationStartX\\s*,\\s*Values\\s*=\\s*(-?\\d+\\.\\d+)")

Output: 3.9375

The regex may also look like "attackLocationStartX\\s*,\\s*Values\\s*=\\s*(-?\\d[.0-9]*)".

As the (-?\\d+\\.\\d+) part is captured, only the text in this group will be the output.
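To illustrate that only the captured group is returned, here is a minimal base R sketch using regexec and regmatches. It does not call tidyr. The helper extract_value, the keys vector, and the shortened msg string are hypothetical names introduced for this example only. It also pulls all four attack location coordinates, which is what the question was ultimately after.

```r
# Shortened stand-in for the question's sample text (assumption for brevity).
msg <- paste0(
  "{ Key = attackLocationStartX, Values = 3.9375 }, ",
  "{ Key = attackLocationStartY, Values = 0.739376770538243 }, ",
  "{ Key = attackLocationStartDeflected, Values = false }, ",
  "{ Key = attackLocationEndX, Values = 1.7897727272727275 }, ",
  "{ Key = attackLocationEndY, Values = -1.3002832861189795 }"
)

# Hypothetical helper: return the number captured after a given key, or NA.
extract_value <- function(key, text) {
  pattern <- paste0(key, "\\s*,\\s*Values\\s*=\\s*(-?\\d+\\.\\d+)")
  m <- regmatches(text, regexec(pattern, text, perl = TRUE))[[1]]
  # m[1] is the whole match, m[2] is the first capturing group.
  if (length(m) < 2) NA_real_ else as.numeric(m[2])
}

keys <- c("attackLocationStartX", "attackLocationStartY",
          "attackLocationEndX", "attackLocationEndY")

# Named numeric vector with one coordinate per key.
coords <- sapply(keys, extract_value, text = msg)
print(coords)
```

The key name and the "Values =" text are part of the overall match but sit outside the parentheses, so they never appear in the result.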

Pattern details

  • (-?\\d+\\.\\d+) - a capturing group that matches:
    • -? - an optional hyphen ( ? means 1 or 0 occurrences )
    • \\d+ - 1 or more digits ( + means 1 or more )
    • \\. - a dot
    • \\d+ - 1 or more digits
  • \\d[.0-9]* - a digit ( \\d ), followed by 0 or more dots or digits ( [.0-9]* )
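The two number patterns differ in what they accept, which the following base R sketch tries to make concrete. The ^ and $ anchors and the test values are additions for this illustration, not part of the original answer.

```r
# Values taken from the sample text, plus one malformed string.
values <- c("3.9375", "-1.3002832861189795", "8", "56496", "1.2.3")

# Anchored variants so the whole string must match (anchors are an addition).
strict <- "^-?\\d+\\.\\d+$"   # requires digits, a dot, then digits
loose  <- "^-?\\d[.0-9]*$"    # a digit, then any mix of dots and digits

strict_ok <- grepl(strict, values, perl = TRUE)
loose_ok  <- grepl(loose, values, perl = TRUE)

print(data.frame(values, strict_ok, loose_ok))
```

The strict form skips integer values such as the playerJersey entry, while the loose form accepts them but also lets through strings with several dots, so the choice depends on how clean the data is expected to be.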
