How can I capture the minus sign in scientific notation with a regular expression?

Question

I was trying to answer a question (since deleted) that I think was asking about extracting text representations of scientific notation. (I'm using R's implementation of regex, which requires double escapes for meta-characters and can be used in either pure PCRE or Perl modes, the difference between which I don't really understand.) I've solved most of the task but still seem to be failing to capture the leading minus sign within a capture group. The only way I can get it to succeed is by including the leading open parenthesis in the first group:

> txt <- c("this is some random text (2.22222222e-200)", "other random (3.33333e4)", "yet a third(-1.33333e-40)", 'and a fourth w/o the "e" (2.22222222-200)')
> sub("^(.+\\()([-+]{0,1}[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3})(.+$)", "\\2" ,txt)
[1] "2.22222222e-200" "3.33333e4"       "-1.33333e-40"    "2.22222222-200" 

> sub("^(.+\\()([-+]?[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3})(.+$)", "\\2" ,txt)
[1] "2.22222222e-200" "3.33333e4"       "-1.33333e-40"    "2.22222222-200" 
 #but that seems to be "cheating" ... my failures follow:

> sub("^(.+)([-+]?[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3})(.+$)", "\\2" ,txt)
[1] "2.22222222e-200" "3.33333e4"       "1.33333e-40"     "2.22222222-200" 
> sub("^(.+)(-?[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3})(.+$)", "\\2" ,txt)
[1] "2.22222222e-200" "3.33333e4"       "1.33333e-40"     "2.22222222-200" 
> sub("^(.+)(-*[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3})(.+$)", "\\2" ,txt)
[1] "2.22222222e-200" "3.33333e4"       "1.33333e-40"     "2.22222222-200"

I've searched SO to the extent of my patience with terms like 'scientific notation regex minus'.

Answer 1

You can try:

 library(stringr)
 unlist(str_extract_all(txt, '-?[0-9.]+e?[-+]?[0-9]*'))
 #[1] "2.22222222e-200" "3.33333e4"       "-1.33333e-40"    "2.22222222-200"

Or, using a method based on capturing whatever follows the opening parenthesis:

 str_extract(txt, '(?<=\\()[^)]*')
 #[1] "2.22222222e-200" "3.33333e4"       "-1.33333e-40"    "2.22222222-200"
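For anyone who would rather not load stringr, the same lookbehind idea can be sketched in base R. This is only an illustration, assuming every string contains one parenthesised number; the variable name `inside` is made up for the example, and `perl = TRUE` is needed because lookbehind is a PCRE feature.

```r
txt <- c("this is some random text (2.22222222e-200)",
         "other random (3.33333e4)",
         "yet a third(-1.33333e-40)",
         'and a fourth w/o the "e" (2.22222222-200)')

# regexpr() finds the first match in each string; the lookbehind (?<=\\()
# requires a preceding "(" without making it part of the match.
inside <- regmatches(txt, regexpr("(?<=\\()[^)]*", txt, perl = TRUE))
print(inside)
```

Strings without an opening parenthesis would simply be missing from the result, so the output can be shorter than the input.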

Answer 2

Reasoning that the "greedy" first capture group "(.+)" was gobbling up the minus sign, which is optional in the second capture group, I terminated the first capture group with a negated character class and now have success. This still seems clunky, and I'm hoping there is something more elegant. While searching I have seen Python code that seems to imply that there are regex definitions of "&real_number">

> sub("^(.+[^-+])([-+]?[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3})(.+$)", "\\2" ,txt,perl=TRUE)
[1] "2.22222222e-200" "3.33333e4"       "-1.33333e-40"    "2.22222222-200"
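To see the greedy behaviour directly, and one possible alternative to the negated character class, here is a small sketch. It assumes `perl = TRUE` (PCRE) so that the lazy quantifier `.+?` follows Perl's backtracking rules; it has not been checked against R's default engine. The names `num`, `greedy_prefix` and `lazy` are made up for the illustration.

```r
txt <- c("this is some random text (2.22222222e-200)",
         "other random (3.33333e4)",
         "yet a third(-1.33333e-40)",
         'and a fourth w/o the "e" (2.22222222-200)')

# The number pattern used throughout the question.
num <- "[-+]?[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3}"

# Greedy first group: returning group 1 shows that it has swallowed the "-".
greedy_prefix <- sub(paste0("^(.+)(", num, ")(.+$)"), "\\1", txt[3], perl = TRUE)
print(greedy_prefix)  # "yet a third(-"

# Lazy first group: it stops as early as possible, leaving the sign for group 2.
lazy <- sub(paste0("^(.+?)(", num, ")(.+$)"), "\\2", txt, perl = TRUE)
print(lazy)
```

The lazy version takes the first place where the number pattern can match, so it would behave differently from the greedy one on strings that contain more than one number.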

After looking at the code in str_extract_all, which uses substr to extract matches, I now think I should have chosen the gregexpr-regmatches paradigm for my efforts rather than the pick-the-middle-of-three-capture-groups strategy:

> hits <- gregexpr('[-+]?[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3}', txt)
> ?regmatches
> regmatches(txt, hits)
[[1]]
[1] "2.22222222e-200"

[[2]]
[1] "3.33333e4"

[[3]]
[1] "-1.33333e-40"

[[4]]
[1] "2.22222222-200"
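If each string is expected to hold only one number, a variation on the same paradigm returns a plain character vector instead of a list. This is a sketch under that assumption (strings with no match would be dropped from the result); `first_hit` is a name made up for the example.

```r
txt <- c("this is some random text (2.22222222e-200)",
         "other random (3.33333e4)",
         "yet a third(-1.33333e-40)",
         'and a fourth w/o the "e" (2.22222222-200)')

# regexpr() reports only the first match per string, so regmatches()
# returns a character vector rather than a list.
first_hit <- regmatches(txt, regexpr("[-+]?[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3}", txt))
print(first_hit)
```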

Answer 3

This seems to work, and won't match an IP address:

sub("^.*?([-+]?\\d+(?:\\.\\d*)*(?:[Ee]?[-+]?\\d+)?).*?$", "\\1", txt)
[1] "2.22222222e-200" "3.33333e4"       "-1.33333e-40"    "2.22222222-200"

Oddly, that's not quite the regex I started with. When my first try didn't work, I thought I would go back and test in Perl:

my @txt = (
  "this is some random text (2.22222222e-200)",
  "other random (3.33333e4)",
  "yet a third(-1.33333e-40)" ,
  'and a fourth w/o the "e" (2.22222222-200)');

map { s/^.*?[^-+]([-+]?\d+(?:\.\d*)*(?:[Ee]?[-+]?\d+)?).*?$/$1/ } @txt;

print join("\n", @txt),"\n";

And that looked good:

2.22222222e-200
3.33333e4
-1.33333e-40
2.22222222-200

So the same regex should work in R, right?

sub("^.*?[^-+]([-+]?\\d+(?:\\.\\d*)*(?:[Ee]?[-+]?\\d+)?).*?$", "\\1", txt)
[1] "0" "4" "0" "0"

Apparently not. I even confirmed that the double-quoted string is correct by trying it in JavaScript with new RegExp(" ... "), and it worked fine there, too. I'm not sure what's different about R, but removing the negated sign character class did the trick.
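One thing that may be different, offered as a guess rather than something confirmed in this thread: by default R's sub does not use a Perl-style engine (its documentation describes the default as extended regular expressions, with `perl = TRUE` switching to PCRE), and lazy quantifiers such as `.*?` may not behave identically in the two. Below is a sketch of the Perl-tested pattern run with `perl = TRUE`, which should follow Perl's matching rules; `res` is just an illustrative name.

```r
txt <- c("this is some random text (2.22222222e-200)",
         "other random (3.33333e4)",
         "yet a third(-1.33333e-40)",
         'and a fourth w/o the "e" (2.22222222-200)')

# Same pattern as the Perl test above, including the negated sign class,
# but matched with PCRE instead of R's default engine.
res <- sub("^.*?[^-+]([-+]?\\d+(?:\\.\\d*)*(?:[Ee]?[-+]?\\d+)?).*?$",
           "\\1", txt, perl = TRUE)
print(res)
```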
