Extracting numbers from a String in Java by splitting on a regex

I want to extract numbers from Strings like this:

String numbers[] = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34".split(PATTERN);

From such a String I'd like to extract these numbers:

  • 0.286
  • -3.099
  • -0.44
  • -2.901
  • -0.436
  • 123
  • 0.123
  • .34

That is:

  • There can be garbage characters like "M", "c", "c"
  • The "-" sign is to be included in the number, not split on
  • A "number" can be anything that Float.parseFloat can parse, so .34 is valid

What I have so far:

String PATTERN = "([^\\d.-]+)|(?=-)";

This works to some degree, but is obviously far from perfect:

  • It doesn't skip the starting garbage "M" in the example
  • It doesn't handle consecutive garbage, like the ,,, in the middle

How can I fix PATTERN to make it work?

You could use a regex like this:

([-.]?\d+(?:\.\d+)?)

Working demo

Match Information:

MATCH 1
1.  [1-6]   `0.286`
MATCH 2
1.  [6-12]  `-3.099`
MATCH 3
1.  [12-17] `-0.44`
MATCH 4
1.  [18-24] `-2.901`
MATCH 5
1.  [25-31] `-0.436`
MATCH 6
1.  [34-37] `123`
MATCH 7
1.  [38-43] `0.123`
MATCH 8
1.  [44-47] `.34`

Update

Jawee's approach

As Jawee pointed out in his comment, there is a problem with .34.34, so you can use his regex, which fixes this problem. Thanks to Jawee for pointing that out.

(-?(?:\d+)?\.?\d+)

To get a graphical idea of what happens behind this regex, you can check this Debuggex image:

[Image: regular expression visualization]

Engine explanation:

1st Capturing group (-?(?:\d+)?\.?\d+)
   -? -> matches the character - literally, zero or one time
   (?:\d+)? -> \d+ matches a digit [0-9] one or more times, inside an optional non-capturing group
   \.? -> matches the character . literally, zero or one time
   \d+ -> matches a digit [0-9] one or more times
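The difference between the two patterns is easiest to see on the `.34.34` input mentioned above. The following is a minimal sketch, not part of the original answer; the class name `DotPrefixComparison` and the helper `findAll` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotPrefixComparison {
    // Collects every match of the given regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = ".34.34";
        // First pattern: the leading '.' and the optional fraction are both consumed.
        System.out.println(findAll("[-.]?\\d+(?:\\.\\d+)?", input)); // prints [.34.34]
        // Jawee's pattern: at most one '.' per match.
        System.out.println(findAll("-?(?:\\d+)?\\.?\\d+", input));   // prints [.34, .34]
    }
}
```

The first pattern returns a single token that Float.parseFloat would reject, while the second yields two parseable tokens.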

Try this one: (-?(?:\\d+)?\\.?\\d+)
Example as below:

Demo Here

Thanks a lot to nhahtdh for the comments. That's true; we could update it as below:

[-+]?(?:\d+(?:\.\d*)?|\.\d+)

Updated Demo Here

Actually, if we take into account all possible float input String formats (e.g. Infinity, -Infinity, 00, 0xffp23d, 88F), it gets a little more complicated. However, we can still implement it with the Java code below:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Optional sign, then the special forms: hexadecimal float, NaN, Infinity
String sign = "[-+]?";
String hexFloat = "(?>0[xX](((\\p{XDigit}+)\\.?)|((\\p{XDigit}*)\\.(\\p{XDigit}+)))[pP]([-+])?(\\p{Digit}+)[fFdD]?)";
String nan = "(?>NaN)";
String inf = "(?>Infinity)";

// Decimal form: digits with an optional fraction, optional exponent, optional type suffix
String dig = "(?>\\d+(?:\\.\\d*)?|\\.\\d+)";
String exp = "(?:[eE][-+]?\\d+)?";
String suf = "[fFdD]?";
String digFloat = "(?>" + dig + exp + suf + ")";

// The hex form is tried first so that "0x..." is not cut short as a plain "0"
String wholeFloat = sign + "(?>" + hexFloat + "|" + nan + "|" + inf + "|" + digFloat + ")";

String s = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123d,.34d.34.34M24.NaNNaN,Infinity,-Infinity00,0xffp23d,88F";

Pattern floatPattern = Pattern.compile(wholeFloat);
Matcher matcher = floatPattern.matcher(s);
int i = 0;
// Print each matched token next to the value Float.parseFloat gives for it
while (matcher.find()) {
    String f = matcher.group();
    System.out.println(i++ + " : " + f + " --- " + Float.parseFloat(f));
}

Then the output is as below:

0 : 0.286 --- 0.286
1 : -3.099 --- -3.099
2 : -0.44 --- -0.44
3 : -2.901 --- -2.901
4 : -0.436 --- -0.436
5 : 123 --- 123.0
6 : 0.123d --- 0.123
7 : .34d --- 0.34
8 : .34 --- 0.34
9 : .34 --- 0.34
10 : 24. --- 24.0
11 : NaN --- NaN
12 : NaN --- NaN
13 : Infinity --- Infinity
14 : -Infinity --- -Infinity
15 : 00 --- 0.0
16 : 0xffp23d --- 2.13909504E9
17 : 88F --- 88.0

Using the regex you crafted yourself, you can solve it as follows:

String[] numbers = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34"
                          .replaceAll(PATTERN, " ")
                          .trim()
                          .split(" +");
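To see why the `trim()` and the `" +"` split are both needed, it helps to look at the intermediate string. The sketch below assumes `PATTERN` is the pattern from the question; the class and method names are hypothetical.

```java
import java.util.Arrays;

class ReplaceThenSplit {
    static final String PATTERN = "([^\\d.-]+)|(?=-)"; // the pattern from the question

    // Step 1: turn every separator (a garbage run, or the empty spot before a '-') into a space.
    static String normalize(String input) {
        return input.replaceAll(PATTERN, " ");
    }

    // Step 2: drop the outer spaces, then split on runs of spaces.
    static String[] extract(String input) {
        return normalize(input).trim().split(" +");
    }

    public static void main(String[] args) {
        String input = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34";
        // Brackets make the leading space visible.
        System.out.println("[" + normalize(input) + "]");
        // prints [0.286, -3.099, -0.44, -2.901, -0.436, 123, 0.123, .34]
        System.out.println(Arrays.toString(extract(input)));
    }
}
```

The normalized string starts with a space (from the "M") and contains doubled spaces wherever garbage is directly followed by a minus sign, because both alternatives of the pattern match there in turn.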

On the other hand, if I were you, I'd do the loop instead:

Matcher m = Pattern.compile("[.-]?\\d+(\\.\\d+)?").matcher(input);
List<String> matches = new ArrayList<>();
while (m.find())
    matches.add(m.group());

You can do it in one line (but with one fewer step than aioobe's answer!):

String[] numbers = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34"
    .replaceAll("^[^.\\d-]+|[^.\\d-]+$", "") // remove junk from start/end
    .split("[^.\\d-]+"); // split on anything not part of a number

Although fewer calls are made, aioobe's answer is easier to read and understand, which makes his the better code.
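One thing worth checking with the one-liner above is the minus sign: since `-` is excluded from the separator class, it never triggers a split by itself. The sketch below prints what the posted code returns for the sample input, next to a variant that also splits at the empty position between a digit and a following `-`. That variant is an assumption added for illustration, not part of the answer, and the class and method names are hypothetical.

```java
import java.util.Arrays;

class OneLinerCheck {
    static final String INPUT = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34";

    // The one-liner as posted: '-' is not a separator, so it never causes a split.
    static String[] asPosted(String s) {
        return s.replaceAll("^[^.\\d-]+|[^.\\d-]+$", "")
                .split("[^.\\d-]+");
    }

    // Assumed variant: additionally split between a digit and a following '-'.
    static String[] withSignBoundary(String s) {
        return s.replaceAll("^[^.\\d-]+|[^.\\d-]+$", "")
                .split("[^.\\d-]+|(?<=\\d)(?=-)");
    }

    public static void main(String[] args) {
        // prints [0.286-3.099-0.44, -2.901, -0.436, 123, 0.123, .34]
        System.out.println(Arrays.toString(asPosted(INPUT)));
        // prints [0.286, -3.099, -0.44, -2.901, -0.436, 123, 0.123, .34]
        System.out.println(Arrays.toString(withSignBoundary(INPUT)));
    }
}
```

The lookbehind `(?<=\\d)` keeps the zero-width split from firing right after a garbage character, which would otherwise leave empty strings in the array.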

I think this is exactly what you want:

String pattern = "[-+]?[0-9]*\\.?[0-9]+";
String line = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(line);
List<String> numbers = new ArrayList<>();

while (m.find()) {
    numbers.add(m.group());
}
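Since the question defines a "number" as anything Float.parseFloat can parse, the matched strings can be converted right in the loop. A minimal sketch along those lines, reusing the pattern above; the class name `ParseMatches` and the method `extractFloats` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParseMatches {
    private static final Pattern NUMBER = Pattern.compile("[-+]?[0-9]*\\.?[0-9]+");

    // Finds every match of NUMBER and converts it with Float.parseFloat.
    static List<Float> extractFloats(String line) {
        List<Float> values = new ArrayList<>();
        Matcher m = NUMBER.matcher(line);
        while (m.find()) {
            values.add(Float.parseFloat(m.group()));
        }
        return values;
    }

    public static void main(String[] args) {
        String line = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34";
        // prints [0.286, -3.099, -0.44, -2.901, -0.436, 123.0, 0.123, 0.34]
        System.out.println(extractFloats(line));
    }
}
```

Every string this pattern can match is an optionally signed decimal with at least one trailing digit, so the parse step should not throw for its matches.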

It's nice you put a bounty on this.
Unfortunately, as you probably already know, this can't be done using Java's string split method directly.

If it can't be done directly, there is no reason to kludge it, as it is, well .. a kludge.

The reasons are many, some related, some not.

To start off, you need to define a good regex as a base.
This is the only regex I know of that will validate and extract a proper form:

 # "((?=[+-]?\\d*\\.?\\d)[+-]?\\d*\\.?\\d*)"

 (                             # (1 start)
      (?= [+-]? \d* \.? \d )
      [+-]? \d* \.? \d* 
 )                             # (1 end)

So, looking at this base regex, it's clear that the form it matches is the one you want.
With split, however, you don't want to match that form, because whatever the regex matches is where the breaks go.

As I look at Java's split, I see that no matter what it matches, the match will be excluded from the resulting array.

So, presuming split usage, the first thing to match (and consume) is all the stuff that is not this. That part will be something like this:

 (?:
      (?!
           (?= [+-]? \d* \.? \d )
           [+-]? \d* \.? \d* 
      )
      . 
 )+

Since the only thing left is valid decimal numbers, the next break will be somewhere between valid numbers. This part, added to the first part, will be something like this:

 (?:
      (?!
           (?= [+-]? \d* \.? \d )
           [+-]? \d* \.? \d* 
      )
      . 
 )+
 |         # or,
 (?<=
      (?= [+-]? \d* \.? \d )
      [+-]? \d* \.? \d* 
 )
 (?=
      (?= [+-]? \d* \.? \d )
      [+-]? \d* \.? \d* 
 )

And all of a sudden, we have a problem: a variable-length lookbehind assertion.
So, it's game over for the whole thing.

Lastly and unfortunately, Java does not (as far as I can see) have a provision to include capture group contents (matched in the regex) as elements in the resulting array.
Perl does, but I can't find that ability in Java.

If Java had that provision, the break subexpressions could be combined to do a seamless split.
Like this:

 (?:
      (?!
           (?= [+-]? \d* \.? \d )
           [+-]? \d* \.? \d* 
      )
      . 
 )*
 (
      (?= [+-]? \d* \.? \d )
      [+-]? \d* \.? \d* 
 )
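Although split cannot return capture groups, the same combined expression can be driven with a Matcher, reading group 1 on each hit. The following is a sketch under that assumption rather than a split-based solution; the class name `SkipThenCapture` and its members are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SkipThenCapture {
    // The base regex from above, without the outer capture group.
    static final String NUMBER = "(?=[+-]?\\d*\\.?\\d)[+-]?\\d*\\.?\\d*";

    // Skip every character that cannot start a number, then capture one number in group 1.
    static final Pattern SKIP_THEN_CAPTURE =
            Pattern.compile("(?:(?!" + NUMBER + ").)*(" + NUMBER + ")");

    static List<String> extract(String input) {
        List<String> numbers = new ArrayList<>();
        Matcher m = SKIP_THEN_CAPTURE.matcher(input);
        while (m.find()) {
            numbers.add(m.group(1)); // only the number, not the skipped prefix
        }
        return numbers;
    }

    public static void main(String[] args) {
        // prints [0.286, -3.099, -0.44, -2.901, -0.436, 123, 0.123, .34]
        System.out.println(extract("M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34"));
    }
}
```

Each call to find() consumes the garbage in front of a number together with the number itself, so the loop plays the role that a capture-returning split would play in Perl.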
