Extracting numbers from a String in Java by splitting on a regex

I want to extract numbers from Strings like this:

String numbers[] = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34".split(PATTERN);

From such a String, I'd like to extract these numbers:

  • 0.286
  • -3.099
  • -0.44
  • -2.901
  • -0.436
  • 123
  • 0.123
  • .34

That is:

  • There can be garbage characters like "M" or "c"
  • The "-" sign is part of the number, not a delimiter to split on
  • A "number" can be anything that Float.parseFloat can parse, so .34 is valid

What I have so far:

String PATTERN = "([^\\d.-]+)|(?=-)";

This works to some degree, but it is obviously far from perfect:

  • Doesn't skip the starting garbage "M" in the example
  • Doesn't handle consecutive garbage, like the ,,, in the middle

How can I fix PATTERN to make it work?
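To make these failure modes concrete, here is a minimal sketch that runs the split with the pattern above and prints every element it produces. The class name SplitAttempt and its split helper are placeholders for illustration only.

```java
// Illustrative only: shows what the current PATTERN produces with String.split.
class SplitAttempt {
    static final String PATTERN = "([^\\d.-]+)|(?=-)";

    static String[] split(String input) {
        return input.split(PATTERN);
    }

    public static void main(String[] args) {
        String input = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34";
        // Brackets make empty elements visible in the printed output.
        for (String part : split(input)) {
            System.out.println("[" + part + "]");
        }
    }
}
```

Empty elements show up where a garbage run sits at the very start of the input, and where a garbage run is directly followed by a "-": the garbage match and the zero-width (?=-) match are adjacent, so the substring between them is empty.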

You could use a regex like this:

([-.]?\d+(?:\.\d+)?)

Working demo

Match Information:

MATCH 1
1.  [1-6]   `0.286`
MATCH 2
1.  [6-12]  `-3.099`
MATCH 3
1.  [12-17] `-0.44`
MATCH 4
1.  [18-24] `-2.901`
MATCH 5
1.  [25-31] `-0.436`
MATCH 6
1.  [34-37] `123`
MATCH 7
1.  [38-43] `0.123`
MATCH 8
1.  [44-47] `.34`

Update

Jawee's approach

As Jawee pointed out in a comment, the regex above has a problem with input like .34.34, so you can use his regex, which fixes it. Thanks to Jawee for pointing that out.

(-?(?:\d+)?\.?\d+)

To get a graphical idea of what happens behind this regex, you can check this Debuggex image:

[Image: regular expression visualization]

Engine explanation:

1st Capturing group (-?(?:\d+)?\.?\d+)
   -? -> matches the character - literally, zero or one time
   (?:\d+)? -> \d+ matches a digit [0-9] one or more times (inside an optional non-capturing group)
   \.? -> matches the character . literally, zero or one time
   \d+ -> matches a digit [0-9] one or more times
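As a rough check of the difference, the sketch below runs both regexes from this answer against .34.34 in Java. The class name DotRunComparison and the findAll helper are made up for this illustration; the patterns are the two shown above, with backslashes doubled for Java string literals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotRunComparison {
    // The first regex from this answer and the revised one.
    static final Pattern ORIGINAL = Pattern.compile("[-.]?\\d+(?:\\.\\d+)?");
    static final Pattern REVISED = Pattern.compile("-?(?:\\d+)?\\.?\\d+");

    // Collects every match of the pattern, in order of appearance.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = ".34.34";
        System.out.println(findAll(ORIGINAL, input)); // one token: .34.34
        System.out.println(findAll(REVISED, input));  // two tokens: .34 and .34
    }
}
```

The single token .34.34 is not something Float.parseFloat can parse, which is why the revised regex is the safer choice here.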

Try this one: (-?(?:\\d+)?\\.?\\d+)
See the example below:

Demo Here

Thanks a lot to nhahtdh for the comments. That's true, so the regex can be updated as below:

[-+]?(?:\d+(?:\.\d*)?|\.\d+)

Updated Demo Here
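A small sketch of what the updated regex adds: it keeps a leading "+" and a trailing dot as part of the number, both of which Float.parseFloat accepts. The class name UpdatedFloatRegex, the findAll helper and the sample input are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UpdatedFloatRegex {
    // The updated regex, with backslashes doubled for a Java string literal.
    static final Pattern UPDATED = Pattern.compile("[-+]?(?:\\d+(?:\\.\\d*)?|\\.\\d+)");

    // Collects every match of the updated regex, in order of appearance.
    static List<String> findAll(String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = UPDATED.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical input covering a signed value, a trailing dot and a leading dot.
        for (String token : findAll("+5,24.,.5")) {
            System.out.println(token + " -> " + Float.parseFloat(token));
        }
    }
}
```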

Actually, if we take into account all the input formats that Float.parseFloat accepts (e.g. Infinity, -Infinity, 00, 0xffp23d, 88F), the regex gets a little more complicated. However, it can still be implemented, as in the Java code below:

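import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Building blocks for the float formats covered here
String sign = "[-+]?";
// Hexadecimal floating-point form such as 0xffp23d
String hexFloat = "(?>0[xX](((\\p{XDigit}+)\\.?)|((\\p{XDigit}*)\\.(\\p{XDigit}+)))[pP]([-+])?(\\p{Digit}+)[fFdD]?)";
String nan = "(?>NaN)";
String inf = "(?>Infinity)";

// Decimal form: digits with an optional fraction, or a leading dot followed by digits
String dig = "(?>\\d+(?:\\.\\d*)?|\\.\\d+)";
String exp = "(?:[eE][-+]?\\d+)?"; // optional exponent
String suf = "[fFdD]?";             // optional float/double suffix
String digFloat = "(?>" + dig + exp + suf + ")";

// The hex form is tried first so the leading 0 of 0x... is not taken as a decimal number
String wholeFloat = sign + "(?>" + hexFloat + "|" + nan + "|" + inf + "|" + digFloat + ")";

String s = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123d,.34d.34.34M24.NaNNaN,Infinity,-Infinity00,0xffp23d,88F";

Pattern floatPattern = Pattern.compile(wholeFloat);
Matcher matcher = floatPattern.matcher(s);
int i = 0;
// Print each matched token next to the value Float.parseFloat produces for it
while (matcher.find()) {
    String f = matcher.group();
    System.out.println(i++ + " : " + f + " --- " + Float.parseFloat(f));
}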

Then the output is as below:

0 : 0.286 --- 0.286
1 : -3.099 --- -3.099
2 : -0.44 --- -0.44
3 : -2.901 --- -2.901
4 : -0.436 --- -0.436
5 : 123 --- 123.0
6 : 0.123d --- 0.123
7 : .34d --- 0.34
8 : .34 --- 0.34
9 : .34 --- 0.34
10 : 24. --- 24.0
11 : NaN --- NaN
12 : NaN --- NaN
13 : Infinity --- Infinity
14 : -Infinity --- -Infinity
15 : 00 --- 0.0
16 : 0xffp23d --- 2.13909504E9
17 : 88F --- 88.0

Using the regex you crafted yourself you can solve it as follows:

String[] numbers = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34"
                          .replaceAll(PATTERN, " ")
                          .trim()
                          .split(" +");

On the other hand, if I were you, I'd do the loop instead:

Matcher m = Pattern.compile("[.-]?\\d+(\\.\\d+)?").matcher(input);
List<String> matches = new ArrayList<>();
while (m.find())
    matches.add(m.group());

You can do it in one line (but with one fewer step than aioobe's answer!):

String[] numbers = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34"
    .replaceAll("^[^.\\d-]+|[^.\\d-]+$", "") // remove junk from start/end
    .split("[^.\\d-]+|(?<=[.\\d])(?=-)"); // split on junk, or just before a "-" that directly follows a number

Although fewer calls are made, aioobe's answer is easier to read and understand, which makes his the better code.

I think this is exactly what you want:

String pattern = "[-+]?[0-9]*\\.?[0-9]+";
String line = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(line);
List<String> numbers=new ArrayList<String>();

while(m.find()) {
    numbers.add(m.group());         
}
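Since the goal is values that Float.parseFloat accepts, the collected strings can be converted in the same loop. This is only a sketch: the class name FloatParsing and the parseAll helper are hypothetical, and the regex is the one from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FloatParsing {
    static final Pattern NUMBER = Pattern.compile("[-+]?[0-9]*\\.?[0-9]+");

    // Finds every number in the line and parses it with Float.parseFloat.
    static List<Float> parseAll(String line) {
        List<Float> values = new ArrayList<>();
        Matcher m = NUMBER.matcher(line);
        while (m.find()) {
            values.add(Float.parseFloat(m.group()));
        }
        return values;
    }

    public static void main(String[] args) {
        String line = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34";
        System.out.println(parseAll(line));
    }
}
```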

It's nice that you put a bounty on this.
Unfortunately, as you probably already know, this can't be done using
Java's String split method directly.

If it can't be done directly, there is no reason to kludge it, because it would be, well, a kludge.

The reasons are many, some related, some not.

To start off, you need to define a good regex as a base.
This is the only regex I know that will validate and extract a proper form:

 # "((?=[+-]?\\d*\\.?\\d)[+-]?\\d*\\.?\\d*)"

 (                             # (1 start)
      (?= [+-]? \d* \.? \d )
      [+-]? \d* \.? \d* 
 )                             # (1 end)

So, looking at this base regex, it's clear that the form it matches is what you want to keep.
With split, however, the regex must not match this form, because whatever the regex
matches is where the breaks will be.

As I look at Java's split, I see that no matter what it matches, it will be excluded
from the resulting array.

So, presuming split usage, the first thing to match (and consume) is all the stuff that is not
this. That part will be something like this:

 (?:
      (?!
           (?= [+-]? \d* \.? \d )
           [+-]? \d* \.? \d* 
      )
      . 
 )+
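As a rough check of this first part on its own, the sketch below uses it as the split pattern in Java. The class name JunkOnlySplit and the constants NUMBER and JUNK are made up for this illustration; the regex is the one above with whitespace removed and backslashes doubled.

```java
class JunkOnlySplit {
    // A number: an optional sign, optional digits and an optional dot, with at least one digit.
    static final String NUMBER = "(?=[+-]?\\d*\\.?\\d)[+-]?\\d*\\.?\\d*";
    // Junk: one or more characters at which no number starts.
    static final String JUNK = "(?:(?!" + NUMBER + ").)+";

    static String[] split(String input) {
        return input.split(JUNK);
    }

    public static void main(String[] args) {
        String input = "M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34";
        // Brackets make empty elements visible in the printed output.
        for (String part : split(input)) {
            System.out.println("[" + part + "]");
        }
    }
}
```

Numbers that touch each other, such as 0.286-3.099-0.44, stay glued together in one element, because there is no junk between them to split on.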

Since the only thing left is valid decimal numbers, the next break will be somewhere
between valid numbers. This part, added to the first part, will be something like this:

 (?:
      (?!
           (?= [+-]? \d* \.? \d )
           [+-]? \d* \.? \d* 
      )
      . 
 )+
 |         # or,
 (?<=
      (?= [+-]? \d* \.? \d )
      [+-]? \d* \.? \d* 
 )
 (?=
      (?= [+-]? \d* \.? \d )
      [+-]? \d* \.? \d* 
 )

And all of a sudden, we have a problem: a variable-length lookbehind assertion.
So, it's game over for the whole thing.

Lastly and unfortunately, Java does not (as far as I can see) have a provision to include capture
group contents (matched in the regex) as an element in the resulting array.
Perl does, but I can't find that ability in Java.

If Java had that provision, the break sub expressions could be combined to do a seamless split.
Like this:

 (?:
      (?!
           (?= [+-]? \d* \.? \d )
           [+-]? \d* \.? \d* 
      )
      . 
 )*
 (
      (?= [+-]? \d* \.? \d )
      [+-]? \d* \.? \d* 
 )
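Although split cannot return the captured group, the same combined expression can be driven with a Matcher loop, reading group 1 on every match. This is a sketch under that assumption rather than a split-based solution; the class name JunkThenNumber and the extract helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JunkThenNumber {
    // The base number form from above, with whitespace removed.
    static final String NUMBER = "(?=[+-]?\\d*\\.?\\d)[+-]?\\d*\\.?\\d*";
    // Optional junk, followed by one number captured in group 1.
    static final Pattern JUNK_THEN_NUMBER =
            Pattern.compile("(?:(?!" + NUMBER + ").)*(" + NUMBER + ")");

    // Returns the contents of group 1 for every match, in order of appearance.
    static List<String> extract(String input) {
        List<String> numbers = new ArrayList<>();
        Matcher m = JUNK_THEN_NUMBER.matcher(input);
        while (m.find()) {
            numbers.add(m.group(1));
        }
        return numbers;
    }

    public static void main(String[] args) {
        System.out.println(extract("M0.286-3.099-0.44c-2.901,-0.436,,,123,0.123,.34"));
    }
}
```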
