
Java extract text using regex

I am trying to extract the substring James Hetfield from the following string using Java regex:

music works | with | composer | James Hetfield (musician)

I got started with this code, but this does not work. I am not sure what I am missing:

final Pattern pattern = Pattern.compile("| (.+?) (musician)");
final Matcher matcher = pattern.matcher("music works | with | composer | James Hetfield (musician)");
matcher.find();
System.out.println(matcher.group(1)); // Prints String I want to extract

Thoughts?

  1. Since you used ( and ) to create groups, I assume you know that parentheses are special characters in regex. But special characters do not match their literal counterparts in the text: note that (.*) does not require the matched text to start and end with parentheses.

    To make special characters match literally, you need to escape them. You can do this in several ways:

    • by adding \ before them (which needs to be written in a Java String literal as "\\"),
    • or, for most special characters, by surrounding them with [ ] to create a character class that contains only that one character.

    Similarly, | is a special character in regex that represents the OR operator, so you also need to escape it.

  2. Another issue is that .+?, despite being reluctant, in a pattern like \\| (.+?) \\(musician\\) will start matching from the first | found, which means it can also accept other | characters until (musician) is found. In other words, such a regex would find this part:

     music works | with | composer | James Hetfield (musician)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    So, to prevent other pipes ( | ) from being accepted between the one you match and (musician), use [^|] instead of . : a character class that accepts any character except | .
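Both pitfalls can be checked with a small sketch. The class name RegexPitfalls and the helper firstGroup are made up for illustration; the input is the string from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexPitfalls {
    // Hypothetical helper: group 1 of the first match, or null if there is
    // no match or group 1 did not take part in it.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "music works | with | composer | James Hetfield (musician)";

        // Unescaped: '|' means OR, so the empty left alternative matches at
        // index 0 and group 1 is never used.
        System.out.println(firstGroup("| (.+?) (musician)", input)); // null

        // Escaped, but '.' also accepts '|', so the match begins at the first pipe.
        System.out.println(firstGroup("\\| (.+?) \\(musician\\)", input));
        // with | composer | James Hetfield
    }
}
```

The first call shows why the original code printed nothing useful, and the second shows why escaping alone is not enough.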

So try with this pattern:

final Pattern pattern = Pattern.compile("\\| ([^|]+) \\(musician\\)");
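Dropped into the code from the question, that pattern might be used as in the minimal sketch below. The class name ExtractMusician and the extract method are illustrative only. The find() check is an addition, so that an input without a match returns null instead of group(1) throwing IllegalStateException.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractMusician {
    // Literal pipe, a space, one or more non-pipe characters (group 1),
    // a space, then the literal text "(musician)".
    private static final Pattern PATTERN =
            Pattern.compile("\\| ([^|]+) \\(musician\\)");

    // Illustrative helper: the captured name, or null when nothing matches.
    static String extract(String input) {
        Matcher matcher = PATTERN.matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "music works | with | composer | James Hetfield (musician)";
        System.out.println(extract(input)); // James Hetfield
    }
}
```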

UPDATE:

If the part your regex should match might not have | before it (say, it is at the start of your text), you can make the \\| part optional by surrounding it with parentheses and adding ? after it. You can also make it a non-capturing group, so that ([^|]+) remains the group with index 1 and your code stays the same (you will not have to change matcher.group(1) to matcher.group(2)).

So you can try with

final Pattern pattern = Pattern.compile("(?:\\| )?([^|]+) \\(musician\\)");
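A quick way to see the optional prefix at work is to run the same pattern against the original input and against an input with no pipe at all. This is only a sketch: the class name OptionalPipeDemo, the extract helper, and the second input string are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OptionalPipeDemo {
    // (?:\| )? is a non-capturing, optional "| " prefix, so the name is
    // still captured as group 1.
    private static final Pattern PATTERN =
            Pattern.compile("(?:\\| )?([^|]+) \\(musician\\)");

    // Illustrative helper: the captured name, or null when nothing matches.
    static String extract(String input) {
        Matcher matcher = PATTERN.matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Input from the question, with pipes before the name.
        System.out.println(extract(
                "music works | with | composer | James Hetfield (musician)")); // James Hetfield
        // Made-up input where the name is at the start of the text.
        System.out.println(extract("James Hetfield (musician)")); // James Hetfield
    }
}
```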

([a-zA-Z](?:[a-zA-Z ]*))(?=\(musician\))

You can try this as well. Grab the capture group. See the demo:

http://regex101.com/r/vR4fY4/19
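In Java, the lookahead variant could be tried roughly as follows. The class name LookaheadDemo and the extract helper are illustrative. Note that [a-zA-Z ]* also consumes the space before (musician), so group 1 comes back with a trailing space; the trim() call in this sketch is an addition to remove it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadDemo {
    // A letter followed by letters or spaces (group 1), which must be
    // directly followed by "(musician)". The lookahead consumes nothing.
    private static final Pattern PATTERN =
            Pattern.compile("([a-zA-Z](?:[a-zA-Z ]*))(?=\\(musician\\))");

    // Illustrative helper: the trimmed capture, or null when nothing matches.
    static String extract(String input) {
        Matcher matcher = PATTERN.matcher(input);
        return matcher.find() ? matcher.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String input = "music works | with | composer | James Hetfield (musician)";
        System.out.println(extract(input)); // James Hetfield
    }
}
```

Because the character class allows only ASCII letters and spaces, this variant would miss names containing digits, hyphens, or accented characters, which the [^|]+ approach accepts.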


 