
Regex to match book name — group

I have a regex which I wrote:

value='[A-Za-z]+\\,[0-9]+\\,([A-Za-z0-9]+)\\,([A-Za-z0-9]+)'>[A-Za-z0-9]+\\s-\\s(.*)?\\s\\(

It works fairly well, but the very end of it keeps matching everything.

For example, it is supposed to work on books and I'm testing it on the following:

value='C,201301,F110,JEWL1050'>JEWL1050 - Industry Skills I (F110)</option>
value='C,201301,F114,JEWL1050'>JEWL1050 - Industry Skills I (F114)</option>
value='C,201301,F114,JEWL1054'>JEWL1054 - Jewellery Rendering & Illustra (F114)</option>
value='C,201301,F110,JEWL2029'>JEWL2029 - Production Techniques B (F110)</option>
value='C,201301,F114,JEWL2029'>JEWL2029 - Production Techniques B (F114)</option>
value='C,201301,LIAD,LANG9066'>LANG9066 - Italian For Beginners (LIAD)</option>
value='C,201301,T302,LAW1151'>LAW1151 - Canandian & Environmental Law (T302)</option>
value='C,201301,T305,LAW1151'>LAW1151 - Canandian & Environmental Law (T305)</option>
value='C,201301,F402,LAW1152'>LAW1152 - International Law & Agreements (F402)</option>
value='C,201301,T302,LAW3201'>LAW3201 - Protection Legislation (T302)</option>
value='C,201301,T303,LAW3201'>LAW3201 - Protection Legislation (T303)</option>
value='C,201301,T304,LAW3201'>LAW3201 - Protection Legislation (T304)</option>

So for the first book, it should capture F110 as group 1, JEWL1050 as group 2, and Industry Skills I as group 3.

However, it captures the first two groups correctly but not the last one. Instead, it captures - Industry Skills I (F110)</option>.

Any ideas how I can fix my regex? I can't seem to get it to capture the last group at all. Please help me. Thank you in advance.

In theory, that should be working as-is.

Here's your proposed regex (with \\ changed to \, since the tool takes a raw pattern rather than a Java string literal) applied to your sample input: http://regex101.com/r/hL8pZ8

This tool provides a "Java" checkbox as well, and even generates the corresponding Java code. There's no permalink, though, so you'll have to enter your regex (again with \ instead of \\) and sample data yourself: http://www.myregextester.com/index.php

That said, for posterity, here's its output:

Raw Match Pattern:

  value='[A-Za-z]+\,[0-9]+\,([A-Za-z0-9]+)\,([A-Za-z0-9]+)'>[A-Za-z0-9]+\s-\s(.*)?\s\(

Java Code Example:

import java.util.regex.Pattern;
import java.util.regex.Matcher;
class Module1{
  public static void main(String[] asd){
    String sourcestring = "source string to match with pattern";
    Pattern re = Pattern.compile("value='[A-Za-z]+\\,[0-9]+\\,([A-Za-z0-9]+)\\,([A-Za-z0-9]+)'>[A-Za-z0-9]+\\s-\\s(.*)?\\s\\(");
    Matcher m = re.matcher(sourcestring);
    int mIdx = 0;
    while (m.find()){
      for (int groupIdx = 0; groupIdx < m.groupCount()+1; groupIdx++){
        System.out.println( "[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}

$matches Array:
(
  [0] => Array
    (
      [0] => value='C,201301,F110,JEWL1050'>JEWL1050 - Industry Skills I (
      [1] => value='C,201301,F114,JEWL1050'>JEWL1050 - Industry Skills I (
      [2] => value='C,201301,F114,JEWL1054'>JEWL1054 - Jewellery Rendering & Illustra (
      [3] => value='C,201301,F110,JEWL2029'>JEWL2029 - Production Techniques B (
      [4] => value='C,201301,F114,JEWL2029'>JEWL2029 - Production Techniques B (
      [5] => value='C,201301,LIAD,LANG9066'>LANG9066 - Italian For Beginners (
      [6] => value='C,201301,T302,LAW1151'>LAW1151 - Canandian & Environmental Law (
      [7] => value='C,201301,T305,LAW1151'>LAW1151 - Canandian & Environmental Law (
      [8] => value='C,201301,F402,LAW1152'>LAW1152 - International Law & Agreements (
      [9] => value='C,201301,T302,LAW3201'>LAW3201 - Protection Legislation (
      [10] => value='C,201301,T303,LAW3201'>LAW3201 - Protection Legislation (
      [11] => value='C,201301,T304,LAW3201'>LAW3201 - Protection Legislation (
    )

  [1] => Array
    (
      [0] => F110
      [1] => F114
      [2] => F114
      [3] => F110
      [4] => F114
      [5] => LIAD
      [6] => T302
      [7] => T305
      [8] => F402
      [9] => T302
      [10] => T303
      [11] => T304
    )

  [2] => Array
    (
      [0] => JEWL1050
      [1] => JEWL1050
      [2] => JEWL1054
      [3] => JEWL2029
      [4] => JEWL2029
      [5] => LANG9066
      [6] => LAW1151
      [7] => LAW1151
      [8] => LAW1152
      [9] => LAW3201
      [10] => LAW3201
      [11] => LAW3201
    )

  [3] => Array
    (
      [0] => Industry Skills I
      [1] => Industry Skills I
      [2] => Jewellery Rendering & Illustra
      [3] => Production Techniques B
      [4] => Production Techniques B
      [5] => Italian For Beginners
      [6] => Canandian & Environmental Law
      [7] => Canandian & Environmental Law
      [8] => International Law & Agreements
      [9] => Protection Legislation
      [10] => Protection Legislation
      [11] => Protection Legislation
    )
)
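To reproduce that output locally rather than in the online tool, here is a minimal sketch that runs the pattern from the question against two of the sample lines. The class name OriginalPatternCheck and the helper extract are made up for illustration; only the pattern and the input lines come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original post.
class OriginalPatternCheck {
    // The pattern from the question as a Java string literal
    // (every regex backslash is doubled).
    static final Pattern BOOK = Pattern.compile(
        "value='[A-Za-z]+\\,[0-9]+\\,([A-Za-z0-9]+)\\,([A-Za-z0-9]+)'>"
        + "[A-Za-z0-9]+\\s-\\s(.*)?\\s\\(");

    // Returns "group1|group2|group3" for every match found in the input.
    static List<String> extract(String input) {
        List<String> rows = new ArrayList<>();
        Matcher m = BOOK.matcher(input);
        while (m.find()) {
            rows.add(m.group(1) + "|" + m.group(2) + "|" + m.group(3));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Two lines from the sample data, separated by newlines.
        String input =
            "value='C,201301,F110,JEWL1050'>JEWL1050 - Industry Skills I (F110)</option>\n"
            + "value='C,201301,T302,LAW1151'>LAW1151 - Canandian & Environmental Law (T302)</option>\n";
        extract(input).forEach(System.out::println);
    }
}
```

With each option on its own line, the greedy (.*) cannot run past the end of the line, because . does not match a line terminator by default, so it backtracks to the only " (" on that line.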

Here's an alternative regular expression for this, using negated character classes instead of explicit alphanumeric ranges.

value='(?:[^,]+,){2}([^,]+),([^,]+)'>[^-]+-\s+([^(]+)(?=\s)
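In Java source the backslashes in this pattern have to be doubled. A minimal sketch of how it might be used, assuming each option has the same shape as the sample data (the class name NegatedClassRegex and the helper parse are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original post.
class NegatedClassRegex {
    // Same pattern as above, escaped for a Java string literal.
    static final Pattern BOOK = Pattern.compile(
        "value='(?:[^,]+,){2}([^,]+),([^,]+)'>[^-]+-\\s+([^(]+)(?=\\s)");

    // Returns {section, course code, title} for the first match, or null if none.
    static String[] parse(String option) {
        Matcher m = BOOK.matcher(option);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] groups = parse(
            "value='C,201301,LIAD,LANG9066'>LANG9066 - Italian For Beginners (LIAD)</option>");
        if (groups != null) {
            System.out.println(String.join(" | ", groups));
        }
    }
}
```

The trailing lookahead (?=\s) is what trims the title: [^(]+ first consumes the space before the opening parenthesis, then backs off one character so that the capture ends just before that space.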


The C,201301 prefix isn't needed, and the course code and section both appear again in the visible text. So a simple solution would be to treat everything between < and > as junk and focus only on the text between > and <:

<option value='C,201301,T302,LAW3201'>LAW3201 - Protection Legislation (T302)</option>
<option value='C,201301,T303,LAW3201'>LAW3201 - Protection Legislation (T303)</option>
<option value='C,201301,T304,LAW3201'>LAW3201 - Protection Legislation (T304)</option>

Which would suggest:

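>([A-Z]+[0-9]+)\\s-\\s(.*?)\\s\\(([A-Z0-9]+)\\)<

as a sufficient expression for the three groups. Note that the groups come out in a different order here: the course code is group 1, the title is group 2, and the section is group 3.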
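A minimal Java sketch of this text-only approach follows. It assumes the section code is always wrapped in literal parentheses right before the closing tag, as in the sample data, so the pattern used here escapes those parentheses and keeps the + inside the first group. The class name TextOnlyRegex and the helper parse are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original post.
class TextOnlyRegex {
    // Matches only the text between '>' and '<':
    //   group 1 = course code, group 2 = title, group 3 = section.
    static final Pattern BOOK = Pattern.compile(
        ">([A-Z]+[0-9]+)\\s-\\s(.*?)\\s\\(([A-Z0-9]+)\\)<");

    // Returns {course code, title, section} for the first match, or null if none.
    static String[] parse(String option) {
        Matcher m = BOOK.matcher(option);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] groups = parse(
            "<option value='C,201301,T302,LAW3201'>LAW3201 - Protection Legislation (T302)</option>");
        if (groups != null) {
            System.out.println(String.join(" | ", groups));
        }
    }
}
```

The lazy (.*?) stops at the first " (SECTION)<" it can reach, so the title is not swallowed together with the section.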
