
Extract by regex in Java

I wrote a Java program that uses a regular expression to extract several pieces of information.

The code aims to extract values from a text file (the text after each = and before the next | sign). These values are located inside a {{cite book ...}} template.

My Code:

    final String regex = "(?:\\{\\{cite book\\b[^|]*|\\G(?!^))(?=[^}]*}})\\|([^=]+)=([^|}]+)";

    final Pattern pattern1 = Pattern.compile(regex);
    final Matcher matcher1 = pattern1.matcher(wikifile);
    System.out.println("+++++++++++++++++++++++++++++++++++++++++++++++++");
    System.out.println("\n BOOK: \n ");

    while (matcher1.find()) {
        // Field name to the left of '=', without surrounding whitespace
        final String key = matcher1.group(1).trim();

        // Print a separator before each title
        if (key.equals("title")) System.out.println("\n----------------------\n");

        // Print only the fields of interest
        if (key.equals("title") || key.equals("first") || key.equals("last") || key.equals("author")
                || key.equals("url") || key.equals("publisher") || key.equals("isbn")) {
            System.out.println(matcher1.group(1) + " = " + matcher1.group(2));
        }
    }

It works well when the fields are spread over several lines, but when the whole template is on one long line it does not extract all the information I want, and I do not know why.

For example:

{{Cite book|url=https://books.google.es/books?id=HuSQGrRY7F4C|title=Ajax Black Book, New Edition (With Cd)|last=Kogent Solutions Inc|first =|publisher = Dreamtech Press|year=2008|isbn=978-8177228380|location=|pages =40}}

I want to extract url, title, last, first, publisher, and isbn, but the output is:

 BOOK: 

url = https://books.google.es/books?id=husqgrry7f4c

----------------------

title = ajax black book, new edition (with cd)
last = kogent solutions inc

When the input looks like this:

 {{Cite book
|url=https://books.google.es/books?id=HuSQGrRY7F4C
|title=Ajax Black Book, New Edition (With Cd)
|last=Kogent Solutions Inc
|first =
|publisher = Dreamtech Press
|year=2008
|isbn=978-817722838
|location=
|pages =40}} </ref>

the output looks like this:

 BOOK: 

url = https://books.google.es/books?id=husqgrry7f4c


----------------------

title = ajax black book, new edition (with cd)

last = kogent solutions inc

first  = 

publisher  =  dreamtech press

isbn = 978-817722838

last  =  flanagan

first  =  david

Update: I think the problem is in the pattern (regex). It seems to appear when a field is empty, that is, when there is nothing between = and | (as in first=| or location=|) and the template is on one line, but I am not sure.
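The suspicion about empty fields can be checked directly. In the question's pattern the value group is `([^|}]+)`, which needs at least one character; for `first =|` there is none, so the `\G` chain stops at that field. In the multi-line input the newline before the next `|` counts as that one character, which would explain why it works there. A minimal sketch, assuming the only change wanted is to accept empty values (the class name `CiteBookEmptyValues` and the `extract` helper are made up for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CiteBookEmptyValues {
    // Same pattern as in the question, except that the value group ends in '*'
    // instead of '+', so an empty value such as "first =|" still matches.
    static final Pattern FIELD = Pattern.compile(
            "(?:\\{\\{cite book\\b[^|]*|\\G(?!^))(?=[^}]*}})\\|([^=]+)=([^|}]*)");

    // Hypothetical helper: collects every key/value pair in order of appearance.
    static Map<String, String> extract(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(text);
        while (m.find()) {
            fields.put(m.group(1).trim(), m.group(2).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        // Lower-cased single-line sample, as in the question's output.
        String oneLine = "{{cite book|url=https://books.google.es/books?id=husqgrry7f4c"
                + "|title=ajax black book, new edition (with cd)|last=kogent solutions inc"
                + "|first =|publisher = dreamtech press|year=2008|isbn=978-8177228380"
                + "|location=|pages =40}}";
        extract(oneLine).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

With this change the single-line sample yields all nine fields, with `first` and `location` mapped to empty strings. The sketch keeps the question's lower-case `cite book` prefix, so it assumes the text was lower-cased beforehand.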

2. Is there a way to select the fields (url, title, publisher, etc.) in the regex pattern itself instead of using .group(1).trim().equals("title")?
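One possible way to move the field selection into the pattern is to list the wanted names as an alternation. Because an unwanted field such as `year` would break a `\G` chain, the sketch below uses two passes instead: one pattern captures the body of each template, a second one finds only the wanted fields inside it. The class name `CiteBookSelectedFields`, the `extract` helper, and both patterns are assumptions for illustration; they also assume that values contain neither `|` nor `}`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CiteBookSelectedFields {
    // Pass 1: the body of each {{cite book ...}} template (no '}' allowed inside).
    static final Pattern TEMPLATE = Pattern.compile(
            "\\{\\{\\s*cite book\\b([^}]*)\\}\\}", Pattern.CASE_INSENSITIVE);

    // Pass 2: only the wanted field names; the value runs up to the next '|'.
    static final Pattern WANTED = Pattern.compile(
            "\\|\\s*(title|first|last|author|url|publisher|isbn)\\s*=\\s*([^|]*)",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: one map of selected fields per template found.
    static List<Map<String, String>> extract(String text) {
        List<Map<String, String>> books = new ArrayList<>();
        Matcher template = TEMPLATE.matcher(text);
        while (template.find()) {
            Map<String, String> fields = new LinkedHashMap<>();
            Matcher field = WANTED.matcher(template.group(1));
            while (field.find()) {
                fields.put(field.group(1).toLowerCase(Locale.ROOT), field.group(2).trim());
            }
            books.add(fields);
        }
        return books;
    }

    public static void main(String[] args) {
        String text = "{{Cite book|url=https://books.google.es/books?id=HuSQGrRY7F4C"
                + "|title=Ajax Black Book, New Edition (With Cd)|last=Kogent Solutions Inc"
                + "|first =|publisher = Dreamtech Press|year=2008|isbn=978-8177228380"
                + "|location=|pages =40}}";
        for (Map<String, String> book : extract(text)) {
            book.forEach((k, v) -> System.out.println(k + " = " + v));
        }
    }
}
```

Since each name must be followed by optional whitespace and `=`, a parameter such as `last1` is not mistaken for `last`. Keeping the names in a `Set<String>` and filtering after a generic match would work as well and may be easier to maintain when the list grows.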

thank you

Last update

A regex that matches the {{Cite book prefix and the key=value pairs separated by the pipe character '|':

(?i:(?<=^|\|)(\{\{Cite\s*book\s*)|([^{|}=]+)\s*=\s*([^{|}]*[ ]*))

The following code demonstrates this regexp:

static final int PREFIX_GROUP = 1;
static final int FIELD_NAME_GROUP = 2;
static final int FIELD_VALUE_GROUP = 3;

// .....
// '=' needs no escaping; "\=" is not a valid escape in a Java string literal
String regex = "(?i:(?<=^|\\|)(\\{\\{Cite\\s*book\\s*)|([^{|}=]+)\\s*=\\s*([^{|}]*[ ]*))";
Pattern pattern = Pattern.compile(regex);

String txt = "{{cite book\n | url=https://books.google.es/books?id=HuSQGrRY7F4C\n | \"title\"=Ajax Black Book, New Edition (With Cd)\n | 'last'=Kogent Solutions Inc | fir$$t =| publisher = Dreamtech Press\n|editor_1= \"William Gates III, Jr.\" |some.dashed-field=TestDot.NET|year=2008\n|  isbn=978-8177228380\n|location=\n|key_w/o_value|pages =40|}}";

Matcher match = pattern.matcher(txt);
while (match.find()) {
    if (match.group(PREFIX_GROUP) != null) {
        System.out.println("prefix: " + match.group(PREFIX_GROUP).trim());
    }
    if (match.group(FIELD_NAME_GROUP) != null) {
        String key   = match.group(FIELD_NAME_GROUP).trim();
        String value = match.group(FIELD_VALUE_GROUP).trim();
        System.out.println(key + " = " + value);
    }
}

and produces output:

prefix: {{cite book
url = https://books.google.es/books?id=HuSQGrRY7F4C
"title" = Ajax Black Book, New Edition (With Cd)
'last' = Kogent Solutions Inc
fir$$t = 
publisher = Dreamtech Press
editor_1 = "William Gates III, Jr."
some.dashed-field = TestDot.NET
year = 2008
isbn = 978-8177228380
location = 
pages = 40
