I wrote a Java program that uses a regular expression to extract several pieces of information.
The code extracts values from a text file: each value sits after an = sign and before the next | sign, inside a {{cite book ...}} template.
My code:
final String regex = "(?:\\{\\{cite book\\b[^|]*|\\G(?!^))(?=[^}]*}})\\|([^=]+)=([^|}]+)";
final Pattern pattern1 = Pattern.compile(regex);
final Matcher matcher1 = pattern1.matcher(wikifile);
System.out.println("+++++++++++++++++++++++++++++++++++++++++++++++++");
System.out.println("\n BOOK: \n ");
while (matcher1.find()) {
    // group(1) is the field name, group(2) is the field value
    final String key = matcher1.group(1).trim();
    if (key.equals("title")) System.out.println("\n----------------------\n");
    if (key.equals("title") || key.equals("first") || key.equals("last") || key.equals("author")
            || key.equals("url") || key.equals("publisher") || key.equals("isbn")) {
        System.out.println(matcher1.group(1) + " = " + matcher1.group(2));
    }
}
It works well when the fields are on separate lines, but when the whole template is on one long line it does not extract all the information I want, and I do not know why.
For example, with this input:
{{Cite book|url=https://books.google.es/books?id=HuSQGrRY7F4C|title=Ajax Black Book, New Edition (With Cd)|last=Kogent Solutions Inc|first =|publisher = Dreamtech Press|year=2008|isbn=978-8177228380|location=|pages =40}}
I want to extract url, title, last, first, publisher and isbn,
but the output is:
BOOK:
url = https://books.google.es/books?id=husqgrry7f4c
----------------------
title = ajax black book, new edition (with cd)
last = kogent solutions inc
When the input looks like this:
{{Cite book
|url=https://books.google.es/books?id=HuSQGrRY7F4C
|title=Ajax Black Book, New Edition (With Cd)
|last=Kogent Solutions Inc
|first =
|publisher = Dreamtech Press
|year=2008
|isbn=978-817722838
|location=
|pages =40}} </ref>
the output looks like this:
BOOK:
url = https://books.google.es/books?id=husqgrry7f4c
----------------------
title = ajax black book, new edition (with cd)
last = kogent solutions inc
first =
publisher = dreamtech press
isbn = 978-817722838
last = flanagan
first = david
Update: I think the problem is in the pattern (regex). It seems to fail when a value is empty, that is, when there is nothing between = and |, as in first=| or location=|, and the template is on one line. I don't know why.
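A likely explanation, sketched here rather than confirmed against the full wikifile: the value group ([^|}]+) needs at least one character. In the multi-line input the newline after "first =" satisfies it, but in the one-line input "first =" is followed directly by |, so the match fails there and the \G chain cannot continue. The sketch below keeps the original pattern and changes only the two quantified groups; the class name CiteBookEmptyValues, the parse helper and the CASE_INSENSITIVE flag are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CiteBookEmptyValues {
    // Same structure as the pattern in the question, with two changes:
    //   key:   [^=]+  becomes [^=|}]+  (a key may not run across a pipe)
    //   value: [^|}]+ becomes [^|}]*   (a value may be empty, as in "first =|")
    static final Pattern FIELD = Pattern.compile(
            "(?:\\{\\{cite book\\b[^|]*|\\G(?!^))(?=[^}]*}})\\|([^=|}]+)=([^|}]*)",
            Pattern.CASE_INSENSITIVE);

    // Collects every key=value pair of the template, in order of appearance.
    static Map<String, String> parse(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(text);
        while (m.find()) {
            fields.put(m.group(1).trim(), m.group(2).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        String oneLine = "{{Cite book|url=https://books.google.es/books?id=HuSQGrRY7F4C"
                + "|title=Ajax Black Book, New Edition (With Cd)|last=Kogent Solutions Inc"
                + "|first =|publisher = Dreamtech Press|year=2008|isbn=978-8177228380"
                + "|location=|pages =40}}";
        parse(oneLine).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

With this change the empty first and location fields are returned as empty strings instead of ending the match.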
Second question: is there a way to extract the fields (url, title, publisher, etc.) with the regex pattern itself, instead of using .group(1).trim().equals("title")?
Thank you.
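One possible way to do this, as a minimal sketch: list the wanted field names as an alternation inside the key group, so that only those keys can match. The sketch assumes the input string is the text of a single cite book template with no nested templates or links containing |; the class name CiteBookSelectedFields and the extract helper are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CiteBookSelectedFields {
    // Group 1 can only be one of the listed field names; any other
    // key=value pair (year, location, pages, ...) is simply not matched.
    static final Pattern WANTED = Pattern.compile(
            "\\|\\s*(title|first|last|author|url|publisher|isbn)\\s*=\\s*([^|}]*)",
            Pattern.CASE_INSENSITIVE);

    // Returns "key = value" lines for the wanted fields, in input order.
    static List<String> extract(String citeTemplate) {
        List<String> out = new ArrayList<>();
        Matcher m = WANTED.matcher(citeTemplate);
        while (m.find()) {
            out.add(m.group(1) + " = " + m.group(2).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String oneLine = "{{Cite book|url=https://books.google.es/books?id=HuSQGrRY7F4C"
                + "|title=Ajax Black Book, New Edition (With Cd)|last=Kogent Solutions Inc"
                + "|first =|publisher = Dreamtech Press|year=2008|isbn=978-8177228380"
                + "|location=|pages =40}}";
        extract(oneLine).forEach(System.out::println);
    }
}
```

The field list lives in one place in the pattern, so the chain of equals calls is no longer needed.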
Last update
Regexp to search data with the prefix {{Cite book only, and select multiple key=value pairs separated by the pipe '|' character:
(?i:(?<=^|\|)(\{\{Cite\s*book\s*)|([^{|}\=]+)\s*\=\s*([^{|}]*[ ]*))
The following code demonstrates this regexp:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

static final int PREFIX_GROUP = 1;
static final int FIELD_NAME_GROUP = 2;
static final int FIELD_VALUE_GROUP = 3;
// .....
// Note "\\=" rather than "\=": "\=" is not a valid escape in a Java string literal.
String regex = "(?i:(?<=^|\\|)(\\{\\{Cite\\s*book\\s*)|([^{|}\\=]+)\\s*\\=\\s*([^{|}]*[ ]*))";
Pattern pattern = Pattern.compile(regex);
String txt = "{{cite book\n | url=https://books.google.es/books?id=HuSQGrRY7F4C\n | \"title\"=Ajax Black Book, New Edition (With Cd)\n | 'last'=Kogent Solutions Inc | fir$$t =| publisher = Dreamtech Press\n|editor_1= \"William Gates III, Jr.\" |some.dashed-field=TestDot.NET|year=2008\n| isbn=978-8177228380\n|location=\n|key_w/o_value|pages =40|}}";
Matcher match = pattern.matcher(txt);
while (match.find()) {
    // Prefix alternative matched: group 1 is set, groups 2 and 3 are null.
    if (match.group(PREFIX_GROUP) != null) {
        System.out.println("prefix: " + match.group(PREFIX_GROUP).trim());
    }
    // key=value alternative matched: groups 2 and 3 are set (the value may be empty).
    if (match.group(FIELD_NAME_GROUP) != null) {
        String key = match.group(FIELD_NAME_GROUP).trim();
        String value = match.group(FIELD_VALUE_GROUP).trim();
        System.out.println(key + " = " + value);
    }
}
and produces output:
prefix: {{cite book
url = https://books.google.es/books?id=HuSQGrRY7F4C
"title" = Ajax Black Book, New Edition (With Cd)
'last' = Kogent Solutions Inc
fir$$t =
publisher = Dreamtech Press
editor_1 = "William Gates III, Jr."
some.dashed-field = TestDot.NET
year = 2008
isbn = 978-8177228380
location =
pages = 40
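Note that in the pattern above the lookbehind belongs only to the prefix alternative, so the key=value alternative can also match text outside a {{Cite book ...}} template. When a file holds several templates mixed with other text, one option is to parse in two stages: first find each whole template, then read the fields inside it. The following is a sketch under assumptions, not part of the code above: the names CiteBookTwoStage, TEMPLATE, FIELD and parseAll are hypothetical, and it assumes that templates contain no nested braces and that values contain no | characters.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CiteBookTwoStage {
    // Stage 1: one whole {{cite book ...}} template; group 1 is its body.
    // Assumes the body contains no '{' or '}' characters.
    static final Pattern TEMPLATE = Pattern.compile(
            "\\{\\{\\s*cite\\s+book\\b([^{}]*)\\}\\}", Pattern.CASE_INSENSITIVE);

    // Stage 2: a |key=value pair inside the body; the value may be empty.
    static final Pattern FIELD = Pattern.compile("\\|\\s*([^=|]+?)\\s*=\\s*([^|]*)");

    // Returns one ordered key-to-value map per template found in the text.
    static List<Map<String, String>> parseAll(String text) {
        List<Map<String, String>> books = new ArrayList<>();
        Matcher t = TEMPLATE.matcher(text);
        while (t.find()) {
            Map<String, String> fields = new LinkedHashMap<>();
            Matcher f = FIELD.matcher(t.group(1));
            while (f.find()) {
                fields.put(f.group(1), f.group(2).trim());
            }
            books.add(fields);
        }
        return books;
    }

    public static void main(String[] args) {
        String text = "intro a=b {{Cite book|title=A|first =|last=Flanagan}} "
                + "mid {{cite book\n|title=B\n|year=2008}} <ref name=x>";
        for (Map<String, String> book : parseAll(text)) {
            System.out.println("BOOK:");
            book.forEach((k, v) -> System.out.println(k + " = " + v));
        }
    }
}
```

Because the field pattern only ever sees the body of one template, stray key=value text elsewhere in the file is ignored.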