
Using regex in Java to extract a string between two tags in HTML

I have a JSON feed that supplies the HTML used to populate a calendar, and I need to retrieve some of the information from it, for example the title, time, and location. I want to use a regex to get the content between

<span class=\"title\"> 

and

<\/span><br/><b>

and I am trying to use this code

for(int i = 0; i < json.length();  i++)
{
    JSONObject object = new JSONObject(json.getJSONObject(i));
    System.out.println(object.getNames(object));

    Pattern p = Pattern.compile("(?i)(<span class=\"title\">)(.+?)(<\\/span>)");
    Matcher m = p.matcher(json.get(0).toString());
    m.find();
    System.out.println(m.group(0));

But it doesn't seem to do the job. I have tried multiple iterations and researched examples online, but I am not sure whether I am doing something wrong in the regex syntax. Help would be appreciated.

{"hoverContent":"<b>Title: <\/b><span class=\"title\">Accounting Awareness<\/span><br/><b>Time: <\/b><span class=\"time\">5:30 PM - 6:30 PM<br/><b>Location: <\/b><span class=\"location\">1185 Grainger Hall<\/span><br/><b>Description: <\/b><br/><span class=\"description\">Information from Kristen Fuhremann, Director of Professional Programs in Accounting and Q&A from a panel of current and former students who will share their experiences in the accounting program. Panel includes a grad of the IMAcc program currently in law school, a candidate for the IMAcc program who studied abroad, an accounting and finance double major, and an IMAcc student who is also a TA for AIS 100. Casual Attire is appropriate.<br />Contact: Natalie Dickson, <a href=\"mailto:ndickson@wisc.edu\">ndickson@wisc.edu<\/a><\/span><br/>","title":"Accounting Awareness","start":"2013-09-30 17:30:00","allDay":false,"itemId":"2356754a-8178-4afd-b4cf-7f5f5ce89868","end":"2013-09-30 18:30:00"}

null

m.group(0) always returns the entire string that matches the regex. It looks like you want to return a particular group, so you need to use m.group(1) to get the text that matches the first group, m.group(2) for the second group, and so on. In this regex:

"(?i)(<span class=\"title\">)(.+?)(<\\/span>)"

anything in parentheses, except for things that begin with (?, counts as a group, so the portion in (.+?) is the second capture group, and you can try retrieving it with m.group(2). In this case, there's no need to put the <span stuff in parentheses, so you could say

"(?i)<span class=\"title\">(.+?)<\\/span>"

and now use m.group(1) to get at the first (and only) capture group.
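To make the difference between group(0) and group(1) concrete, here is a minimal sketch. It assumes the HTML has already been JSON-decoded, so the text contains plain `</span>` and unescaped quotes; the class and method names (`TitleGroupDemo`, `extractTitle`) are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleGroupDemo {
    // One capture group around the text between the tags.
    private static final Pattern TITLE =
            Pattern.compile("(?i)<span class=\"title\">(.+?)</span>");

    /** Returns the captured title, or null when the pattern does not match. */
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        // group() throws IllegalStateException unless find() has succeeded first.
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Assumed input: a JSON-decoded fragment of the hoverContent value.
        String html = "<b>Title: </b><span class=\"title\">Accounting Awareness</span><br/><b>Time: </b>";
        Matcher m = TITLE.matcher(html);
        if (m.find()) {
            System.out.println(m.group(0)); // <span class="title">Accounting Awareness</span>
            System.out.println(m.group(1)); // Accounting Awareness
        }
    }
}
```

Checking the result of find() before calling group() also avoids an exception when the input does not contain the span at all.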

Using regular expressions to parse markup is not really a good idea from a design standpoint. Personally, I would just wrap the content in a dummy root tag and parse it with an XML parser. There will be some overhead, but you wouldn't use a regex to parse JSON, right? Why not treat XML the same way?
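A rough sketch of the wrap-and-parse idea using the JDK's built-in DOM parser follows. The wrapper tag `<root>` and the names `FragmentParser` and `spanText` are hypothetical, and the sketch assumes the fragment is well-formed XML. Note that the hoverContent value posted in the question leaves the `time` span unclosed and contains a bare `&` in "Q&A", so a strict XML parser would reject it as-is; a lenient HTML parser may be a better fit for input like that.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FragmentParser {
    /** Returns the text of the first span with the given class attribute, or null if there is none. */
    static String spanText(String fragment, String cssClass) {
        // Wrap the fragment in a single dummy root element so it forms one XML document.
        String wrapped = "<root>" + fragment + "</root>";
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(wrapped)));
            NodeList spans = doc.getElementsByTagName("span");
            for (int i = 0; i < spans.getLength(); i++) {
                Element span = (Element) spans.item(i);
                if (cssClass.equals(span.getAttribute("class"))) {
                    return span.getTextContent();
                }
            }
            return null;
        } catch (Exception e) {
            // Thrown, for example, when the fragment is not well-formed XML.
            throw new IllegalArgumentException("Could not parse fragment as XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed input: a well-formed, JSON-decoded fragment.
        String fragment = "<b>Title: </b><span class=\"title\">Accounting Awareness</span><br/>"
                + "<b>Location: </b><span class=\"location\">1185 Grainger Hall</span><br/>";
        System.out.println(spanText(fragment, "title"));
        System.out.println(spanText(fragment, "location"));
    }
}
```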

Try this regex with DOTALL mode, which also avoids the redundant escaping:

Pattern p = Pattern.compile("(?si)<span class=\"title\">(.+?)</span>");
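As a small illustration of what the `s` (DOTALL) flag changes, the sketch below runs both variants against a title that contains a line break. The input string and the names `DotallDemo` and `extract` are invented for the example; the sample feed in the question has no line break inside the title, so there the flag makes no difference.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotallDemo {
    /** Returns capture group 1 of the first match, or null when nothing matches. */
    static String extract(String regex, String html) {
        Matcher m = Pattern.compile(regex).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical input with a newline inside the span.
        String html = "<span class=\"title\">Accounting\nAwareness</span>";

        // Without DOTALL, '.' does not match '\n', so the pattern cannot reach </span>.
        System.out.println(extract("(?i)<span class=\"title\">(.+?)</span>", html));

        // With DOTALL, '.' matches any character, including the line break.
        System.out.println(extract("(?si)<span class=\"title\">(.+?)</span>", html));
    }
}
```

The unescaped `/` in `</span>` is fine because a forward slash has no special meaning in Java regular expressions.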
