Using regex in Java to extract a string between two words in HTML syntax

Question

I have a JSON feed that delivers HTML used to populate a calendar, and I need to retrieve some of the information from it, for example the title, time, and location. I want to use a regex to get the content between

<span class=\"title\">

and

<\/span><br/><b>

and I am trying to use this code

for(int i = 0; i < json.length();  i++)
{
    JSONObject object = new JSONObject(json.getJSONObject(i));
    System.out.println(object.getNames(object));

    Pattern p = Pattern.compile("(?i)(<span class=\"title\">)(.+?)(<\\/span>)");
    Matcher m = p.matcher(json.get(0).toString());
    m.find();
    System.out.println(m.group(0));
}

But it doesn't seem to do the job. I have tried multiple iterations and researched examples online, but I am not sure whether I am doing something wrong in the regex syntax. Help would be appreciated.

{"hoverContent":"<b>Title: <\/b><span class=\"title\">Accounting Awareness<\/span><br/><b>Time: <\/b><span class=\"time\">5:30 PM - 6:30 PM<br/><b>Location: <\/b><span class=\"location\">1185 Grainger Hall<\/span><br/><b>Description: <\/b><br/><span class=\"description\">Information from Kristen Fuhremann, Director of Professional Programs in Accounting and Q&A from a panel of current and former students who will share their experiences in the accounting program. Panel includes a grad of the IMAcc program currently in law school, a candidate for the IMAcc program who studied abroad, an accounting and finance double major, and an IMAcc student who is also a TA for AIS 100. Casual Attire is appropriate.<br />Contact: Natalie Dickson, <a href=\"mailto:ndickson@wisc.edu\">ndickson@wisc.edu<\/a><\/span><br/>","title":"Accounting Awareness","start":"2013-09-30 17:30:00","allDay":false,"itemId":"2356754a-8178-4afd-b4cf-7f5f5ce89868","end":"2013-09-30 18:30:00"}

null

Answer 1

m.group(0) always returns the entire substring matched by the regex. It looks like you want a particular group, so you need to use m.group(1) to get the text that matches the first group, m.group(2) for the second group, and so on. In this regex:

"(?i)(<span class=\"title\">)(.+?)(<\\/span>)"

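anything in parentheses counts as a capture group, except for constructs that begin with (? such as the (?i) flag or a non-capturing (?:...) group (named groups, written (?<name>...), are the one exception and do capture). So the portion in (.+?) is the second capture group, and you can retrieve it with m.group(2). In this case, there's no need to put the <span> tags in parentheses, so you could write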

"(?i)<span class=\"title\">(.+?)<\\/span>"

and now use m.group(1) to get at the first (and only) capture group.
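
To make the difference between group 0 and group 1 concrete, here is a minimal sketch. The class and method names (TitleGroupDemo, extractTitle, wholeMatch) are made up for illustration, and it assumes the HTML has already been decoded from the JSON, so the string contains plain " and </span> rather than the escaped \" and <\/span> forms.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the asker's code.
class TitleGroupDemo {
    // Only the text between the tags is parenthesised, so it is group 1.
    private static final Pattern TITLE =
            Pattern.compile("(?i)<span class=\"title\">(.+?)</span>");

    /** Returns the captured title text, or null if the pattern does not match. */
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        // Check find() first: group() throws IllegalStateException without a match.
        return m.find() ? m.group(1) : null;
    }

    /** Returns the whole match (tags included), or null if there is no match. */
    static String wholeMatch(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(0) : null;
    }

    public static void main(String[] args) {
        // Assumed input: HTML already decoded from the JSON feed.
        String html = "<b>Title: </b><span class=\"title\">Accounting Awareness</span><br/>";
        System.out.println(wholeMatch(html));   // whole match, tags included
        System.out.println(extractTitle(html)); // captured text only
    }
}
```

Checking the result of find() before calling group() also avoids an exception when an entry has no title span.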

Answer 2

Using a regexp to parse HTML is not really a good idea from a design standpoint. I would personally just wrap the content in a fake root tag and parse it using an XML parser. There will be some overhead, but you wouldn't use a regexp to parse JSON, right? Why not do the same for XML?
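
A rough sketch of the fake-tag idea using the JDK's built-in DOM parser follows. The names HoverContentParser and spanText and the <root> wrapper element are invented for illustration. It only works when the fragment is well-formed XML; the hoverContent sample in the question is not (the time span is never closed and "Q&A" contains a bare ampersand), so a strict XML parser would reject that entry and a lenient HTML parser would be needed instead.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of the asker's code.
class HoverContentParser {
    /** Returns the text of the first span with the given class attribute, or null if none exists. */
    static String spanText(String fragment, String cssClass) {
        // Wrap the fragment in a made-up root element so it forms a single XML document.
        String wrapped = "<root>" + fragment + "</root>";
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(wrapped)));
            NodeList spans = doc.getElementsByTagName("span");
            for (int i = 0; i < spans.getLength(); i++) {
                Element span = (Element) spans.item(i);
                if (cssClass.equals(span.getAttribute("class"))) {
                    return span.getTextContent();
                }
            }
            return null;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("fragment is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed input: a well-formed fragment, already decoded from the JSON feed.
        String fragment = "<b>Title: </b><span class=\"title\">Accounting Awareness</span><br/>"
                + "<b>Location: </b><span class=\"location\">1185 Grainger Hall</span><br/>";
        System.out.println(spanText(fragment, "title"));
        System.out.println(spanText(fragment, "location"));
    }
}
```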

Answer 3

Try this regex in DOTALL mode, which also avoids the redundant escaping:

Pattern p = Pattern.compile("(?si)<span class=\"title\">(.+?)</span>");
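
As a small illustration of what the s flag changes, the sketch below runs the same pattern with and without DOTALL against a title that contains a line break. The class name DotallDemo, the helper firstGroup, and the multi-line input are assumptions made up for the example; the title in the question's sample sits on a single line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the asker's code.
class DotallDemo {
    /** Returns group 1 of the first match of regex in html, or null if there is no match. */
    static String firstGroup(String regex, String html) {
        Matcher m = Pattern.compile(regex).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up input whose title spans two lines.
        String html = "<span class=\"title\">Accounting\nAwareness</span>";

        // Without (?s), '.' does not match a line terminator, so there is no match.
        System.out.println(firstGroup("(?i)<span class=\"title\">(.+?)</span>", html));

        // With (?s), '.' also matches '\n', so the two-line title is captured.
        System.out.println(firstGroup("(?si)<span class=\"title\">(.+?)</span>", html));
    }
}
```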
