How to match a string using a regular expression

I have a string that contains multiple occurrences of "<p class=a> ... </p>", where ... is different text each time.

I am using the regex pattern "<p class=a>(.*)</p>" to split the text into chunks, but it is not working. What would be the correct regex for this?

PS: The same regex pattern works on iOS using NSRegularExpression but does not work on Android using Pattern.

To explain my problem further, I am doing the following:

Pattern regex3 = Pattern.compile("(?s)<P Class=ENCC>(.*?)</P>", Pattern.CASE_INSENSITIVE);
String[] result = regex3.split(str);

The result array contains only one item, and it is the whole string.

The following is a portion of the file that I am reading:

<BODY>
    <SYNC Start=200>
      <P Class=ENCC><i>Cerita, Watak, Adegan dalam</i><br/><i>Drama Ini Rekaan Semata-Mata.</i></P>
    </SYNC>
    <SYNC Start=2440>
      <P Class=ENCC>&nbsp;</P>
    </SYNC>
    <SYNC Start=2560>
      <P Class=ENCC><i>Kami Tidak Berniat</i><br/><i>Melukakan Hati Sesiapa.</i></P>
    </SYNC>
    <SYNC Start=4560>
      <P Class=ENCC>&nbsp;</P>
    </SYNC>
    <SYNC Start=66160>
      <P Class=ENCC>Hai kawan-kawan.<br/>Inilah bandaraya Banting.</P>
    </SYNC>

UPDATE:

Hi everybody, I found the problem. It was the encoding of the file I was reading: the file was UTF-16 (Little Endian) encoded, and that was what kept the regex from matching. I changed it to UTF-8 and everything started working. Thanks, everybody, for your support.
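
As a rough illustration of the encoding issue described in this update, the sketch below decodes the same bytes once as UTF-16LE and once as UTF-8 before running the regex. The class and method names (EncodingDemo, paragraphs) are made up for this example, and the bytes are built in memory rather than read from the real file; with an actual file the bytes would come from whatever file-reading call the app uses.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EncodingDemo {
    // Same idea as the pattern in the question: case-insensitive, dot matches newlines.
    private static final Pattern PARAGRAPH =
            Pattern.compile("<P Class=ENCC>(.*?)</P>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical helper: decodes raw bytes with the given charset
    // and collects the body of every matching paragraph.
    static List<String> paragraphs(byte[] raw, Charset charset) {
        String text = new String(raw, charset);
        List<String> bodies = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(text);
        while (m.find()) {
            bodies.add(m.group(1));
        }
        return bodies;
    }

    public static void main(String[] args) {
        // Stand-in for the contents of a UTF-16 (Little Endian) file.
        byte[] raw = "<P Class=ENCC>Hai kawan-kawan.</P>".getBytes(StandardCharsets.UTF_16LE);

        // Decoded with the right charset, the pattern finds the paragraph.
        System.out.println(paragraphs(raw, StandardCharsets.UTF_16LE)); // prints [Hai kawan-kawan.]

        // Decoded as UTF-8, every character is followed by a NUL character,
        // so the literal "<P Class=ENCC>" never appears in the text.
        System.out.println(paragraphs(raw, StandardCharsets.UTF_8));    // prints []
    }
}
```

Converting the file to UTF-8, as described above, and passing the file's actual charset when decoding are two ways of getting the same result.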

Parsing HTML with regular expressions is not really a good idea (reason here). What you should use is an HTML parser such as this.

That being said, your issue is most likely that the * operator is greedy. Your question only says that it is not working, so my guess is that the pattern matches from the first <p class=a> all the way to the very last </p>. Making the quantifier non-greedy, like so: <p class=a>(.*?)</p> (notice the extra ? after the *), should solve the problem, assuming it is the one described above.

Even so, I would really recommend that you ditch the regular expression approach and use a proper HTML parser.
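
To make the greedy versus non-greedy difference concrete, here is a small sketch that runs both patterns over a two-paragraph input. The class name GreedyDemo, the helper firstMatch, and the input string are invented for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Hypothetical helper: returns the text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<p class=a>first</p><p class=a>second</p>";

        // Greedy: .* runs to the end of the input, then backtracks to the LAST </p>.
        System.out.println(firstMatch("<p class=a>(.*)</p>", input));
        // prints <p class=a>first</p><p class=a>second</p>

        // Non-greedy: .*? stops at the FIRST </p> it can reach.
        System.out.println(firstMatch("<p class=a>(.*?)</p>", input));
        // prints <p class=a>first</p>
    }
}
```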

EDIT:

Now that you've posted the code and the text you're matching against, one thing immediately leaps to mind:

You're matching <p class... , but your string contains <P Class... . Regexes are case-sensitive by default.

Also, . does not match newlines by default, and it's quite likely that your paragraphs contain newlines.

Therefore, try "(?si)<p class=a>(.*?)</p>" . The (?s) modifier allows the dot to match newlines, too, and the (?i) modifier makes the regex case-insensitive.
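
A minimal sketch of those two modifiers in use, assuming the class name ENCC from the posted sample rather than the placeholder a. FlagsDemo and extract are names made up for this example. It collects the captured text with Matcher.find() instead of Pattern.split(), because split() returns the text between matches rather than the matched paragraphs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagsDemo {
    // (?s) lets '.' match line terminators; (?i) makes the match case-insensitive.
    private static final Pattern PARAGRAPH = Pattern.compile("(?si)<p class=encc>(.*?)</p>");

    // Hypothetical helper: returns the body of every matching paragraph, in order.
    static List<String> extract(String input) {
        List<String> bodies = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(input);
        while (m.find()) {
            bodies.add(m.group(1));
        }
        return bodies;
    }

    public static void main(String[] args) {
        // Upper-case tags and a line break inside the first paragraph.
        String sample = "<SYNC Start=2560>\n"
                + "  <P Class=ENCC><i>Kami Tidak Berniat</i><br/>\n"
                + "<i>Melukakan Hati Sesiapa.</i></P>\n"
                + "</SYNC>\n"
                + "<SYNC Start=4560>\n"
                + "  <P Class=ENCC>&nbsp;</P>\n"
                + "</SYNC>";

        for (String body : extract(sample)) {
            System.out.println("[" + body + "]");
        }
    }
}
```

Without (?s) the first paragraph would be skipped because of the line break, and without (?i) neither paragraph would match the lower-case pattern.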

I guess the problem is that your pattern is greedy. You should use this instead:

"<p class=a>(.*?)</p>"

If you have this string:

"<p class=a>first</p><p class=a>second</p>"

Your pattern ( "<p class=a>(.*)</p>" ) will match the whole string in one go:

"<p class=a>first</p><p class=a>second</p>"

While "<p class=a>(.*?)</p>" stops at the first closing tag, so its first match is only

"<p class=a>first</p>"

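The .* may also match < , so it can run past the closing tag. You can try a negated character class instead (this only works when the paragraph text contains no nested tags):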

<p class=a>([^<]*)</p>
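
A short sketch of how that pattern behaves, with invented names (NegatedClassDemo, extract) and invented inputs. It also shows the limitation: a paragraph with nested tags, like the <i> elements in the question's sample, is not matched at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedClassDemo {
    // [^<]* matches any run of characters that contains no '<', so it cannot cross a tag.
    private static final Pattern PARAGRAPH = Pattern.compile("<p class=a>([^<]*)</p>");

    // Hypothetical helper: returns the body of every matching paragraph, in order.
    static List<String> extract(String input) {
        List<String> bodies = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(input);
        while (m.find()) {
            bodies.add(m.group(1));
        }
        return bodies;
    }

    public static void main(String[] args) {
        // Plain-text paragraphs are captured one by one.
        System.out.println(extract("<p class=a>first</p><p class=a>second</p>")); // prints [first, second]

        // A nested tag puts a '<' inside the body, so nothing matches.
        System.out.println(extract("<p class=a><i>first</i></p>")); // prints []
    }
}
```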
