
How to match a string using a regular expression

I have a string that contains multiple occurrences of "<p class=a> ... </p>", where ... is different text each time.

I am using the regex pattern "<p class=a>(.*)</p>" to split the text into chunks, but this is not working. What would be the correct regex for this?

P.S. The same regex pattern works on iOS using NSRegularExpression, but not on Android using Pattern.

To explain my problem further, I am doing the following:

Pattern regex3 = Pattern.compile("(?s)<P Class=ENCC>(.*?)</P>", Pattern.CASE_INSENSITIVE);
String[] result = regex3.split(str);

The result array contains only one item, and it is the whole string.

The following is a portion of the file that I am reading:

<BODY>
    <SYNC Start=200>
      <P Class=ENCC><i>Cerita, Watak, Adegan dalam</i><br/><i>Drama Ini Rekaan Semata-Mata.</i></P>
    </SYNC>
    <SYNC Start=2440>
      <P Class=ENCC>&nbsp;</P>
    </SYNC>
    <SYNC Start=2560>
      <P Class=ENCC><i>Kami Tidak Berniat</i><br/><i>Melukakan Hati Sesiapa.</i></P>
    </SYNC>
    <SYNC Start=4560>
      <P Class=ENCC>&nbsp;</P>
    </SYNC>
    <SYNC Start=66160>
      <P Class=ENCC>Hai kawan-kawan.<br/>Inilah bandaraya Banting.</P>
    </SYNC>

UPDATE:

Hi everybody, I have found the problem. It was actually the encoding of the file I was reading: the file was UTF-16 (Little Endian) encoded, and that was what kept the regex from working. I changed it to UTF-8 and everything started working. Thanks, everybody, for your support.
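A minimal sketch of how a wrong charset can make a correct pattern miss, assuming the file's bytes were decoded with something other than UTF-16LE. The class name EncodingDemo and the helper matchesWhenDecodedAs are hypothetical, introduced only for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class EncodingDemo {
    // Same idea as the pattern in the question: lazy capture, dot matches newlines, case-insensitive.
    private static final Pattern PARAGRAPH =
            Pattern.compile("<P Class=ENCC>(.*?)</P>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: decode the raw bytes with the given charset, then look for a match.
    static boolean matchesWhenDecodedAs(byte[] raw, Charset charset) {
        String text = new String(raw, charset);
        return PARAGRAPH.matcher(text).find();
    }

    public static void main(String[] args) {
        // Stand-in for the file contents: the same text stored as UTF-16 (Little Endian) bytes.
        byte[] raw = "<P Class=ENCC>Hai kawan-kawan.</P>".getBytes(StandardCharsets.UTF_16LE);

        // Decoded as UTF-8, every character is followed by a NUL, so "<P" never appears contiguously.
        System.out.println(matchesWhenDecodedAs(raw, StandardCharsets.UTF_8));    // prints "false"

        // Decoded with the charset the bytes were actually written in, the pattern matches.
        System.out.println(matchesWhenDecodedAs(raw, StandardCharsets.UTF_16LE)); // prints "true"
    }
}
```

As an alternative to converting the file, passing the charset explicitly when reading (for example `new InputStreamReader(in, StandardCharsets.UTF_16LE)`) should have the same effect.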

Parsing HTML with regular expressions is not really a good idea. What you should use is an HTML parser.

That being said, your issue is most likely that the * operator is greedy. In your question you only say that it is not working, so I think your problem is that the pattern matches from the first <p class=a> to the very last </p>. Making the regular expression non-greedy, like so: <p class=a>(.*?)</p> (notice the extra ? that makes the * operator non-greedy), should solve the problem, assuming your problem is the one I described above.

Still, I would really recommend that you ditch the regular expression approach and use an appropriate HTML parser.

EDIT:

Now that you've posted the code and the text you're matching against, one thing immediately leaps to mind:

You're matching <p class..., but your string contains <P Class.... Regexes are case-sensitive by default.

Also, . does not match newlines by default, and it's quite likely that your paragraphs do contain newlines.

Therefore, try "(?si)<p class=a>(.*?)</p>". The (?s) modifier allows the dot to match newlines too, and the (?i) modifier makes the regex case-insensitive.
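A sketch of how the (?si) pattern could be used to pull out each paragraph body, assuming the ENCC class from the posted sample and collecting group(1) with Matcher.find() rather than calling split (split returns the text between matches, not the captured groups). The class name ParagraphExtractor and the method extract are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParagraphExtractor {
    // (?s) lets the dot match line terminators; (?i) makes matching case-insensitive.
    private static final Pattern PARAGRAPH = Pattern.compile("(?si)<p class=encc>(.*?)</p>");

    // Hypothetical helper: returns the captured body of every matching paragraph, in order.
    static List<String> extract(String input) {
        List<String> chunks = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(input);
        while (m.find()) {
            chunks.add(m.group(1));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Excerpt modelled on the file in the question.
        String sample =
                "<SYNC Start=2560>\n"
              + "  <P Class=ENCC><i>Kami Tidak Berniat</i><br/><i>Melukakan Hati Sesiapa.</i></P>\n"
              + "</SYNC>\n"
              + "<SYNC Start=66160>\n"
              + "  <P Class=ENCC>Hai kawan-kawan.<br/>Inilah bandaraya Banting.</P>\n"
              + "</SYNC>\n";

        for (String chunk : extract(sample)) {
            System.out.println(chunk);
        }
    }
}
```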

I guess the problem is that your pattern is greedy. You should use this instead:

"<p class=a>(.*?)</p>"

If you have this string:

"<p class=a>first</p><p class=a>second</p>"

Your pattern ("<p class=a>(.*)</p>") will match this:

"<p class=a>first</p><p class=a>second</p>"

While "<p class=a>(.*?)</p>" only matches:

"<p class=a>first</p>"
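The greedy and non-greedy behaviour can be checked with a short Java sketch. The class name GreedyVsLazy, the helper firstMatch, and the input string are assumptions made for the example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsLazy {
    // Hypothetical helper: returns the text of the first match of regex in input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<p class=a>one</p><p class=a>two</p>";

        // Greedy: .* runs to the end and backtracks only as far as the last </p>.
        System.out.println(firstMatch("<p class=a>(.*)</p>", input));
        // prints "<p class=a>one</p><p class=a>two</p>"

        // Non-greedy: .*? stops at the first </p> it can reach.
        System.out.println(firstMatch("<p class=a>(.*?)</p>", input));
        // prints "<p class=a>one</p>"
    }
}
```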

The .* may also match <. You can try:

<p class=a>([^<]*)</p>
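A small sketch of what the negated character class does in practice, under the assumption that paragraphs may or may not contain nested tags. The class name NegatedClassDemo and the helper capture are hypothetical. Note that [^<]* stops at any <, so a paragraph with inline markup such as <br/> is not matched by this pattern:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegatedClassDemo {
    // [^<]* matches any run of characters that contains no '<'.
    private static final Pattern PARAGRAPH = Pattern.compile("<p class=a>([^<]*)</p>");

    // Hypothetical helper: returns the first captured paragraph body, or null if nothing matches.
    static String capture(String input) {
        Matcher m = PARAGRAPH.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Plain-text paragraphs: the capture cannot run past the first closing tag.
        System.out.println(capture("<p class=a>one</p><p class=a>two</p>")); // prints "one"

        // A paragraph with a nested tag: [^<]* stops at "<br/>", so there is no match.
        System.out.println(capture("<p class=a>Hai<br/>Banting</p>"));       // prints "null"
    }
}
```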
