Python: Regular expression to match "<textarea></textarea>" and anything between them

If the text was

<textarea> xyz asdf qwr </textarea>

I'm trying to write a regular expression which will help me extract the text between the tags (xyz asdf qwr, shown in bold in the original post).

So far I have reached [(<textarea)][</textarea>)], which will capture the tags, but I haven't been able to actually capture the text in between the two tags.

I also tried [(<textarea)]+.[</textarea>)] and even [[(<textarea)]+.[</textarea>)], but that isn't giving the right results either.

Can someone please shed some light on this or share some links which will help me reach a solution?

Is there a particular reason that you must use a regular expression to parse what seems like HTML? I wouldn't do it. See "RegEx match open tags except XHTML self-contained tags" for the best explanation.

This becomes really simple if you use the BeautifulSoup module, which is going to be far better at parsing HTML (especially if it is messy HTML).

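import bs4

# Open the file and parse it with the standard-library HTML parser backend
with open("test.html") as f:
    soup = bs4.BeautifulSoup(f, "html.parser")

# Print the text content of every <textarea> element
for textarea in soup.find_all('textarea'):
    print(textarea.get_text())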
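If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is only a minimal sketch: the class name TextareaCollector, the helper extract_textareas and the sample string are made up for illustration, and it assumes textarea elements are not nested.

```python
from html.parser import HTMLParser


class TextareaCollector(HTMLParser):
    """Hypothetical helper: collects the text inside every <textarea>."""

    def __init__(self):
        super().__init__()
        self.contents = []     # one string per <textarea> seen so far
        self._inside = False   # True while between <textarea> and </textarea>

    def handle_starttag(self, tag, attrs):
        if tag == "textarea":
            self._inside = True
            self.contents.append("")

    def handle_endtag(self, tag):
        if tag == "textarea":
            self._inside = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the current entry
        if self._inside:
            self.contents[-1] += data


def extract_textareas(html):
    """Return the text content of each <textarea> in the given HTML string."""
    parser = TextareaCollector()
    parser.feed(html)
    parser.close()
    return parser.contents


# Sample input invented for this sketch
sample = "<textarea> xyz asdf qwr </textarea><p>other</p><textarea>second</textarea>"
print(extract_textareas(sample))
```

BeautifulSoup is still the more forgiving choice for messy markup; this version just avoids the extra dependency.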

You shouldn't parse HTML with regex - parse it with an HTML parser! See this answer.

That being said, if you must use a regex:

The square brackets [] mean "match any one character inside", so [<(textarea)] means "match a single character that is <, (, t, e, x, a, r, or )".
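To see the character-class behaviour concretely, here is a small sketch using the sample text from the question. The variable names are just for illustration.

```python
import re

text = "<textarea> xyz asdf qwr </textarea>"

# A character class matches exactly ONE character drawn from the set,
# so this pattern matches a single '<', '(', 't', 'e', 'x', 'a', 'r' or ')'.
char_class = re.compile(r"[(<textarea)]")
first = char_class.match(text)
print(first.group(0))        # only the first character of the text

# Every hit is a single character, never the whole tag
single_chars = char_class.findall(text)
print(single_chars)

# Without the square brackets the pattern is a literal sequence of characters
literal = re.compile(r"<textarea>").match(text)
print(literal.group(0))
```

The brackets turn the tag name into a set of individual letters, which is why the original attempts never matched the tag as a unit.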

You probably want <textarea>(.*?)</textarea>, with group 1 (the first set of parentheses) being the contents of the tag.
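As a rough sketch of how that pattern might be used: the helper name textarea_contents and the sample string are invented here, and the pattern assumes lowercase tags with no attributes. The re.DOTALL flag is added because "." does not match a newline by default, and text areas often span several lines.

```python
import re

# Non-greedy capture of everything between the opening and closing tag.
# re.DOTALL lets "." match newlines as well.
TEXTAREA_RE = re.compile(r"<textarea>(.*?)</textarea>", re.DOTALL)


def textarea_contents(html):
    """Return group 1 (the tag contents) for every match in html."""
    return TEXTAREA_RE.findall(html)


# Sample input made up for this sketch
html = "<textarea> xyz asdf qwr </textarea>\n<textarea>line one\nline two</textarea>"
print(textarea_contents(html))
```

With a single capturing group, re.findall returns just the captured contents, which is usually what you want here.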

This will have problems if, for example, the user writes "</textarea>" inside the text area: only the text up to the first occurrence of "</textarea>" will be extracted. However, if you make it greedy and use <textarea>.*</textarea>, then when there are multiple textarea tags the .* will match across all of them instead of each one individually. Such are the pitfalls of using regex with HTML.
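Both pitfalls are easy to reproduce. The inputs below are invented for the demonstration.

```python
import re

# Two separate text areas on one line
two_tags = "<textarea>first</textarea><textarea>second</textarea>"

# Greedy .* runs to the LAST closing tag, swallowing both elements
greedy = re.findall(r"<textarea>(.*)</textarea>", two_tags)
# Non-greedy .*? stops at the FIRST closing tag, so each element is separate
lazy = re.findall(r"<textarea>(.*?)</textarea>", two_tags)
print(greedy)
print(lazy)

# A text area whose content contains the literal string "</textarea>"
tricky = "<textarea>a </textarea> b</textarea>"
# The non-greedy pattern now cuts the content short
truncated = re.findall(r"<textarea>(.*?)</textarea>", tricky)
print(truncated)
```

Neither quantifier gets both cases right, which is the argument for a real HTML parser.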

I think you were struggling to understand that the "+" and "*" operators apply to the pattern that comes immediately before them, not the pattern that comes after.
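A quick way to convince yourself of this is to try a few tiny patterns. The strings here are arbitrary examples.

```python
import re

# "+" binds to the element just before it: here only the "b"
plus_on_b = re.fullmatch(r"ab+", "abbb")        # matches: a, then b repeated
plus_not_on_ab = re.fullmatch(r"ab+", "abab")   # None: "ab" as a unit is not repeated
plus_on_group = re.fullmatch(r"(ab)+", "abab")  # matches: the group is repeated

# A bare "." is exactly one character; ".*" repeats the "." zero or more times
one_char = re.fullmatch(r"<textarea>.</textarea>", "<textarea>x</textarea>")
too_long = re.fullmatch(r"<textarea>.</textarea>", "<textarea>xyz</textarea>")
any_length = re.fullmatch(r"<textarea>.*</textarea>", "<textarea>xyz</textarea>")

print(bool(plus_on_b), bool(plus_not_on_ab), bool(plus_on_group))
print(bool(one_char), bool(too_long), bool(any_length))
```

So in the original attempts, the "+" repeated the character class in front of it, and the lone "." only ever stood for a single character.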

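>>> import re
>>> target = "<textarea> xyz asdf qwr </textarea>"
>>> re.match(r"\<textarea\>.*\<textarea/\>", target)  # no match: "<textarea/>" is not the closing tag
>>> re.match(r"\<textarea\>.*\</textarea>", target)
<re.Match object; span=(0, 35), match='<textarea> xyz asdf qwr </textarea>'>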
>>> mo = re.match(r"\<textarea\>.*\</textarea>", target)
>>> mo.groups()
()
>>> mo.group(0)
'<textarea> xyz asdf qwr </textarea>'
>>> mo = re.match(r"\<textarea\>(.*)\</textarea>", target)
>>> mo.groups()
(' xyz asdf qwr ',)
>>> mo.group(0)
'<textarea> xyz asdf qwr </textarea>'
>>> mo.group(1)
' xyz asdf qwr '
>>>

Does that help?
