Python regex to extract html paragraph
I'm trying to extract paragraphs from HTML by using the following line of code:
paragraphs = re.match(r'<p>.{1,}</p>', html)
but it returns None even though I know the HTML contains paragraphs. Why?
Why not use an HTML parser to, well, parse HTML? Example using BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>>
>>> data = """
... <div>
... <p>text1</p>
... <p></p>
... <p>text2</p>
... </div>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> [p.get_text() for p in soup.find_all("p", text=True)]
[u'text1', u'text2']
Note that text=True helps to filter out empty paragraphs (newer versions of BeautifulSoup also accept this argument under the name string).
Make sure you use re.search (or re.findall) instead of re.match, which only attempts to match at the beginning of the html string (your html almost certainly does not begin with a <p> tag).
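To make the difference concrete, here is a minimal sketch. The sample html string is made up for illustration; the pattern is the one from the question.

```python
import re

# Hypothetical sample document; the paragraph is not at the start of the string.
html = "<div><p>first paragraph</p></div>"
pattern = r"<p>.{1,}</p>"

# re.match only tries the pattern at position 0, where the string has "<div>".
anchored = re.match(pattern, html)

# re.search scans forward until the pattern matches somewhere in the string.
found = re.search(pattern, html)

print(anchored)        # None
print(found.group(0))  # <p>first paragraph</p>
```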
You should also note that your search is currently greedy, meaning it will return everything between the first <p> tag and the last </p> tag, which is something you definitely do not want. Try
re.findall(r'<p(\s.*?)?>(.*?)</p>', response.text, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
instead. The question mark makes the quantifier non-greedy, so each match stops at the first closing </p> tag, and findall returns every match, whereas search returns only the first one.
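A small sketch of the greedy versus non-greedy behaviour, using an invented html string. The re.MULTILINE flag is left out here on the assumption that it is not needed, since it only changes how ^ and $ behave and the pattern uses neither.

```python
import re

# Hypothetical sample with two paragraphs on the same line.
html = '<div>\n<p>text1</p><p class="x">text2</p>\n</div>'

# Greedy: .{1,} runs to the last </p> it can reach, swallowing both paragraphs.
greedy = re.findall(r"<p>.{1,}</p>", html)

# Non-greedy: .*? stops at the first </p>; findall returns one
# (attributes, body) tuple per paragraph because the pattern has two groups.
lazy = re.findall(r"<p(\s.*?)?>(.*?)</p>", html, flags=re.IGNORECASE | re.DOTALL)
texts = [body for _attrs, body in lazy]

print(greedy)  # ['<p>text1</p><p class="x">text2</p>']
print(texts)   # ['text1', 'text2']
```

Because the pattern contains two capturing groups, findall yields tuples rather than plain strings, which is why the body is picked out in a second step.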
You should be using re.search instead of re.match. The former will search the entire string whereas the latter will only match if the pattern is at the beginning of the string.
That said, regular expressions are a horrible tool for parsing HTML. You will hit a wall with them very shortly. I strongly recommend you look at HTMLParser or BeautifulSoup for your task.
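For the HTMLParser route, a rough sketch using only the standard library might look like the following. It assumes Python 3, where the class lives in html.parser; the ParagraphCollector name and the sample data are invented for illustration, and nested <p> tags are not handled.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text of each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # list of text chunks while inside a <p>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears between <p> and </p>.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:  # skip empty paragraphs
                self.paragraphs.append(text)
            self._buffer = None


data = """
<div>
<p>text1</p>
<p></p>
<p>text2 with <b>bold</b> text</p>
</div>
"""

collector = ParagraphCollector()
collector.feed(data)
collector.close()
print(collector.paragraphs)  # ['text1', 'text2 with bold text']
```

Unlike the regex, this keeps working when paragraphs carry attributes or contain inline markup such as <b>.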