Python regex to extract html paragraph

Question

I'm trying to extract paragraphs from HTML using the following line of code:

paragraphs = re.match(r'<p>.{1,}</p>', html)

but it returns None even though I know the HTML contains paragraphs. Why?

Answer 1

Why not use an HTML parser to, well, parse HTML? Example using BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> 
>>> data = """
...     <div>
...         <p>text1</p>
...         <p></p>
...         <p>text2</p>
...     </div>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> [p.get_text() for p in soup.find_all("p", text=True)]
[u'text1', u'text2']

Note that text=True filters out empty paragraphs. In Beautiful Soup 4.4.0 and later the same argument is also available as string=True.

Answer 2

Make sure you use re.search (or re.findall) instead of re.match, which only matches at the beginning of the string (your html almost certainly does not begin with a <p> tag).

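Also note that your pattern is greedy: .{1,} consumes as much as it can, so on a single line it would return everything between the first <p> tag and the last </p>, which is definitely not what you want. In addition, . does not match newlines unless re.DOTALL is set, so paragraphs spanning several lines are missed. Try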

re.findall(r'<p(\s.*?)?>(.*?)</p>', response.text, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)

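instead. The question mark after .* makes the quantifier lazy, so each match stops at the first closing </p> tag, and findall returns every match whereas search returns only the first. Since the pattern contains two groups, findall returns a list of (attributes, content) tuples.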
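To make the greedy versus lazy difference concrete, here is a minimal sketch. The sample html string is made up for illustration, and the pattern is a variation on the one above that uses a non-capturing group for the attributes so that findall returns plain strings; treat it as an assumption, not a robust HTML matcher.

```python
import re

# Hypothetical sample input, not taken from the question.
html = '<div>\n<p>text1</p>\n<p class="x">text2\nmore</p>\n</div>'

# Greedy: with DOTALL, .* runs from the first <p> to the LAST </p>.
greedy = re.findall(r"<p>.*</p>", html, flags=re.DOTALL)

# Lazy: .*? stops at the first </p> after each opening tag.
# (?:\s[^>]*)? optionally skips attributes without creating a group.
paragraph = re.compile(r"<p(?:\s[^>]*)?>(.*?)</p>", re.IGNORECASE | re.DOTALL)
lazy = paragraph.findall(html)

print(greedy)  # one long match spanning both paragraphs
print(lazy)    # one entry per paragraph
```

Requiring whitespace or > right after <p keeps the pattern from matching other tags that merely start with p, such as <pre>.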

Answer 3

You should be using re.search instead of re.match. The former scans the entire string, whereas the latter only matches if the pattern occurs at the beginning of the string.
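A small sketch of that difference, using a made-up html string (an assumption for illustration only):

```python
import re

# Hypothetical sample input: the <p> tag is not at position 0.
html = "<div><p>text1</p></div>"
pattern = r"<p>.+?</p>"

at_start = re.match(pattern, html)   # anchored at the beginning of the string
anywhere = re.search(pattern, html)  # scans forward for the first match

print(at_start)           # None, because the string starts with <div>
print(anywhere.group(0))  # <p>text1</p>
```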

That said, regular expressions are a horrible tool for parsing HTML. You will hit a wall with them very shortly. I strongly recommend you look at HTMLParser or BeautifulSoup for your task.
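For the standard-library route, a rough sketch might look like the following. It assumes Python 3, where the class lives in html.parser; the ParagraphCollector name and the sample markup are hypothetical, and nested <p> tags are not handled.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Hypothetical helper: gathers the text of each non-empty <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes are ignored here.
        if tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._in_p = False
            text = "".join(self._buffer).strip()
            if text:  # skip empty paragraphs
                self.paragraphs.append(text)

    def handle_data(self, data):
        # Text inside nested inline tags such as <b> is collected too.
        if self._in_p:
            self._buffer.append(data)


collector = ParagraphCollector()
collector.feed("<div><p>text1</p><p></p><p class='x'>text <b>2</b></p></div>")
collector.close()
print(collector.paragraphs)  # ['text1', 'text 2']
```

This is more code than the BeautifulSoup version, but it needs no third-party package.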

3 answers

solution1
11 ACCEPTED 2015-12-29 01:44:43

solution2
6 2015-12-29 01:40:57

solution3
2 2015-12-29 01:40:33
