Python re.findall() returns the entire string as one match

Question

I am writing a crawler to extract certain parts of an HTML file, but I cannot figure out how to use re.findall().

Here is an example: when I want to find all ... parts in the file, I may write something like this:

re.findall("<div>.*\</div>", result_page)

If result_page is the string "<div> </div> <div> </div>", the result will be

['<div> </div> <div> </div>']

That is only the entire string. This is not what I want; I am expecting the two divs separately. What should I do?

Answer 1

Quoting the documentation,

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.

Just add the question mark:

In [6]: re.findall("<div>.*?</div>", result_page)
Out[6]: ['<div> </div>', '<div> </div>']
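
To see the two behaviours side by side, here is a minimal sketch that runs both patterns on the sample string from the question. The variable names greedy and non_greedy are only illustrative.

```python
import re

result_page = "<div> </div> <div> </div>"

# Greedy: .* runs to the end of the string and backtracks only as far as
# the last </div>, so everything lands in a single match.
greedy = re.findall(r"<div>.*</div>", result_page)

# Non-greedy: .*? stops at the first </div> it can reach, so each div
# becomes its own match.
non_greedy = re.findall(r"<div>.*?</div>", result_page)

print(greedy)      # ['<div> </div> <div> </div>']
print(non_greedy)  # ['<div> </div>', '<div> </div>']
```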

Also, you shouldn't use regular expressions to parse HTML, since there are HTML parsers made exactly for that. Example using BeautifulSoup 4:

In [7]: import bs4

In [8]: [str(tag) for tag in bs4.BeautifulSoup(result_page, 'html.parser')('div')]
Out[8]: ['<div> </div>', '<div> </div>']
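
As a rough illustration of why a parser is the safer choice, the sketch below feeds the non-greedy pattern two inputs it was not written for: a nested div and a div with an attribute. Both sample strings are made up for this example and are not from the question.

```python
import re

pattern = r"<div>.*?</div>"

# Nested divs: the first </div> the pattern reaches closes the inner div,
# so the match for the outer div is cut short.
nested = "<div>outer <div>inner</div> tail</div>"
nested_matches = re.findall(pattern, nested)
print(nested_matches)  # ['<div>outer <div>inner</div>']

# A div with an attribute: the literal "<div>" never appears in the text,
# so nothing matches at all.
with_attr = '<div class="a"> </div>'
attr_matches = re.findall(pattern, with_attr)
print(attr_matches)  # []
```

The pattern can be patched for each of these cases, but every patch only covers one more shape of HTML.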

Answer 2

* is a greedy operator; use *? for a non-greedy match.

re.findall("<div>.*?</div>", result_page)
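
One more thing to watch for on real pages: by default . does not match a newline, so a div that spans several lines will not match even with *?. A small sketch, assuming a made-up multi-line input, shows the effect of the re.DOTALL flag:

```python
import re

multi_line = "<div>\nfirst\n</div>\n<div>\nsecond\n</div>"

# Without DOTALL, . stops at the newline right after <div>, so no match.
without_flag = re.findall(r"<div>.*?</div>", multi_line)

# With DOTALL, . also matches newlines and each div is found separately.
with_flag = re.findall(r"<div>.*?</div>", multi_line, flags=re.DOTALL)

print(without_flag)  # []
print(with_flag)     # ['<div>\nfirst\n</div>', '<div>\nsecond\n</div>']
```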

Or use a parser such as BeautifulSoup instead of a regular expression for this task:

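from bs4 import BeautifulSoup

# html holds the page source as a string.
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
soup = BeautifulSoup(html, 'html.parser')
soup.find_all('div')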
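
If installing BeautifulSoup is not an option, the standard library's html.parser module can do a simpler version of the same job. This is only a sketch of that alternative, not what the answers above use: DivTextCollector is a hypothetical helper written for this example, and it collects just the text of each top-level div rather than the full markup.

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Hypothetical helper: collects the text inside each top-level <div>."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <div> elements currently open
        self.chunks = []  # text pieces of the div being read
        self.divs = []    # finished text of each top-level div

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # The outermost div just closed: store its text.
                self.divs.append("".join(self.chunks))
                self.chunks = []

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


collector = DivTextCollector()
collector.feed('<div class="a">outer <div>inner</div> tail</div><div>second</div>')
collector.close()
print(collector.divs)  # ['outer inner tail', 'second']
```

Because the parser tracks open and close tags, nested divs and attributes do not need special handling here.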

Answer 1
6 2015-04-26 04:31:55

Answer 2
4 2015-04-26 04:32:04
