Python re.findall() return value is an entire string
I am writing a crawler to extract certain parts of an HTML file, but I cannot figure out how to use re.findall().
Here is an example. When I want to find every ... part in the file, I might write something like this:
re.findall("<div>.*\</div>", result_page)
If result_page is the string "<div> </div> <div> </div>", the result will be
['<div> </div> <div> </div>']
I get only the entire string. This is not what I want; I expected the two divs as separate matches. What should I do?
Quoting the documentation:
The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.
Just add the question mark:
In [6]: re.findall("<div>.*?</div>", result_page)
Out[6]: ['<div> </div>', '<div> </div>']
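To see the difference side by side, here is a minimal self-contained sketch that runs both patterns on the string from the question. The variable names greedy and lazy are only illustrative.

```python
import re

result_page = "<div> </div> <div> </div>"

# Greedy: .* first consumes the rest of the string, then backtracks
# only as far as the LAST </div>, so everything ends up in one match.
greedy = re.findall(r"<div>.*</div>", result_page)

# Non-greedy: .*? expands one character at a time and stops at the
# FIRST </div> that lets the overall match succeed.
lazy = re.findall(r"<div>.*?</div>", result_page)

print(greedy)  # ['<div> </div> <div> </div>']
print(lazy)    # ['<div> </div>', '<div> </div>']
```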
Also, you shouldn't use regular expressions to parse HTML, since there are HTML parsers made exactly for that. Example using BeautifulSoup 4:
In [7]: import bs4
In [8]: [str(tag) for tag in bs4.BeautifulSoup(result_page, 'html.parser')('div')]
Out[8]: ['<div> </div>', '<div> </div>']
* is a greedy operator; you want to use *? for a non-greedy match.
re.findall("<div>.*?</div>", result_page)
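One caveat when applying this to a real page: by default '.' does not match a newline, so a div whose content spans several lines is skipped unless re.DOTALL is passed. Also, when the pattern contains a capturing group, re.findall() returns the group contents rather than the whole match. A minimal sketch, using a made-up two-div string (page) purely for illustration:

```python
import re

# Hypothetical input: the first div spans two lines.
page = "<div>first\nblock</div>\n<div>second</div>"

# Without re.DOTALL, '.' stops at the newline, so the first div never matches.
without_flag = re.findall(r"<div>.*?</div>", page)

# With re.DOTALL, '.' also matches '\n'. The parentheses make findall()
# return only the captured inner text of each div.
with_flag = re.findall(r"<div>(.*?)</div>", page, flags=re.DOTALL)

print(without_flag)  # ['<div>second</div>']
print(with_flag)     # ['first\nblock', 'second']
```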
Or use a parser such as BeautifulSoup instead of a regular expression for this task:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
soup.find_all('div')
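To illustrate why a parser is the safer choice, the sketch below feeds nested divs to both approaches. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs. The DivCollector class and the sample string are hypothetical, and the class collects only the text of each outermost div, not its markup.

```python
import re
from html.parser import HTMLParser


class DivCollector(HTMLParser):
    """Hypothetical helper: gathers the text of each outermost <div>."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <div> elements are currently open
        self.divs = []    # collected text, one entry per outermost div
        self._buf = []    # text pieces of the div being collected

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth == 0:
                self._buf = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.divs.append("".join(self._buf))

    def handle_data(self, data):
        if self.depth > 0:
            self._buf.append(data)


page = "<div>a<div>b</div>c</div><div>d</div>"

# The non-greedy regex stops at the first </div>, cutting the outer div short.
by_regex = re.findall(r"<div>.*?</div>", page)

# The parser tracks nesting depth, so the outer div is collected whole.
collector = DivCollector()
collector.feed(page)
collector.close()

print(by_regex)        # ['<div>a<div>b</div>', '<div>d</div>']
print(collector.divs)  # ['abc', 'd']
```

With BeautifulSoup, find_all('div') handles the nesting in the same spirit and also returns the inner div as a separate result.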