
Python re.findall() return value is an entire string

I am writing a crawler to extract certain parts of an HTML file, but I cannot figure out how to use re.findall().

Here is an example. When I want to find all ... parts in the file, I might write something like this:

re.findall("<div>.*\</div>", result_page)

If result_page is the string "<div> </div> <div> </div>", the result will be:

['<div> </div> <div> </div>']

That is just the entire string. This is not what I want; I expected the two divs separately. What should I do?

Quoting the documentation:

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.

Just add the question mark:

In [6]: re.findall("<div>.*?</div>", result_page)
Out[6]: ['<div> </div>', '<div> </div>']
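To see the difference side by side, here is a minimal sketch that runs both patterns on the string from the question. The `multiline_page` string is a made-up example (not from the question), included only to show that `.` does not match newlines unless `re.DOTALL` is passed, and that a capturing group makes `findall` return just the group's text.

```python
import re

result_page = "<div> </div> <div> </div>"

# Greedy: .* runs to the last </div> in the string, giving one big match.
greedy = re.findall(r"<div>.*</div>", result_page)

# Non-greedy: .*? stops at the first </div> it can reach, one match per div.
lazy = re.findall(r"<div>.*?</div>", result_page)

# Hypothetical input with a div that spans two lines.
multiline_page = "<div>a\nb</div><div>c</div>"

# With a capturing group, findall returns only the group's text.
# re.DOTALL lets . match newline characters as well.
inner = re.findall(r"<div>(.*?)</div>", multiline_page, flags=re.DOTALL)

print(greedy)  # ['<div> </div> <div> </div>']
print(lazy)    # ['<div> </div>', '<div> </div>']
print(inner)   # ['a\nb', 'c']
```

Using raw strings (the `r"..."` prefix) for patterns is a common habit that avoids surprises with backslashes.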

Also, you shouldn't use regular expressions to parse HTML, since there are HTML parsers made exactly for that. Example using BeautifulSoup 4:

In [7]: import bs4

In [8]: [str(tag) for tag in bs4.BeautifulSoup(result_page, "html.parser")('div')]
Out[8]: ['<div> </div>', '<div> </div>']

* is a greedy operator; you want to use *? for a non-greedy match.

re.findall("<div>.*?</div>", result_page)

Or use a parser such as BeautifulSoup instead of a regular expression for this task:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")  # html: the page source as a string
soup.find_all('div')
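One reason a parser is the safer choice is nesting: the non-greedy pattern stops at the first closing tag, so a div inside a div gets cut short. The sketch below illustrates this with the standard library's `html.parser` rather than BeautifulSoup, so it runs without installing anything. The `DivCollector` class and the `nested_page` string are hypothetical names made up for this illustration; it only collects the text of top-level divs and is not a full replacement for `find_all`.

```python
import re
from html.parser import HTMLParser


class DivCollector(HTMLParser):
    """Collect the text of each top-level <div> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # number of <div> elements currently open
        self.divs = []     # text of each completed top-level <div>
        self._buffer = []  # text pieces of the div being read

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # The outermost div just closed: store its text.
                self.divs.append("".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        # Keep text only while inside at least one div.
        if self.depth > 0:
            self._buffer.append(data)


# Hypothetical input: the first div contains another div.
nested_page = "<div>a<div>b</div>c</div><div>d</div>"

# The non-greedy regex ends the first match at the inner </div>.
regex_result = re.findall(r"<div>.*?</div>", nested_page)

collector = DivCollector()
collector.feed(nested_page)
collector.close()

print(regex_result)    # ['<div>a<div>b</div>', '<div>d</div>']
print(collector.divs)  # ['abc', 'd']
```

The parser tracks how many divs are open, which a single regular expression of this form cannot do.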
