How to combine all three patterns into one re.findall() call? (Python 2.7, regular expressions)
Filter1=re.findall(r'<span (.*?)</span>',PageSource)
Filter2=re.findall(r'<a href=.*title="(.*?)" >',PageSource)
Filter3=re.findall(r'<span class=.*?<b>(.*?)</b>.*?',PageSource)
How can I do this in a single line of code, like this:
Filter=re.findall(r' ',PageSource)
I tried it this way:
Filter=re.findall(r'<span (.*?)</span>'+
r'<a href=.*title="(.*?)" >'+
r'<span class=.*?<b>(.*?)</b>.*?',PageSource)
But it does not work.
How about using an HTML parser instead?
Example, using BeautifulSoup:
from bs4 import BeautifulSoup
data = "your HTML here"
soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly; recent bs4 versions warn otherwise
span_texts = [span.text for span in soup.find_all('span')]
a_titles = [a['title'] for a in soup.find_all('a', title=True)]
b_texts = [b.text for b in soup.select('span[class] > b')]
result = span_texts + a_titles + b_texts
Demo:
>>> from bs4 import BeautifulSoup
>>>
>>> data = """
... <div>
... <span>Span's text</span>
... <a title="A title">link</a>
... <span class="test"><b>B's text</b></span>
... </div>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>>
>>> span_texts = [span.text for span in soup.find_all('span')]
>>> a_titles = [a['title'] for a in soup.find_all('a', title=True)]
>>> b_texts = [b.text for b in soup.select('span[class] > b')]
>>>
>>> result = span_texts + a_titles + b_texts
>>> print result
[u"Span's text", u"B's text", 'A title', u"B's text"]
Aside from that, your regular expressions are quite different and serve different purposes. I would not try to squeeze the unsqueezable: keep them separate and combine the results into a single list.
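If you do stay with regular expressions, "keep them separate and combine the results" could look roughly like the sketch below. The `page_source` string is made-up sample markup standing in for your `PageSource`, and the names `separate` and `merged` are only illustrative. Note that gluing the patterns together with `+` builds one long pattern whose pieces must appear back to back in the text, which is why your attempt finds nothing. Joining them with `|` is shown only for comparison.

```python
import re

# Hypothetical sample markup standing in for PageSource from the question.
page_source = (
    '<span id="x">Span text</span>\n'
    '<a href="/link" title="A title" >link</a>\n'
    '<span class="test"><b>B text</b></span>\n'
)

# The three patterns from the question, kept separate.
patterns = [
    r'<span (.*?)</span>',
    r'<a href=.*title="(.*?)" >',
    r'<span class=.*?<b>(.*?)</b>',
]

# Run each pattern on its own and concatenate the matches into one list.
separate = []
for pattern in patterns:
    separate.extend(re.findall(pattern, page_source))

# For comparison: joining the patterns with "|" gives a single findall() call,
# but every match is a 3-tuple, with '' for the groups that did not take part.
merged = re.findall('|'.join(patterns), page_source)
```

On this sample, `separate` is a flat list of four strings, while `merged` is a list of three tuples. The `<b>` text never appears on its own in `merged`, because the first alternative already consumes the second `<span>` before the third alternative gets a chance. That overlap is one reason the patterns do not merge cleanly.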