HTML scraping using Python BeautifulSoup
I have the following HTML file and I am trying to scrape the complete sentence using BeautifulSoup, but I can't get it. Currently I am getting only the highlighted words. My desired output is:
Antenna booster has stopped sending signal files ,possible user network issue or BOOSTER issue.
Any solution?
</table>
<!--Record Header End-->
<span style="BACKGROUND-COLOR: #0000ff; color: #ffffff">
Antenna
</span>
<span style="BACKGROUND-COLOR: #0000ff; color: #ffffff">
booster
</span>
has stopped
<span style="BACKGROUND-COLOR: #0000ff; color: #ffffff">
sending
</span>
signal files ,possible user
<span style="BACKGROUND-COLOR: #0000ff; color: #ffffff">
network
</span>
<span style="BACKGROUND-COLOR: #ff0000">
issue
</span>
or BOOSTER
<span style="BACKGROUND-COLOR: #ff0000">
issue
</span>
.
<br>
<br>
<br>
Here is what I tried:
issue_field = soup.find_all('span', {'style':'BACKGROUND-COLOR: #0000ff; color: #ffffff'})
issue_str = str(issue_field)
Issue_corpora = [word.lower() for word in BeautifulSoup(issue_str,'html.parser').get_text().strip().split(',')]
print(Issue_corpora)
The problem is that some of the text sits outside the span elements. There is already a duplicate question on SO: Get text outside known element beautifulsoup
So here is a solution; it may need a little polishing. (Note that the variable t contains the HTML as text.)
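from bs4 import BeautifulSoup as bs
from bs4 import NavigableString

soup = bs(t, 'html.parser')  # t holds the HTML as a string
text = ''
for span in soup.find_all('span'):
    # text inside the highlighted span
    text += span.get_text().strip() + ' '
    # text node directly after the span, if there is one
    sibling = span.next_sibling
    if isinstance(sibling, NavigableString):
        text += sibling.strip() + ' '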
The result using this approach is:
>>> In : text
>>> Out: u'Antenna booster has stopped sending signal files ,possible user network issue or BOOSTER issue . '
You can replace doubled spaces with a single space, remove the space before the comma, or define rules to handle this while looping through the span elements.
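The cleanup mentioned above could be sketched as follows. This is a minimal example, not part of the original answer: the helper name `tidy` is hypothetical, and the last rule (adding a space after a comma) goes a little beyond the asker's desired output, so drop it if the text should stay as close to the source as possible.

```python
import re

def tidy(text):
    """Collapse whitespace and remove spaces before punctuation (hypothetical helper)."""
    # Collapse any run of whitespace (spaces, newlines) into one space.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove whitespace that precedes common punctuation marks.
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    # Optional: put a space after a comma glued to the next word.
    # Note this would also change numbers such as "1,000".
    text = re.sub(r",(?=\S)", ", ", text)
    return text

raw = "Antenna  booster has stopped sending signal files ,possible user network  issue or BOOSTER issue . "
print(tidy(raw))
# Antenna booster has stopped sending signal files, possible user network issue or BOOSTER issue.
```

Doing the cleanup once on the final string keeps the loop over the span elements simple; the alternative is to apply such rules piece by piece while concatenating.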
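from bs4 import BeautifulSoup
import re

example = """your example"""
soup = BeautifulSoup(example, "html.parser")
_text = ""
for span in soup.find_all('span', style=re.compile('BACKGROUND-COLOR:')):
    # the text node after the span, if any; newlines become spaces
    sibling = span.next_sibling
    tail = sibling.replace("\n", " ") if isinstance(sibling, str) else ""
    _text += "%s %s " % (span.get_text(strip=True), tail)
# collapse runs of spaces into a single space
print(re.sub(" +", " ", _text).strip())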
Use re at the end to trim the extra spaces.
Outputs:
Antenna booster has stopped sending signal files ,possible user network issue or BOOSTER issue .
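Both answers pair each span with the text node that follows it. A different approach, not taken from either answer, is to walk every sibling node after the `Record Header End` comment and stop at the first `<br>`. The sketch below assumes the sentence always sits between that comment and a `<br>`, as in the sample; the function name `sentence_after_marker` is hypothetical.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Trimmed copy of the sample markup from the question.
html = """<!--Record Header End-->
<span style="BACKGROUND-COLOR: #0000ff; color: #ffffff">
Antenna
</span>
<span style="BACKGROUND-COLOR: #0000ff; color: #ffffff">
booster
</span>
has stopped
<span style="BACKGROUND-COLOR: #0000ff; color: #ffffff">
sending
</span>
signal files ,possible user
<span style="BACKGROUND-COLOR: #0000ff; color: #ffffff">
network
</span>
<span style="BACKGROUND-COLOR: #ff0000">
issue
</span>
or BOOSTER
<span style="BACKGROUND-COLOR: #ff0000">
issue
</span>
.
<br>
<br>
<br>"""

def sentence_after_marker(markup, marker="Record Header End"):
    """Join all text between the marker comment and the next <br>."""
    soup = BeautifulSoup(markup, "html.parser")
    # Locate the comment node that precedes the sentence.
    start = soup.find(string=lambda s: isinstance(s, Comment) and marker in s)
    if start is None:
        return ""
    parts = []
    for node in start.next_siblings:
        if isinstance(node, Tag) and node.name == "br":
            break  # assumed end of the sentence
        if isinstance(node, Tag):
            parts.append(node.get_text())  # text inside a span
        elif isinstance(node, NavigableString) and not isinstance(node, Comment):
            parts.append(str(node))  # text between spans
    # Normalize all whitespace to single spaces.
    return " ".join(" ".join(parts).split())

print(sentence_after_marker(html))
```

Because every node is visited in document order, text that follows a plain text node or sits before the first span is picked up too, which the span-plus-next-sibling loop would miss.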