HTML scraping using Python BeautifulSoup

Question

I have the following HTML file and I am trying to scrape the complete sentence using BeautifulSoup, but I can't get it. Currently I only get the highlighted words. My desired output is:

Antenna booster has stopped sending signal files ,possible user network issue or BOOSTER issue.

Any solution?

  </table>
  <!--Record Header End-->
  <span style="BACKGROUND-COLOR: #0000ff; color: #ffffff">
   Antenna
  </span>
  <span style="BACKGROUND-COLOR: #0000ff; color: #ffffff">
   booster
  </span>
  has stopped
  <span style="BACKGROUND-COLOR: #0000ff; color: #ffffff">
   sending
  </span>
  signal files ,possible user
  <span style="BACKGROUND-COLOR: #0000ff; color: #ffffff">
   network
  </span>
  <span style="BACKGROUND-COLOR: #ff0000">
   issue
  </span>
  or BOOSTER
  <span style="BACKGROUND-COLOR: #ff0000">
   issue
  </span>
  .
  <br>
   <br>
    <br>

Here is what I tried:

issue_field = soup.find_all('span', {'style': 'BACKGROUND-COLOR: #0000ff; color: #ffffff'})
issue_str = str(issue_field)
Issue_corpora = [word.lower() for word in BeautifulSoup(issue_str, 'html.parser').get_text().strip().split(',')]
print(Issue_corpora)

Answer 1

The problem is that part of the sentence is text that sits outside the span elements. There is already a duplicate question on SO: "Get text outside known element beautifulsoup".
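To make the point concrete, here is a minimal sketch of where the missing words live. The markup is a shortened, hypothetical version of the snippet in the question, wrapped in a `div` that is not in the original so that the children can be listed.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical, shortened version of the markup from the question.
# The wrapping <div> is an assumption added for this illustration.
html = (
    '<div>'
    '<span style="BACKGROUND-COLOR: #0000ff; color: #ffffff"> booster </span>'
    ' has stopped '
    '<span style="BACKGROUND-COLOR: #0000ff; color: #ffffff"> sending </span>'
    ' signal files'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")

# find_all('span') returns only the tags, so the loose text is missed.
inside = [span.get_text(strip=True) for span in soup.find_all("span")]

# The loose text is stored as NavigableString nodes next to the spans.
outside = [
    str(node).strip()
    for node in soup.div.children
    if isinstance(node, NavigableString)
]

print(inside)   # words inside the spans
print(outside)  # words between the spans
```

Searching for spans alone can therefore never return "has stopped" or "signal files"; those nodes have to be reached through the siblings or the parent of each span.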

So here is a solution; it may need a little polishing. (Note that the variable t contains the HTML as text.)

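from bs4 import BeautifulSoup as bs

# Name the parser explicitly so bs4 does not have to guess one.
soup = bs(t, 'html.parser')
text = ''
for span in soup.find_all('span'):
    # Text inside the highlighted span.
    text += span.get_text().strip() + ' '
    # Text node that follows the span (empty string if there is none).
    text += (span.next_sibling or '').strip() + ' '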

The result of this approach is:

>>> In : text
>>> Out: u'Antenna  booster has stopped sending signal files ,possible user network  issue or BOOSTER issue . '

You can replace doubled spaces with a single space, remove the space before the comma, or define rules to handle this while looping through the span elements.
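A minimal sketch of such a cleanup step, using only the standard `re` module. The helper name `tidy` and the exact rules are assumptions for illustration: collapse whitespace, drop the space before a comma or period, and add a space after a comma. The last rule is a matter of taste and differs from the desired output in the question, which keeps " ,".

```python
import re

def tidy(text):
    """Normalize spacing in the joined sentence (hypothetical helper)."""
    # Collapse every run of whitespace into one space and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove the space in front of a comma or period.
    text = re.sub(r" ([,.])", r"\1", text)
    # Make sure a comma is followed by a space.
    return re.sub(r",(?=\S)", ", ", text)

raw = ("Antenna  booster has stopped sending signal files ,possible user "
       "network  issue or BOOSTER issue . ")
print(tidy(raw))
# Antenna booster has stopped sending signal files, possible user network issue or BOOSTER issue.
```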

Answer 2

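from bs4 import BeautifulSoup
import re

example = """your example"""  # placeholder: paste the HTML from the question here

soup = BeautifulSoup(example, "html.parser")

_text = ""
# Match every highlighted span, whatever its colour.
for span in soup.find_all('span', style=re.compile('BACKGROUND-COLOR:')):
    # Span text, then the text node that follows it with newlines removed.
    _text += "%s %s" % (span.get_text(strip=True), span.next_sibling.replace("\n", ""))

# Collapse runs of spaces into a single space.
print(re.sub(" +", " ", _text))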

Use re at the end to trim the extra spaces.

Output:

Antenna booster has stopped sending signal files ,possible user network issue or BOOSTER issue .
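Both loops above rely on every span being followed by a text node. In other markup `next_sibling` can be `None` (the span is the last node) or another tag, and calling string methods on it then fails. The following is a rough sketch of a more defensive variant, not the answerers' code; the function name `highlighted_sentence` and the sample markup are assumptions.

```python
import re
from bs4 import BeautifulSoup, NavigableString

def highlighted_sentence(html):
    """Join each highlighted span with the loose text that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    parts = []
    for span in soup.find_all("span", style=re.compile("BACKGROUND-COLOR:")):
        parts.append(span.get_text(strip=True))
        sibling = span.next_sibling
        # next_sibling may be None or another tag; keep it only if it is text.
        if isinstance(sibling, NavigableString):
            parts.append(str(sibling))
    # split() without arguments drops every run of whitespace, newlines included.
    return " ".join(" ".join(parts).split())

# Hypothetical sample: two adjacent spans, and a span with nothing after it.
sample = (
    '<span style="BACKGROUND-COLOR: #0000ff; color: #ffffff"> Antenna </span>'
    '<span style="BACKGROUND-COLOR: #0000ff; color: #ffffff"> booster </span>'
    ' has stopped '
    '<span style="BACKGROUND-COLOR: #ff0000"> issue </span>'
)
print(highlighted_sentence(sample))
```

Splitting and rejoining on whitespace replaces the regular expression used for the final cleanup, so tabs and newlines are handled along with plain spaces.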


Answer 1
0 2017-03-28 10:40:34

Answer 2
0 2017-03-28 10:49:42
