Python: Extracting separated text from HTML using BeautifulSoup

Question

I have the following HTML repeated several times on a page (please do not judge):

 <div class="container">
    <div class="image">
      <a href="#" title="#" class="#">
        <img src="img.jpg" alt="#" class="#">
      </a>
    </div>
    <div class="text">
        <a href="#">
          <h4 class="h4-class">{TITLE}</h4>
        {SOME TEXT 1}<br />
        <h5><img src="img.jpg" alt="#" /> {SOME TEXT 2}</h5>
        {SOME TEXT 3}      </a>
    </div>
  </div>

I would like to extract {TITLE}, {SOME TEXT 1}, {SOME TEXT 2} and {SOME TEXT 3} as separate values.

My code is as follows:

from BeautifulSoup import BeautifulSoup as bs
import urllib2
html = urllib2.urlopen('text')
soup = bs(html)
divs = soup.findAll("div", { "class" : "text" })

for div in divs:
    inner_text = div.text
    strings = inner_text.split("\n")
    print strings[0] ## I want this to print just {TITLE}

When printed, all the values come out joined on a single line, e.g.:

{TITLE}{SOME TEXT 1}{SOME TEXT 2}{SOME TEXT 3}

Is there any way around this? What have I missed?

Answer 1

You can prettify the div content first (see the BeautifulSoup documentation for prettify) and then pick out each line as needed. This works only if every div with the class name text has the same structure.

Code (Python 2):

from BeautifulSoup import BeautifulSoup as bs

html = '''
<div class="container">
    <div class="image">
      <a href="#" title="#" class="#">
        <img src="img.jpg" alt="#" class="#">
      </a>
    </div>
    <div class="text">
        <a href="#">
          <h4 class="h4-class">{TITLE}</h4>
        {SOME TEXT 1}<br />
        <h5><img src="img.jpg" alt="#" /> {SOME TEXT 2}</h5>
        {SOME TEXT 3}      </a>
    </div>
  </div>
'''
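soup = bs(html)
divs = soup.findAll("div",{"class":"text"})
for div in divs:
    # prettify() puts each tag and each text node on its own line
    pretty_div = div.prettify()
    content_list = pretty_div.split("\n")
    content_list = [s.strip() for s in content_list]
    # Fixed line positions: valid only while every div keeps this exact structure
    print content_list[3]   # {TITLE}
    print content_list[5]   # {SOME TEXT 1}
    print content_list[9]   # {SOME TEXT 2}
    print content_list[11]  # {SOME TEXT 3}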

Output:

{TITLE}
{SOME TEXT 1}
{SOME TEXT 2}
{SOME TEXT 3}
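
The fixed line indices above depend on the exact layout that prettify produces, so they shift as soon as a div gains or loses a tag. As a hedged alternative, here is a minimal sketch that collects the text nodes directly. It assumes Python 3 and the bs4 package (BeautifulSoup 4) rather than the Python 2 / BeautifulSoup 3 setup used in this thread, and extract_text_parts is a hypothetical helper name chosen for illustration.

```python
# Assumes Python 3 with the bs4 package (BeautifulSoup 4) installed;
# the answer above uses Python 2 and BeautifulSoup 3 instead.
from bs4 import BeautifulSoup

HTML = '''
<div class="container">
    <div class="text">
        <a href="#">
          <h4 class="h4-class">{TITLE}</h4>
        {SOME TEXT 1}<br />
        <h5><img src="img.jpg" alt="#" /> {SOME TEXT 2}</h5>
        {SOME TEXT 3}      </a>
    </div>
</div>
'''


def extract_text_parts(html):
    """Return one list of stripped text fragments per div with class "text".

    Hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for div in soup.find_all("div", class_="text"):
        # stripped_strings yields each text node with surrounding whitespace
        # removed and skips nodes that contain only whitespace.
        results.append(list(div.stripped_strings))
    return results


for parts in extract_text_parts(HTML):
    for part in parts:
        print(part)
```

Each inner list holds the text fragments of one div in document order, so parts[0] is the title only as long as the h4 comes first in the div.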

Answer 1: accepted, 2017-02-22 12:04:55
