Python: Extract Separated Text from HTML with BeautifulSoup

I have the following HTML repeated several times on a page (please do not judge):

 <div class="container">
    <div class="image">
      <a href="#" title="#" class="#">
        <img src="img.jpg" alt="#" class="#">
      </a>
    </div>
    <div class="text">
        <a href="#">
          <h4 class="h4-class">{TITLE}</h4>
        {SOME TEXT 1}<br />
        <h5><img src="img.jpg" alt="#" /> {SOME TEXT 2}</h5>
        {SOME TEXT 3}      </a>
    </div>
  </div>

I would like to extract {TITLE}, {SOME TEXT 1}, {SOME TEXT 2} and {SOME TEXT 3} as separate values.

My code is as follows:

from BeautifulSoup import BeautifulSoup as bs
import urllib2
html = urllib2.urlopen('text')
soup = bs(html)
divs = soup.findAll("div", { "class" : "text" })

for div in divs:
    inner_text = div.text
    strings = inner_text.split("\n")
    print strings[0] ## I want this to print just {TITLE}

When I print it, I get a single line with all the values joined together, e.g.:

{TITLE}{SOME TEXT 1}{SOME TEXT 2}{SOME TEXT 3}

Is there any way around this? What have I missed?

You can prettify the div content first (see the BeautifulSoup documentation for prettify) and then manipulate each line as needed. This only works if every div with the class name text has the same structure, because the values are picked out by line position.

Code (Python 2):

from BeautifulSoup import BeautifulSoup as bs

html = '''
<div class="container">
    <div class="image">
      <a href="#" title="#" class="#">
        <img src="img.jpg" alt="#" class="#">
      </a>
    </div>
    <div class="text">
        <a href="#">
          <h4 class="h4-class">{TITLE}</h4>
        {SOME TEXT 1}<br />
        <h5><img src="img.jpg" alt="#" /> {SOME TEXT 2}</h5>
        {SOME TEXT 3}      </a>
    </div>
  </div>
'''
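soup = bs(html)
divs = soup.findAll("div", {"class": "text"})
for div in divs:
    # prettify() re-indents the markup so tags and text land on separate lines
    pretty_div = div.prettify()
    content_list = pretty_div.split("\n")
    content_list = [s.strip() for s in content_list]
    # Fixed line positions: valid only while every div keeps this exact structure
    print content_list[3]
    print content_list[5]
    print content_list[9]
    print content_list[11]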

Output:

{TITLE}
{SOME TEXT 1}
{SOME TEXT 2}
{SOME TEXT 3}
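
As an alternative to indexing into the prettified lines, here is a minimal sketch that reads the text nodes directly. It assumes Python 3 and the bs4 package (Beautiful Soup 4), neither of which is used in the question or the answer above. The helper name extract_texts and the trimmed SAMPLE_HTML are made up for illustration. It relies on stripped_strings, which yields each non-empty text node with surrounding whitespace removed, so it does not depend on which line a value falls on.

```python
from bs4 import BeautifulSoup  # assumption: bs4 is installed (pip install beautifulsoup4)

# Trimmed copy of the markup from the question (image div left out for brevity).
SAMPLE_HTML = """
<div class="container">
    <div class="text">
        <a href="#">
          <h4 class="h4-class">{TITLE}</h4>
        {SOME TEXT 1}<br />
        <h5><img src="img.jpg" alt="#" /> {SOME TEXT 2}</h5>
        {SOME TEXT 3}      </a>
    </div>
</div>
"""


def extract_texts(html):
    """Hypothetical helper: one list of stripped strings per <div class="text">."""
    soup = BeautifulSoup(html, "html.parser")
    # stripped_strings skips whitespace-only nodes and strips the rest.
    return [list(div.stripped_strings) for div in soup.find_all("div", class_="text")]


if __name__ == "__main__":
    for parts in extract_texts(SAMPLE_HTML):
        for part in parts:
            print(part)

    # Closer to the original attempt: join the text nodes with a newline,
    # so that splitting on "\n" gives one value per entry.
    soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
    div = soup.find("div", class_="text")
    strings = div.get_text("\n", strip=True).split("\n")
    print(strings[0])
```

If the number of text nodes can differ between divs, it is probably safer to look up specific tags (for example div.find("h4")) than to rely on the position in the list.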
