
Using Beautiful Soup to get data from non-class section

I am still a novice, learning Python and Beautiful Soup. I have gotten hung up on how to get text from a piece of HTML that has no class attribute.

This is the snippet of HTML I'm working with:

<section class="userbody">
    <script type="text/javascript"></script>
    <figure class="iw">
        <div id="ci">
            <img id="iwi" title="image 2" alt="" src="http://images.craigslist.org/00C0C_daJm4U9yU5B_600x450.jpg" style="min-width: inherit; min-height: 450px;"></img>
        </div>
        <div id="thumbs"></div>
    </figure>
    <div class="mapAndAttrs">
        <div class="mapbox">
            <div id="map" class="leaflet-container leaflet-fade-anim" data-longitude="-84.072447" data-latitude="33.908534" tabindex="0">
                <div class="leaflet-map-pane" style="transform: translate(0px, 0px);"></div>
                <div class="leaflet-control-container">
                    <div class="leaflet-top leaflet-left"></div>
                    <div class="leaflet-top leaflet-right"></div>
                    <div class="leaflet-bottom leaflet-left"></div>
                    <div class="leaflet-bottom leaflet-right">
                        <div class="leaflet-control-attribution leaflet-control"></div>
                    </div>
                </div>
            </div>
            <div class="mapaddress">

                Some Address

            </div>
        </div>
        <div class="attributes"></div>
    </div>
    <section id="postingbody">
            some posting info
            <br></br>
             more posting info
             <br></br>
    </section>
    <section class="cltags"></section>
    <div class="postinginfos"></div>
</section>

I have been able to pull the address information:

     for address in soup.findAll("div", { "class" : "mapaddress" }):
       addressText = ''.join(address.findAll(text=True))

It appears findAll() doesn't work for tags that don't have a class, which is what I tried here:

     for post in soup.findall("section", { "id" : "postingbody" }):
       postText = ''.join(post.findAll(text=True))

How would I grab the text in section id="postingbody"?

You can do the following, where s is the HTML string:

from bs4 import BeautifulSoup

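# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(s, "html.parser")
# find() matches on any attribute, not only class
print(soup.find(attrs={'id': 'postingbody'}))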

Output:

<section id="postingbody">
            some posting info
            <br/>
             more posting info
             <br/>
</section>
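
The attempt in the question most likely failed because of the method name, not the missing class: the second loop calls soup.findall (all lowercase), and Beautiful Soup treats an unknown attribute name as a tag lookup, which returns None when no such tag exists. A minimal sketch, assuming bs4 and a shortened copy of the question's HTML (the variable names here are illustrative, not from the original post):

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's HTML; only the parts needed here.
s = """
<section class="userbody">
    <div class="mapaddress">Some Address</div>
    <section id="postingbody">
        some posting info
        <br></br>
        more posting info
        <br></br>
    </section>
</section>
"""

soup = BeautifulSoup(s, "html.parser")

# An unknown attribute name is treated as a tag lookup, so this is the
# same as soup.find("findall"). There is no <findall> tag, so it is None,
# and calling it like a method would raise a TypeError.
not_a_method = soup.findall
print(not_a_method)

# find_all (or its older alias findAll) filters on id just as it does on class.
posts = soup.find_all("section", {"id": "postingbody"})
for post in posts:
    post_text = "".join(post.find_all(string=True))
    print(post_text.strip())
```

With the method name spelled correctly, the same loop pattern used for the address also works for the id lookup.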

In addition to Games Brainiac's answer: to get the text, just put .text after it.

So:

print(soup.find(attrs={'id': 'postingbody'}).text)
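
Note that .text keeps the whitespace and newlines from the source HTML. If that is not wanted, get_text() accepts a separator and a strip flag. A small sketch, assuming bs4 and a trimmed version of the question's section (the sample string is an illustration):

```python
from bs4 import BeautifulSoup

# Trimmed version of the section from the question.
s = """
<section id="postingbody">
        some posting info
        <br></br>
        more posting info
        <br></br>
</section>
"""

soup = BeautifulSoup(s, "html.parser")
section = soup.find(attrs={"id": "postingbody"})

# .text returns every text node joined as-is, indentation included.
raw_text = section.text

# get_text() with strip=True trims each text node and skips empty ones;
# the first argument is the separator placed between the remaining pieces.
clean_text = section.get_text(" ", strip=True)

print(repr(raw_text))
print(repr(clean_text))
```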

If you are using BeautifulSoup4, you can do this:

element = soup.find(id="postingbody")
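
As a rough sketch of how this form compares with a CSS selector, and of what happens when the id is absent: the example below assumes bs4 with its selector support available, and the one-line HTML string and the id "no-such-id" are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, just enough to show the lookups.
s = '<div><section id="postingbody">some posting info</section></div>'
soup = BeautifulSoup(s, "html.parser")

# Keyword form from the answer above.
by_keyword = soup.find(id="postingbody")

# Equivalent lookup written as a CSS id selector.
by_selector = soup.select_one("#postingbody")
print(by_keyword == by_selector)

# find() returns None when nothing matches, so guard before reading text.
missing = soup.find(id="no-such-id")
missing_text = missing.get_text() if missing is not None else ""
print(repr(missing_text))
```

Checking for None first avoids an AttributeError on pages where the section is not present.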
