
Beautiful Soup & Python, Nested Elements

I am attempting to scrape nested elements with BeautifulSoup and have been pulling my hair out for a couple of days now. I am very much a novice, so I hope the simplicity of this question does not offend anyone. Any help at all would be greatly appreciated.

Here is the HTML I'm attempting to scrape.

        <div id="specs" class="pane">
           <div class="col">
              <ul class="list">
                 <li>
                    <ul>
                       <li><b>width</b>2</li>
                       <li><b>length</b>1</li>
                       <li><b>color</b>blue</li>
                       <li><b>metal</b>steel</li>
                    </ul>
                 </li>
              </ul>
           </div>
        </div>

And in a perfect world, here is my result:

width, 2
length, 1
color, blue
metal, steel

While I've come close, I now know this can't be the answer. At the same time, I can't seem to loop through the li elements.

div = div.find("div", {"id":"specifications"})
result = [i for i in div.find('li')]

If anyone can push a beginner in the right direction, it would be greatly appreciated. Thank you in advance for any insight!

You can use a CSS selector via select() to find the target b elements, for example:

from bs4 import BeautifulSoup
raw = '''<div id="specs" class="pane">
           <div class="col">
              <ul class="list">
                 <li>
                    <ul>
                       <li><b>width</b>2</li>
                       <li><b>length</b>1</li>
                       <li><b>color</b>blue</li>
                       <li><b>metal</b>steel</li>
                    </ul>
                 </li>
              </ul>
           </div>
        </div>'''
soup = BeautifulSoup(raw, "lxml")

result = soup.select("div#specs b")
for r in result:
    # r is a <b> tag; its next sibling is the text node holding the value
    print(r.get_text(), r.next_sibling)

Output:

width 2
length 1
color blue
metal steel
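As a minimal sketch of how the attempt in the question could be adapted: find() returns only the first match, whereas find_all() and select() return every match, and the question's snippet looks up the id "specifications" while the HTML shown uses "specs". The helper name extract_specs below is hypothetical, and the built-in "html.parser" is used here only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

raw = """<div id="specs" class="pane">
  <div class="col">
    <ul class="list">
      <li>
        <ul>
          <li><b>width</b>2</li>
          <li><b>length</b>1</li>
          <li><b>color</b>blue</li>
          <li><b>metal</b>steel</li>
        </ul>
      </li>
    </ul>
  </div>
</div>"""

soup = BeautifulSoup(raw, "html.parser")


def extract_specs(soup):
    """Return (label, value) pairs from the <li> items under div#specs."""
    pairs = []
    for li in soup.select("div#specs li"):
        # Only look at direct children, so the outer wrapper <li> is skipped.
        label = li.find("b", recursive=False)
        if label is None:
            continue
        # The value is the text node that follows the <b> tag.
        value = label.next_sibling or ""
        pairs.append((label.get_text(strip=True), str(value).strip()))
    return pairs


for label, value in extract_specs(soup):
    print(f"{label}, {value}")
```

This prints each pair in the "label, value" form asked for in the question.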

The following is a pure lxml.html alternative for comparison (the OP seems interested in lxml, judging from the comment below). The output is exactly the same as that of the BeautifulSoup snippet above.

from lxml import html
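raw = '''assume the same HTML as in the previous snippet'''
root = html.fromstring(raw)

# cssselect() requires the separate cssselect package
result = root.cssselect("div#specs b")
for b in result:
    # b.text is the label; b.tail is the text right after the closing </b>
    print(b.text, b.tail)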

lxml supports both XPath (via xpath()) and CSS selectors (via cssselect()), and lxml is fast.
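For comparison, here is a rough sketch of the XPath route mentioned above, assuming the same HTML as in the earlier snippets. The XPath expression and the variable name pairs are illustrative choices, not part of the original answer.

```python
from lxml import html

raw = """<div id="specs" class="pane">
  <div class="col">
    <ul class="list">
      <li>
        <ul>
          <li><b>width</b>2</li>
          <li><b>length</b>1</li>
          <li><b>color</b>blue</li>
          <li><b>metal</b>steel</li>
        </ul>
      </li>
    </ul>
  </div>
</div>"""

root = html.fromstring(raw)

# Select every <b> that is a descendant of the div with id="specs".
pairs = [(b.text, b.tail) for b in root.xpath('//div[@id="specs"]//b')]

for label, value in pairs:
    print(label, value)
```

Unlike cssselect(), xpath() is built into lxml and needs no extra package.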
