Python and BeautifulSoup: strange behavior of findAll

Question

soup        = BeautifulSoup(html)
boxes       = soup.findAll("div", { "class" : re.compile(r'\bmixDesc\b') })

I think this gives me only the boxes with class 'mixDesc'.

So I'm debugging to be sure:

count = 0
for box in boxes :
    count = count + 1

    print "JWORG box {0}".format(count)
    print "JWORG box len {0}".format(len(box))
    print box

There are only 10 divs with the mixDesc class in the parsed HTML file.

But I got 30 boxes, and many of them (20 out of 30) are printed as

[]

Can you explain why this happens? Why does findAll grab these empty tags? Or what other mistake have I made?

EDIT 1:

I'm using this to write an xbmc plugin, so I'm using the only version available to me.

EDIT 2:

I cannot copy/paste all of the HTML, but I'm scraping this page: http://www.jw.org/it/video/?start=70

So you can look at the HTML source to help me.

EDIT 3: this is my xbmc log; please note that I also printed a counter and len(box)

20:27:54 T:5356  NOTICE: JWORG box 1
20:27:54 T:5356  NOTICE: JWORG box len 5
20:27:54 T:5356  NOTICE: [<div class="syn-img sqr mixDesc">
                                            <a href="/it/cosa-dice-la-Bibbia/famiglia/bambini/diventa-amico-di-geova/cantici/120-felice-chi-mette-in-pratica-ci%C3%B2-che-ode/" class="jsDownload jsVideoModal jsCoverDoc" data-jsonurl="/apps/TRGCHlZRQVNYVrXF?output=json&amp;pub=pksn&amp;fileformat=mp4&amp;alllangs=1&amp;track=120&amp;langwritten=I&amp;txtCMSLang=I" data-coverurl="/it/cosa-dice-la-Bibbia/famiglia/bambini/diventa-amico-di-geova/cantici/120-felice-chi-mette-in-pratica-ci%C3%B2-che-ode/" data-onpagetitle="Cantico 120: Felice chi mette in pratica ciÃ² cheÂ ode" title="Play o download | Cantico 120: Felice chi mette in pratica ciÃ² cheÂ ode" data-mid="1102013357">
                                            <span class="jsRespImg" data-img-type="sqr" data-img-size-lg="http://assets.jw.org/assets/m/ijw13pk/1102013357/ijw13pk_id-1102013357.art/1102013357_univ_sqr_lg.jpg" data-img-size-md="http://assets.jw.org/assets/m/ijw13pk/1102013357/ijw13pk_id-1102013357.art/1102013357_univ_sqr_md.jpg" data-img-size-sm="http://assets.jw.org/assets/m/ijw13pk/1102013357/ijw13pk_id-1102013357.art/1102013357_univ_sqr_sm.jpg" data-img-size-xs="http://assets.jw.org/assets/m/ijw13pk/1102013357/ijw13pk_id-1102013357.art/1102013357_univ_sqr_xs.jpg"></span></a><noscript><img src="http://assets.jw.org/assets/m/ijw13pk/1102013357/ijw13pk_id-1102013357_I/1102013357_univ_sqr_xs.jpg" alt="" /></noscript>

                                            <div style="display:none;" class="jsVideoPoster mid1102013357" data-src="http://assets.jw.org/assets/m/ijw13pk/1102013357/ijw13pk_id-1102013357.art/1102013357_univ_lsr_lg.jpg" data-alt=""></div>
                                            </div>]
20:27:54 T:5356  NOTICE: JWORG box 2
20:27:54 T:5356  NOTICE: JWORG box len 7
20:27:54 T:5356  NOTICE: []
20:27:54 T:5356  NOTICE: JWORG box 3
20:27:54 T:5356  NOTICE: JWORG box len 7
20:27:54 T:5356  NOTICE: []

EDIT 4:

OK, there are 30 divs because they're nested, but why are they empty, and how can I filter them out?

Answer 1

The problem is that findAll() performs a recursive search by default, and since there are nested divs that also carry the mixDesc class, you get those results as well.

Pass recursive=False to findAll and search for divs inside the parent div with id=videosIndexList.
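To make the difference concrete, here is a minimal sketch using BeautifulSoup 4's find_all() on a small piece of made-up markup. The HTML below is hypothetical and much simpler than the real page; only the id videosIndexList and the class mixDesc come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not the real jw.org page: two top-level boxes,
# the first containing a nested div that also carries the mixDesc class.
HTML = """
<div id="videosIndexList">
  <div class="syn-img mixDesc"><div class="mixDesc inner">nested</div></div>
  <div class="syn-img mixDesc">second</div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
parent = soup.find("div", id="videosIndexList")

# Default (recursive=True): all descendants are searched, nested matches included.
all_matches = parent.find_all("div", class_="mixDesc")

# recursive=False: only direct children of `parent` are considered.
top_level = parent.find_all("div", class_="mixDesc", recursive=False)

print(len(all_matches))  # 3
print(len(top_level))    # 2
```

With the default recursive search the nested div is returned in addition to its parent, which is how the result count can exceed the number of boxes you expect.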

Also, BeautifulSoup3 is no longer maintained; switch to BeautifulSoup4 and use find_all().

Here's what the code should look like (using BeautifulSoup4):

import re
from urllib2 import urlopen
from bs4 import BeautifulSoup


soup = BeautifulSoup(urlopen('http://www.jw.org/it/video/?start=70'))

div = soup.find('div', {'id': 'videosIndexList'})
boxes = div.find_all("div", { "class" : re.compile(r'\bmixDesc\b')}, recursive=False)

for box in boxes:
    print box.text

This will get you only the top-level divs (10 boxes).
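A side note on the pattern itself, offered as something to check rather than a diagnosis: \b matches at any boundary between a word character and a non-word character, so \bmixDesc\b also matches class values where mixDesc is followed by a hyphen. The class strings below are hypothetical; the source does not show which classes the nested divs actually have.

```python
import re

pattern = re.compile(r'\bmixDesc\b')

# Hypothetical class attribute values, not taken from the real page.
samples = ["syn-img sqr mixDesc", "mixDesc-caption", "mixDescription"]

for value in samples:
    # "mixDesc-caption" matches too, because "-" is a non-word character.
    print(value, bool(pattern.search(value)))


def has_class(value, name="mixDesc"):
    """Stricter check: compare against whole whitespace-separated class names."""
    return name in value.split()


print([has_class(v) for v in samples])
```

If the extra matches turn out to come from hyphenated class names, comparing whole class names as in has_class() would exclude them.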

Answer 1 (accepted, score 2), 2014-03-27 19:45:09
