import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

soup = BeautifulSoup(html)
boxes = soup.findAll("div", { "class" : re.compile(r'\bmixDesc\b') })
I think this should give me only the boxes with class 'mixDesc',
so I'm debugging to be sure:
count = 0
for box in boxes:
    count = count + 1
    print "JWORG box {0}".format(count)
    print "JWORG box len {0}".format(len(box))
    print box
There are only 10 divs with the mixDesc class in the parsed HTML file,
but I get 30 boxes, and most of them (20 out of 30) are printed as
[]
Can you explain why this happens? Why does findAll grab these empty tags? Or what other mistake have I made?
EDIT 1:
I'm using this to write an XBMC plugin, so I'm using the only version available to me.
EDIT 2:
I cannot copy/paste all of the HTML, but I'm scraping this page: http://www.jw.org/it/video/?start=70
You can view the HTML source there to help me.
EDIT 3: this is my XBMC log; please note that I also printed a counter and len(box)
20:27:54 T:5356 NOTICE: JWORG box 1
20:27:54 T:5356 NOTICE: JWORG box len 5
20:27:54 T:5356 NOTICE: [<div class="syn-img sqr mixDesc">
<a href="/it/cosa-dice-la-Bibbia/famiglia/bambini/diventa-amico-di-geova/cantici/120-felice-chi-mette-in-pratica-ci%C3%B2-che-ode/" class="jsDownload jsVideoModal jsCoverDoc" data-jsonurl="/apps/TRGCHlZRQVNYVrXF?output=json&pub=pksn&fileformat=mp4&alllangs=1&track=120&langwritten=I&txtCMSLang=I" data-coverurl="/it/cosa-dice-la-Bibbia/famiglia/bambini/diventa-amico-di-geova/cantici/120-felice-chi-mette-in-pratica-ci%C3%B2-che-ode/" data-onpagetitle="Cantico 120: Felice chi mette in pratica ciò che ode" title="Play o download | Cantico 120: Felice chi mette in pratica ciò che ode" data-mid="1102013357">
<span class="jsRespImg" data-img-type="sqr" data-img-size-lg="http://assets.jw.org/assets/m/ijw13pk/1102013357/ijw13pk_id-1102013357.art/1102013357_univ_sqr_lg.jpg" data-img-size-md="http://assets.jw.org/assets/m/ijw13pk/1102013357/ijw13pk_id-1102013357.art/1102013357_univ_sqr_md.jpg" data-img-size-sm="http://assets.jw.org/assets/m/ijw13pk/1102013357/ijw13pk_id-1102013357.art/1102013357_univ_sqr_sm.jpg" data-img-size-xs="http://assets.jw.org/assets/m/ijw13pk/1102013357/ijw13pk_id-1102013357.art/1102013357_univ_sqr_xs.jpg"></span></a><noscript><img src="http://assets.jw.org/assets/m/ijw13pk/1102013357/ijw13pk_id-1102013357_I/1102013357_univ_sqr_xs.jpg" alt="" /></noscript>
<div style="display:none;" class="jsVideoPoster mid1102013357" data-src="http://assets.jw.org/assets/m/ijw13pk/1102013357/ijw13pk_id-1102013357.art/1102013357_univ_lsr_lg.jpg" data-alt=""></div>
</div>]
20:27:54 T:5356 NOTICE: JWORG box 2
20:27:54 T:5356 NOTICE: JWORG box len 7
20:27:54 T:5356 NOTICE: []
20:27:54 T:5356 NOTICE: JWORG box 3
20:27:54 T:5356 NOTICE: JWORG box len 7
20:27:54 T:5356 NOTICE: []
EDIT 4:
OK, there are 30 divs because they're nested, but WHY are they empty? And how do I filter them out?
The problem is that findAll() searches recursively by default, and since there are nested divs whose class contains mixDesc, you get those in the results as well.
Pass recursive=False to findAll() and search only for divs directly inside the parent div with id=videosIndexList.
Also, BeautifulSoup3 is no longer maintained; switch to BeautifulSoup4 and use find_all().
Here's what the code should look like (using BeautifulSoup4):
import re
from urllib2 import urlopen
from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen('http://www.jw.org/it/video/?start=70'))
# Restrict the search to the container that holds the video boxes.
div = soup.find('div', {'id': 'videosIndexList'})
# recursive=False considers only direct children of that container.
boxes = div.find_all("div", { "class" : re.compile(r'\bmixDesc\b')}, recursive=False)
for box in boxes:
    print box.text
This will get you only top-level divs (10 boxes).
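To see the difference without fetching the live page, here is a minimal sketch on a made-up HTML fragment. The markup below is an assumption for illustration only and is not copied from jw.org; it simply nests a mixDesc div inside each top-level mixDesc div. It uses BeautifulSoup4 with the built-in "html.parser" and the class_ keyword instead of a regular expression, and it is written so that it runs under both Python 2 and Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two top-level boxes, each containing one nested
# div that also carries the mixDesc class.
HTML = """
<div id="videosIndexList">
  <div class="syn-img sqr mixDesc">
    <div class="jsVideoPoster mixDesc">nested 1</div>
  </div>
  <div class="syn-img sqr mixDesc">
    <div class="jsVideoPoster mixDesc">nested 2</div>
  </div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
container = soup.find("div", id="videosIndexList")

# Default (recursive) search: matches the top-level boxes and the nested ones.
all_boxes = container.find_all("div", class_="mixDesc")

# recursive=False: only direct children of the container are considered.
top_boxes = container.find_all("div", class_="mixDesc", recursive=False)

print(len(all_boxes))
print(len(top_boxes))
```

On this fragment the recursive search finds 4 divs and the non-recursive one finds 2, which mirrors the 30 versus 10 count from the question.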