
Python BeautifulSoup: get the text of the first-level tags only

I need to get the text of the a tag inside each first-level li tag with BeautifulSoup in Python.

The problem is that these li tags contain other li tags, which in turn contain other tags too.

Example HTML:

<li>
   <a href="http://lol.lol">Text1</a><-- GET THIS
   <li>
      <a href="http://lol.lol">Text1</a><-- DON'T GET THIS
   </li>
</li>
<li>
   <a href="http://lol.lol">Text2</a><-- GET THIS
   <li>
      <a href="http://lol.lol">Text2-2</a><-- DON'T GET THIS
   </li>
</li>

EDIT:

I've been testing and I still cannot extract only the first-level a tags.

This is the original markup that I am trying to extract from:

<div id="categories_block_left" class="block block-highlighted">
<h4 class="title_block">
<span class="icon-box fa fa-bars"></span>
RELOJES
</h4>
<div class="block_content" style="">
<ul class="list-block list-group bullet tree dynamized" style="display: block;">
<li>
<span class="grower CLOSE"> </span><a href="http://www.joyeriasanchez.com/50-outlet" title="OUTLET">
OUTLET
<span id="leo-cat-50" style="display:none" class="leo-qty badge pull-right"></span>
</a>
<ul style="display: none;">
<li>
<a href="http://www.joyeriasanchez.com/47-adidas" title="Adidas">
Adidas
<span id="leo-cat-47" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/125-miss-sixty" title="Miss Sixty">
Miss Sixty
<span id="leo-cat-125" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/49-converse" title="Converse">
Converse
<span id="leo-cat-49" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/61-armand-basi" title="Armand Basi">
Armand Basi
<span id="leo-cat-61" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/79-marea" title="Marea">
Marea
<span id="leo-cat-79" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/86-marc-ecko" title="Marc Ecko">
Marc Ecko
<span id="leo-cat-86" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/107-festina" title="Festina">
Festina
<span id="leo-cat-107" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/135-seiko" title="Seiko">
Seiko
<span id="leo-cat-135" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/221-relojes-swatch-liquidar" title="Relojes Swatch liquidar">
Relojes Swatch liquidar
<span id="leo-cat-221" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</li>
<li>
<span class="grower CLOSE"> </span><a href="http://www.joyeriasanchez.com/184-lotus" title="Lotus">
Lotus
<span id="leo-cat-184" style="display:none" class="leo-qty badge pull-right"></span>
</a>
<ul style="display: none;">
<li>
<a href="http://www.joyeriasanchez.com/195-lotus-hombre" title="Lotus Hombre">
Lotus Hombre
<span id="leo-cat-195" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/196-lotus-mujer" title="Lotus Mujer">
Lotus Mujer
<span id="leo-cat-196" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/236-lotus-infantil" title="Lotus Infantil">
Lotus Infantil
<span id="leo-cat-236" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</li>
<li>
<a href="http://www.joyeriasanchez.com/218-daniel-wellington" title="Daniel Wellington">
Daniel Wellington
<span id="leo-cat-218" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<span class="grower CLOSE"> </span><a href="http://www.joyeriasanchez.com/197-viceroy" title="Viceroy">
Viceroy
<span id="leo-cat-197" style="display:none" class="leo-qty badge pull-right"></span>
</a>
<ul style="display: none;">
<li>
<a href="http://www.joyeriasanchez.com/198-viceroy-hombre" title="Viceroy Hombre">
Viceroy Hombre
<span id="leo-cat-198" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/199-viceroy-mujer" title="Viceroy Mujer">
Viceroy Mujer
<span id="leo-cat-199" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/235-viceroy-infantil" title="Viceroy Infantil">
Viceroy Infantil
<span id="leo-cat-235" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</li>
<li>
<a href="http://www.joyeriasanchez.com/51-ice-watch" title="Ice watch">
Ice watch
<span id="leo-cat-51" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/64-relojes-swatch" title="Relojes Swatch">
Relojes Swatch
<span id="leo-cat-64" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/80-mark-maddox" title="Mark Maddox">
Mark Maddox
<span id="leo-cat-80" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/81-ferrari" title="Ferrari">
Ferrari
<span id="leo-cat-81" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/173-relojes-cadete" title="Relojes Cadete">
Relojes Cadete
<span id="leo-cat-173" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<span class="grower CLOSE"> </span><a href="http://www.joyeriasanchez.com/200-tous" title="Tous">
Tous
<span id="leo-cat-200" style="display:none" class="leo-qty badge pull-right"></span>
</a>
<ul style="display: none;">
<li>
<a href="http://www.joyeriasanchez.com/201-tous-kids" title="Tous Kids">
Tous Kids
<span id="leo-cat-201" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/203-tous-mujer" title="Tous Mujer">
Tous Mujer
<span id="leo-cat-203" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/204-tous-hombre" title="Tous Hombre">
Tous Hombre
<span id="leo-cat-204" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/220-certina" title="Certina">
Certina
<span id="leo-cat-220" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</div>
</div>

And this is the code I am using to try to extract them:

import requests
from bs4 import BeautifulSoup

req2 = requests.get(url2)
html2 = BeautifulSoup(req2.text)
catmenu = html2.find('div', {'id':'categories_block_left'})
# Only direct li children of the div are searched here
categorys = catmenu.find_all('li', recursive=False)
for cat in categorys:
    categor = cat.find('a').getText()
    print ("   SubCategor:%s" % categor)

But it returns no values. I just need the text of the first-level a tags.
Expected output:

OUTLET,
Lotus,
Daniel Wellington,
Viceroy,
Ice watch,
Relojes Swatch,
Mark Maddox,
Ferrari,
Relojes Cadete,
Tous,
Certina

You can pass recursive=False to the find_all method. It then returns only the li tags that are direct children of the tag it is called on, which here are the top-level li tags:

In [62]: soup.find_all('li', recursive=False)
Out[62]: 
[<li>
 <a href="http://lol.lol">Text1</a>
 <li>
 <a href="http://lol.lol">Text1</a>
 </li>
 </li>, <li>
 <a href="http://lol.lol">Text2</a>
 <li>
 <a href="http://lol.lol">Text2-2</a>
 </li></li>]
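
Note that recursive=False restricts the search to direct children of the tag that find_all is called on. In the markup from the question, the li tags sit inside a ul within div.block_content rather than directly under div#categories_block_left, which would explain why calling find_all('li', recursive=False) on that div finds nothing. A minimal sketch of that behaviour, using a small hypothetical snippet (the id "menu" and the link texts are made up for illustration) and assuming the html.parser backend:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a wrapper div, a top-level ul, and one nested ul.
html = """
<div id="menu">
  <ul>
    <li><a href="#">Text1</a>
      <ul><li><a href="#">Text1-1</a></li></ul>
    </li>
    <li><a href="#">Text2</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
menu = soup.find("div", id="menu")

# The div's only tag child is the ul, so no li is a direct child of it.
direct_from_div = menu.find_all("li", recursive=False)

# Step down to the first ul, then ask for its direct li children.
top_ul = menu.find("ul")
direct_from_ul = top_ul.find_all("li", recursive=False)

# find('a') returns the first a tag inside each li, i.e. the first-level link.
texts = [li.find("a").get_text(strip=True) for li in direct_from_ul]
print(direct_from_div)
print(texts)
```

So the usual pattern is to navigate to the element whose direct children are wanted (here the ul) before calling find_all with recursive=False.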

Then you can retrieve the text from the first a tag of each li:

In [63]: [li.find('a').text for li in soup.find_all('li', recursive=False)]
Out[63]: ['Text1', 'Text2']
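
As an alternative that is not part of the answer above, a CSS selector with the child combinator (>) can express the same "first level only" constraint in one step. The sketch below runs against a trimmed copy of the markup from the question (only two top-level entries are kept, and the h4 is simplified) and assumes the html.parser backend; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup: two top-level entries, one with a submenu.
html = """
<div id="categories_block_left" class="block block-highlighted">
<h4 class="title_block">RELOJES</h4>
<div class="block_content" style="">
<ul class="list-block list-group bullet tree dynamized" style="display: block;">
<li>
<span class="grower CLOSE"> </span><a href="http://www.joyeriasanchez.com/50-outlet" title="OUTLET">
OUTLET
<span id="leo-cat-50" style="display:none" class="leo-qty badge pull-right"></span>
</a>
<ul style="display: none;">
<li>
<a href="http://www.joyeriasanchez.com/47-adidas" title="Adidas">
Adidas
<span id="leo-cat-47" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/220-certina" title="Certina">
Certina
<span id="leo-cat-220" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
catmenu = soup.find("div", {"id": "categories_block_left"})

# "A > B" matches B only when it is a direct child of A, so the a tags
# inside the nested (hidden) ul are not selected.
links = catmenu.select("div.block_content > ul > li > a")

# strip=True removes the surrounding newlines and drops empty strings.
names = [a.get_text(strip=True) for a in links]
print(names)
```

With the full page, the same selector should yield the eleven top-level category names listed in the question, provided the live markup matches the excerpt shown there.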
