
Python & Beautifulsoup 4 - Unable to filter classes?

I'm trying to scrape shoe sizes from this URL: http://www.jimmyjazz.com/mens/footwear/jordan-retro-13--atmosphere-grey-/414571-016?color=Grey

What I'm trying to do is get only the sizes that are available, i.e. only those that aren't greyed out.

The sizes are all wrapped in <a> elements. The available sizes have the class box, and the unavailable ones have the classes box piunavailable.

I have tried using a lambda function, ifs and CSS selectors, and none of them seem to work. My guess is that it's because of the way my code is structured.

The way it's structured is as follows:

The if attempt

size = soup2.find('div', attrs={'class': 'psizeoptioncontainer'})
getsize = str([e.get_text() for e in size.findAll('a', attrs={'class': 'box'}) if 'piunavailable' not in e.attrs['class']])

The lambda attempt

size = soup2.find('div', attrs={'class': 'psizeoptioncontainer'})
getsize = str([e.get_text() for e in size.findAll(lambda tag: tag.name == 'a' and tag.get('class') == ['box piunavailable'])])

The CSS selector attempt

size = soup2.find('div', attrs={'class': 'psizeoptioncontainer'})
getsize = str([e.get_text() for e in size.findAll('a[class="box"]'))

So, for the URL provided, I am expecting the results to be a string (converted from list) that is all available sizes - at the time of writing this question, it should be - '8', '8.5', '9', '9.5', '10', '10.5', '11', '11.5', '13'

Instead, I'm getting all sizes, '7.5', '8', '8.5', '9', '9.5', '10', '10.5', '11', '11.5', '12', '13'

Anyone have an idea how to make it work (or know an elegant solution to my issue)? Thank you in advance!

What you are asking for is the a tags that have the class box and no other classes. This can be accomplished by passing a custom function as a filter to find_all.

def my_match_function(elem):
    if isinstance(elem, Tag) and elem.name == 'a' and ''.join(elem.attrs.get('class', '')) == 'box':
        return True

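Here ''.join(elem.attrs.get('class',''))=='box' ensures that the a tag has only the class box and no other class. BeautifulSoup stores class as a list of names, so a tag with class="box piunavailable" joins to 'boxpiunavailable' and is rejected.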
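As a small illustration of why the join works, the sketch below prints nothing but builds the class lists BeautifulSoup reports for two tags, then filters with a direct list comparison instead of the join. The two-tag HTML snippet and the name only_box are made up for this example; the comparison is an alternative way to express the same "box and nothing else" check, not part of the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical two-tag snippet, just to show how class values are stored
html = '<a class="box">8</a><a class="box piunavailable">12</a>'
soup = BeautifulSoup(html, "html.parser")

# class is a multi-valued attribute, so each tag exposes it as a list of names
class_lists = [a.get("class") for a in soup.find_all("a")]

def only_box(tag):
    # True only when the tag is an <a> whose class list is exactly ['box']
    return tag.name == "a" and tag.get("class") == ["box"]

exact_matches = [a.get_text() for a in soup.find_all(only_box)]
```

Comparing against ['box'] also avoids a corner case of the join, where two unrelated class names could happen to concatenate to 'box'.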

Let's see this in action:

from bs4 import BeautifulSoup,Tag
html="""
<a>This is also not needed.</a>
<div class="box_wrapper">
<a id="itemcode_11398535" class="box piunavailable">7.5</a>
<a href="#" id="itemcode_11398536" class="box">8</a>
<a href="#" id="itemcode_11398537" class="box">8.5</a>
<a href="#" id="itemcode_11398538" class="box">9</a>
<a href="#" id="itemcode_11398539" class="box">9.5</a>
<a href="#" id="itemcode_11398540" class="box">10</a>
<a href="#" id="itemcode_11398541" class="box">10.5</a>
<a href="#" id="itemcode_11398542" class="box">11</a>
<a href="#" id="itemcode_11398543" class="box">11.5</a>
<a id="itemcode_11398544" class="box piunavailable">12</a>
<a href="#" id="itemcode_11398545" class="box">13</a>
</div>
"""
def my_match_function(elem):
    if isinstance(elem, Tag) and elem.name == 'a' and ''.join(elem.attrs.get('class', '')) == 'box':
        return True

soup = BeautifulSoup(html, 'html.parser')
my_list = [x.text for x in soup.find_all(my_match_function)]
print(my_list)

Outputs:

['8', '8.5', '9', '9.5', '10', '10.5', '11', '11.5', '13']

You want the CSS :not() pseudo-class selector to exclude the other class. This was written using bs4 4.7.1.

sizes = [item.text for item in soup.select('.box:not(.piunavailable)')]

In full:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.jimmyjazz.com/mens/footwear/jordan-retro-13--atmosphere-grey-/414571-016?color=Grey')
# 'lxml' needs the lxml package installed; the built-in 'html.parser' can be used instead
soup = BeautifulSoup(r.content, 'lxml')
sizes = [item.text for item in soup.select('.box:not(.piunavailable)')]
print(sizes)
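Since the live page may change or be unreachable, the selector can also be checked offline. The sketch below runs it against a trimmed copy of the markup quoted in the other answer; the shortened HTML and the variable name available are assumptions for illustration, and the selector is narrowed to a.box so that only anchor tags are considered.

```python
from bs4 import BeautifulSoup

# Trimmed sample of the size markup (two available, two unavailable sizes)
html = """
<div class="box_wrapper">
<a id="itemcode_11398535" class="box piunavailable">7.5</a>
<a href="#" id="itemcode_11398536" class="box">8</a>
<a id="itemcode_11398544" class="box piunavailable">12</a>
<a href="#" id="itemcode_11398545" class="box">13</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep <a> tags that have the class box but not the class piunavailable
available = [a.get_text(strip=True) for a in soup.select("a.box:not(.piunavailable)")]
print(available)
```

Unlike the exact-match function, this selector keeps any tag that has box among its classes as long as piunavailable is absent, so it would still match if the site added further classes to the available sizes.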
