
Extracting variable element with beautiful soup

I'm trying to scrape the ratings of Yelp restaurants, with no luck. I'm using Beautiful Soup to do so.

Basically, the source code looks like this:

<div class="i-stars i-stars--regular-5 rating-large" title="5.0 star rating">
    <img class="offscreen" height="303" src="https://s3-media2.fl.yelpcdn.com/assets/srv0/yelp_design_web/9b34e39ccbeb/assets/img/stars/stars.png" width="84" alt="5.0 star rating">
</div>

As you can see, the class name changes with the rating, so I'm trying to do some kind of "partial" match with my find function:

rating = r.find_all('div', {'class':'i-stars i-stars--regular'}).get('title', 'No title attribute')
print(rating)

But it does not seem to work.

Use a regex:

import re
rating = [x.get('title', 'No title attribute') for x in r.find_all('div', attrs={"class": re.compile("i-stars i-stars--regular-")})]
print(rating)
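
For a self-contained illustration of the regex approach, here is a minimal sketch. The sample HTML with several blocks is made up for the demonstration (only the class naming comes from the question), and the pattern is narrowed to the prefix of the variable class:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample: only the class naming is taken from the question.
sample = """
<div class="i-stars i-stars--regular-5 rating-large" title="5.0 star rating"></div>
<div class="i-stars i-stars--regular-3 rating-large" title="3.0 star rating"></div>
<div class="biz-name" title="not a rating"></div>
"""

soup = BeautifulSoup(sample, "html.parser")

# A compiled regex is tested against each class of a tag, so matching the
# variable class by its constant prefix is enough.
pattern = re.compile(r"^i-stars--regular-")
titles = [div.get("title", "No title attribute")
          for div in soup.find_all("div", class_=pattern)]
print(titles)
```

The list comprehension keeps one title per matching div, so it works whether the page holds one rating or many.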

If you need a partial search on the class, use:

s = """<div class="i-stars i-stars--regular-5 rating-large" title="5.0 star rating">
    <img class="offscreen" height="303" src="https://s3-media2.fl.yelpcdn.com/assets/srv0/yelp_design_web/9b34e39ccbeb/assets/img/stars/stars.png" width="84" alt="5.0 star rating">
</div>"""

from bs4 import BeautifulSoup
r = BeautifulSoup(s, "html.parser")
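rating = r.find_all('div')
for i in rating:
    # .get() avoids a KeyError on divs without a class attribute; the
    # substring stops before the digit so that any rating matches
    if "i-stars i-stars--regular-" in " ".join(i.get("class", [])):
        print(i.get('title', 'No title attribute'))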

Output:

5.0 star rating

You can do it by using a function:

from bs4 import BeautifulSoup
import re
html_text = '<div class="i-stars i-stars--regular-5 rating-large" title="5.0 star rating">\n    <img class="offscreen" height="303" src="https://s3-media2.fl.yelpcdn.com/assets/srv0/yelp_design_web/9b34e39ccbeb/assets/img/stars/stars.png" width="84" alt="5.0 star rating">\n</div>'
soup = BeautifulSoup(html_text, 'html.parser')
# x can be None for tags without a class attribute, so guard before searching
soup.find_all(class_ = lambda x: x is not None and re.search(r"i\-stars i\-stars\-\-regular\-\d rating\-\w+",x))

Pass a regex pattern inside the function that fits your requirement. Here the pattern matches any star count and any rating size.
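
As a variation on the function approach, here is a minimal sketch that uses a plain prefix check instead of a regex. The helper name `is_rating_class` and the sample HTML are hypothetical. It relies on Beautiful Soup calling the function with each CSS class of a tag individually, which is how the documentation describes class filters:

```python
from bs4 import BeautifulSoup

# Hypothetical sample; the second div has no class attribute at all.
sample = """
<div class="i-stars i-stars--regular-4 rating-large" title="4.0 star rating"></div>
<div title="no class here"></div>
"""

def is_rating_class(css_class):
    # Guard against None, which may be passed for tags without a class.
    return css_class is not None and css_class.startswith("i-stars--regular-")

soup = BeautifulSoup(sample, "html.parser")
matches = soup.find_all("div", class_=is_rating_class)
found = [tag["title"] for tag in matches]
print(found)
```

A named function keeps the None check readable and avoids escaping concerns in the pattern.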

find_all() returns a list of all the matches, so you can't use .get() on it directly. To get only one item, you'll have to use find_all(...)[0] or, even better, the find() function, which returns the first match.
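
The difference can be sketched as follows, using a trimmed copy of the markup from the question (the variable names are only for illustration):

```python
from bs4 import BeautifulSoup

html = '<div class="i-stars i-stars--regular-5 rating-large" title="5.0 star rating"></div>'
soup = BeautifulSoup(html, "html.parser")

results = soup.find_all("div", class_="i-stars")

# Calling .get() on the result list itself fails with AttributeError.
try:
    results.get("title")
    failed = False
except AttributeError:
    failed = True

# Index the list to reach a single Tag ...
first_title = results[0].get("title", "No title attribute")

# ... or use find(), which returns the first Tag, or None if nothing matches.
tag = soup.find("div", class_="i-stars")
title = tag.get("title", "No title attribute") if tag is not None else "No match"
print(failed, first_title, title)
```

Checking for None before calling .get() matters on real pages, where the element may be missing.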

Also, since the class name changes according to the rating, you can search on the classes that stay constant by passing them in a list. For example, here, i-stars and rating-large seem to be constant. So, you can use this:

from bs4 import BeautifulSoup

html = '''
<div class="i-stars i-stars--regular-5 rating-large" title="5.0 star rating">
    <img class="offscreen" height="303" src="https://s3-media2.fl.yelpcdn.com/assets/srv0/yelp_design_web/9b34e39ccbeb/assets/img/stars/stars.png" width="84" alt="5.0 star rating">
</div>'''
soup = BeautifulSoup(html, 'lxml')
rating = soup.find('div', {'class': ['i-stars', 'rating-large']}).get('title', 'No title attribute')
print(rating)
# 5.0 star rating

But you have to be careful when passing the classes in a list, as it will also match elements that have only one of those classes, like class="i-stars". If such cases exist, you can use the following selector:

ratings = soup.select_one('div.i-stars.rating-large').get('title', 'No title attribute')
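
To see the difference between the two lookups, here is a small sketch. The sample HTML is invented, with a decoy div that carries only one of the two classes:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the first div has only one of the two classes.
sample = """
<div class="i-stars" title="only one class"></div>
<div class="i-stars i-stars--regular-5 rating-large" title="5.0 star rating"></div>
"""
soup = BeautifulSoup(sample, "html.parser")

# A list of classes matches a tag that has ANY of them.
any_match = soup.find("div", {"class": ["i-stars", "rating-large"]})

# The CSS selector requires ALL of the listed classes on the same tag.
all_match = soup.select_one("div.i-stars.rating-large")

print(any_match["title"])
print(all_match["title"])
```

The list lookup stops at the decoy div, while the selector skips it and returns the real rating block.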
