
How do I parse two elements that are stuck together?

I want to get rating and numVotes from zomato.com, but unfortunately it seems like the elements are stuck together. It is hard to explain, so I made a quick video showcasing what I mean.

https://streamable.com/sdh0w

entire code: https://pastebin.com/JFKNuK2a

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/san-francisco/restaurants?q=restaurants&page=1",headers=headers)
content = response.content
bs = BeautifulSoup(content,"html.parser")

zomato_containers = bs.find_all("div", {"class": "search-snippet-card"})


for zomato_container in zomato_containers:
    rating = zomato_container.find('div', {'class': 'search_result_rating'})
    # numVotes = zomato_container.find("div", {"class": "rating-votes-div"})

    print("rating: ", rating.get_text().strip())
    # print("numVotes: ", numVotes.text())

You can use the re module to extract the vote count:

import re
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/san-francisco/restaurants?q=restaurants&page=1",headers=headers)
content = response.content
bs = BeautifulSoup(content,"html.parser")

zomato_containers = bs.find_all("div", {"class": "search-snippet-card"})

for zomato_container in zomato_containers:
    print('name:', zomato_container.select_one('.result-title').get_text(strip=True))
    print('rating:', zomato_container.select_one('.rating-popup').get_text(strip=True))
    votes = ''.join( re.findall(r'\d', zomato_container.select_one('[class^="rating-votes"]').text) )
    print('votes:', votes)
    print('*' * 80)

Prints:

name: The Original Ghirardelli Ice Cream and Chocolate...
rating: 4.9
votes: 344
********************************************************************************
name: Tadich Grill
rating: 4.6
votes: 430
********************************************************************************
name: Delfina
rating: 4.8
votes: 718
********************************************************************************

...and so on.

OR:

If you don't want to use re, you can use str.split():

votes = zomato_container.select_one('[class^="rating-votes"]').get_text(strip=True).split()[0]
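As a small offline illustration of the two approaches above, the sketch below applies both to plain strings. The sample strings and the helper names votes_with_re and votes_with_split are assumptions made for illustration; they are not taken from the Zomato page.

```python
import re


def votes_with_re(text):
    """Join every digit found in the text into one string."""
    return ''.join(re.findall(r'\d', text))


def votes_with_split(text):
    """Return the first whitespace-separated token of the text."""
    return text.strip().split()[0]


# Hypothetical vote strings, assumed to resemble the scraped text.
for sample in ("344 votes", "1,234 votes"):
    print(sample, '->', votes_with_re(sample), '|', votes_with_split(sample))
```

The two approaches differ when the number contains a thousands separator: the re version keeps only the digits, while the split version keeps the first token as it is, comma included. The split version also raises an IndexError on an empty string, so it assumes the element always contains some text.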

Based on the requirements shown in your clip, you should make your selectors more specific so that they target the appropriate child elements rather than the parent. At present, by targeting the parent, you also get the text of the unwanted extra child. To get the rating-votes element, you can use a CSS attribute selector with the starts-with operator (^=).

The selector

[class^=rating-votes-div]

matches elements whose class attribute value starts with rating-votes-div.
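To see the difference between targeting the parent and targeting the child without hitting the live site, here is a minimal sketch on an inline HTML snippet. The markup is hypothetical: it only assumes that the rating and the vote count are sibling children of search_result_rating and that the votes element's class begins with rating-votes-div followed by some suffix.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not copied from zomato.com).
html = """
<div class="search_result_rating">
  <div class="rating-popup">4.9</div>
  <span class="rating-votes-div-12345">344 votes</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Targeting the parent returns the text of both children run together.
parent = soup.select_one(".search_result_rating")
print("parent:", parent.get_text(strip=True))

# Targeting each child separately keeps the two values apart.
rating = soup.select_one(".rating-popup").get_text(strip=True)
votes = soup.select_one('[class^="rating-votes-div"]').get_text(strip=True)
print("rating:", rating)
print("votes:", votes)
```

Note that ^= tests the start of the whole class attribute value, so this selector would not match if another class name came first in the attribute.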




import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/san-francisco/restaurants?q=restaurants&page=1",headers=headers)
content = response.content
bs = BeautifulSoup(content,"html.parser")

zomato_containers = bs.find_all("div", {"class": "search-snippet-card"})


for zomato_container in zomato_containers:
    name = zomato_container.select_one('.result-title').text.strip()
    rating = zomato_container.select_one('.rating-popup').text.strip()
    numVotes = zomato_container.select_one('[class^=rating-votes-div]').text.strip()
    print('name: ', name)
    print('rating: ' , rating)
    print('votes: ', numVotes)

