I have been trying to scrape both Yelp reviews and ratings using Python, but I have hit a dead end.
I was able to get a clean list of Yelp reviews using a for loop and appending each one to a list, but when I try the same approach for the ratings I only ever get the first rating. This is what I've tried:
for i in range(0, 10):
    url = "YELP LINK GOES HERE"
    ourUrl = urllib.request.urlopen(url)
    soup = BeautifulSoup(ourUrl, 'html.parser')
    review = soup.find_all('p', {'class': 'comment__373c0'})
    for i in range(0, len(review)):
        rating = soup.find("div", {"class": "i-stars__373c0"}).get('aria-label')
        print(rating)
output:
3 star rating
soup.find("div",{"class":"i-stars__373c0___sZu0"}).get('aria-label')
output:
'3 star rating'
print(soup.select('[aria-label*=rating]')[0]['aria-label'])
output:
3 star rating
soup.find_all('div',{'class':'i-stars__373c0___sZu0'})
Here the output includes every star rating on the page, but in this form:
aria-label="3 star rating" class="i-stars__373c0___sZu0
aria-label="3 star rating" class="i-stars__373c0___sZu0
aria-label="1 star rating" class="i-stars__373c0___sZu0
aria-label="2.5 star rating" class="i-stars__373c0___sZu0
# ... etc
My expected output is:
1. Review: text here
Rating: x star rating
I am using Jupyter Notebook.
Yelp is JavaScript-powered. Each page embeds a JSON payload containing the reviews in a script tag, but the reviews themselves aren't in the page until JS runs and injects them dynamically. If you disable JS before visiting the URL in a browser, or view the page source, you'll see the static HTML you actually have to work with in the response to your urllib.request.urlopen(url) call.
Here's the data we want, elided for clarity:
<script type="application/ld+json">{"@context":"https://schema.org","@type":"Restaurant","name":"Code Red Restaurant & Lounge" ... [more JSON] ...}</script>
One way to access this data is described in this answer in the canonical thread Web-scraping JavaScript page with Python. The strategy is to parse the JSON (or JS object) data out of the <script> tags. In this case, it's well-formed JSON, so you can use json.loads to create a dictionary without any preprocessing:
import json
import requests
from bs4 import BeautifulSoup

url = "https://www.yelp.com/biz/code-red-restaurant-and-lounge-bronx-2?osq=code+red"
res = requests.get(url)
res.raise_for_status()
soup = BeautifulSoup(res.text, "lxml")
jsons = (
    json.loads(x.contents[0])
    for x in soup.select('script[type="application/ld+json"]')
)
restaurant_data = next(
    x for x in jsons if "@type" in x and x["@type"] == "Restaurant"
)
print(json.dumps(restaurant_data, indent=2))
The restaurant_data dict has the following structure:
$ py scrape_yelp_reviews.py | jq 'keys'
[
  "@context",
  "@type",
  "address",
  "aggregateRating",
  "image",
  "name",
  "priceRange",
  "review",
  "servesCuisine",
  "telephone"
]
"review"
is a list of reviews, with each review having the rating and description you're interested in:
$ py scrape_yelp_reviews.py | jq '.review[0] | keys'
[
  "author",
  "datePublished",
  "description",
  "reviewRating"
]
A review looks like:
{
"author": "Nova R.",
"datePublished": "2021-04-04",
"reviewRating": {
"ratingValue": 1
},
"description": "Once upon a time , ... [rest of review text] ... "
}
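To get the expected output from the question, each dict in that list can be formatted into the numbered Review/Rating layout. Here's a minimal sketch, using illustrative stand-in data in place of the parsed restaurant_data["review"] list:

```python
# Sketch: format parsed review dicts into the asker's expected layout.
# `sample_reviews` is illustrative data matching the structure shown above;
# in practice you'd pass restaurant_data["review"] instead.
sample_reviews = [
    {"author": "Nova R.",
     "reviewRating": {"ratingValue": 1},
     "description": "Once upon a time ..."},
    {"author": "A. B.",
     "reviewRating": {"ratingValue": 4},
     "description": "Great wings."},
]

def format_reviews(reviews):
    lines = []
    for i, review in enumerate(reviews, start=1):
        # Each review contributes a numbered "Review:" line and a "Rating:" line.
        lines.append(f"{i}. Review: {review['description']}")
        lines.append(f"   Rating: {review['reviewRating']['ratingValue']} star rating")
    return lines

print("\n".join(format_reviews(sample_reviews)))
```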
If you're trying to get all of the reviews, you can loop over the pages in increments of 10 using the start=n query parameter:
import itertools
import json
import requests
from bs4 import BeautifulSoup

all_reviews = []

for start in itertools.count(0, 10):
    url = ("https://www.yelp.com/biz/code-red-restaurant-"
           "and-lounge-bronx-2?osq=code+red&start=%s") % start
    res = requests.get(url)

    if not res.ok:
        break

    soup = BeautifulSoup(res.text, "lxml")
    jsons = (
        json.loads(x.contents[0])
        for x in soup.select('script[type="application/ld+json"]')
    )

    try:
        restaurant_data = next(
            x for x in jsons if "@type" in x and x["@type"] == "Restaurant"
        )
    except StopIteration:
        break

    all_reviews.extend(restaurant_data["review"])

print(json.dumps(all_reviews, indent=2))
print(len(all_reviews))
If you want to speed this up, you could use asyncio or threads as described in Python download multiple files from links on pages .
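As a rough sketch of the threaded approach: factor the per-page logic into a function and map it over the start offsets with a pool. The `fetch_page` callable here is a hypothetical stand-in (not from the code above) so the sketch runs without network access; in practice it would contain the requests/BeautifulSoup logic from the loop:

```python
# Sketch: parallelize the page fetches with a thread pool.
# `fetch_page` is a hypothetical callable taking a start offset and
# returning that page's list of review dicts.
import itertools
from concurrent.futures import ThreadPoolExecutor

def scrape_reviews_concurrently(fetch_page, offsets, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves the order of `offsets`, so reviews stay in page
        # order even though pages are fetched concurrently.
        return list(itertools.chain.from_iterable(pool.map(fetch_page, offsets)))

# Usage with a stand-in fetcher (no network); swap in the real logic.
fake_pages = {0: [{"id": 1}, {"id": 2}], 10: [{"id": 3}]}
reviews = scrape_reviews_concurrently(lambda s: fake_pages.get(s, []), [0, 10, 20])
print(len(reviews))
```

Note that this assumes you know the offsets up front; with concurrent fetches you lose the simple "break on the first empty page" control flow of the sequential loop.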
Needless to say, if Yelp changes their JSON format, this code breaks. Another option that might be more reliable is to use a live scraper like Pyppeteer or Selenium (but then if a CSS selector changes, things break, so pick your poison).