
How can I scrape Yelp reviews and star ratings into CSV using Python BeautifulSoup?

I have been trying to scrape both Yelp reviews and ratings using Python but I have come to a dead-end.

I was able to get the list of clean Yelp reviews using a for loop and appending it to a list, but when I try the same for the ratings I keep getting only the first rating. This is what I've tried:

import urllib.request
from bs4 import BeautifulSoup

for i in range(0, 10):
    url = "YELP LINK GOES HERE"
    ourUrl = urllib.request.urlopen(url)
    soup = BeautifulSoup(ourUrl, 'html.parser')
    review = soup.find_all('p', {'class': 'comment__373c0'})

for i in range(0, len(review)):
    rating = soup.find("div", {"class": "i-stars__373c0"}).get('aria-label')
    print(rating)

output:

3 star rating

I also tried:

soup.find("div",{"class":"i-stars__373c0___sZu0"}).get('aria-label')

output:

'3 star rating'

and:

print(soup.select('[aria-label*=rating]')[0]['aria-label'])

output:

3 star rating

and:

soup.find_all('div',{'class':'i-stars__373c0___sZu0'})

This last call returns every star rating on the page, but the output looks like this:

<div aria-label="3 star rating" class="i-stars__373c0___sZu0...">
<div aria-label="3 star rating" class="i-stars__373c0___sZu0...">
<div aria-label="1 star rating" class="i-stars__373c0___sZu0...">
<div aria-label="2.5 star rating" class="i-stars__373c0___sZu0...">
# ... etc

My expected output is:

1. Review: text here

   Rating: x star rating

I am using Jupyter Notebook.

Yelp is JavaScript-powered: each page ships a JSON payload containing the reviews inside a script tag, but the review markup itself isn't present until JS runs and injects it into the page dynamically. If you disable JS before visiting the URL in a browser, or view the page source, you'll see the static HTML you actually have to work with in the response to your urllib.request.urlopen(url) call.

Here's the data we want, elided for clarity:

<script type="application/ld+json">{"@context":"https://schema.org","@type":"Restaurant","name":"Code Red Restaurant &amp; Lounge" ... [more JSON] ...}</script>

One way to access this data is described in this answer in the canonical thread Web-scraping JavaScript page with Python. The strategy is to parse the JSON (or JS object) data out of the <script> tags. In this case it's well-formed JSON, so you can use json.loads to create a dictionary without any preprocessing:

import json
import requests
from bs4 import BeautifulSoup

url = "https://www.yelp.com/biz/code-red-restaurant-and-lounge-bronx-2?osq=code+red"
res = requests.get(url)
res.raise_for_status()
soup = BeautifulSoup(res.text, "lxml")
# lazily parse every ld+json <script> block on the page
jsons = (
    json.loads(x.contents[0])
    for x in soup.select('script[type="application/ld+json"]')
)
restaurant_data = next(
    x for x in jsons if "@type" in x and x["@type"] == "Restaurant"
)
print(json.dumps(restaurant_data, indent=2))

The restaurant_data dict has the following structure:

$ py scrape_yelp_reviews.py | jq 'keys'
[
  "@context",
  "@type",
  "address",
  "aggregateRating",
  "image",
  "name",
  "priceRange",
  "review",
  "servesCuisine",
  "telephone"
]

"review" is a list of reviews, with each review having the rating and description you're interested in:

$ py scrape_yelp_reviews.py | jq '.review[0] | keys'
[
  "author",
  "datePublished",
  "description",
  "reviewRating"
]

A review looks like:

{
  "author": "Nova R.",
  "datePublished": "2021-04-04",
  "reviewRating": {
    "ratingValue": 1
  },
  "description": "Once upon a time , ... [rest of review text] ... "
}
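Those fields map directly onto the expected output from the question; pulling them out is plain nested dict access. A minimal sketch, using stand-in dicts where real code would iterate over restaurant_data["review"]:

```python
# stand-in for restaurant_data["review"]; real code would use that list
reviews = [
    {"reviewRating": {"ratingValue": 1}, "description": "Once upon a time ..."},
    {"reviewRating": {"ratingValue": 5}, "description": "Great spot."},
]

# print numbered Review/Rating pairs in the question's expected format
for i, review in enumerate(reviews, start=1):
    print(f"{i}. Review: {review['description']}")
    print(f"   Rating: {review['reviewRating']['ratingValue']} star rating")
```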

If you're trying to get all of the reviews, you can loop over the pages in increments of 10 using the start=n query parameter:

import itertools
import json
import requests
from bs4 import BeautifulSoup

all_reviews = []

for start in itertools.count(0, 10):
    url = ("https://www.yelp.com/biz/code-red-restaurant-"
           "and-lounge-bronx-2?osq=code+red&start=%s") % start
    res = requests.get(url)

    if not res.ok:
        break

    soup = BeautifulSoup(res.text, "lxml")
    jsons = (
        json.loads(x.contents[0])
        for x in soup.select('script[type="application/ld+json"]')
    )

    try:
        restaurant_data = next(
            x for x in jsons if "@type" in x and x["@type"] == "Restaurant"
        )
    except StopIteration:
        break

    all_reviews.extend(restaurant_data["review"])

print(json.dumps(all_reviews, indent=2))
print(len(all_reviews))

If you want to speed this up, you could use asyncio or threads as described in Python download multiple files from links on pages.
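The thread-based version of that idea looks roughly like the sketch below, using only stdlib pieces. It assumes you know the page offsets up front (the sequential itertools.count loop above stops on the first failure, which doesn't parallelize directly). parse_page here just parses a pre-fetched JSON string; real code would fetch each start offset with requests.get and pull the ld+json block out first:

```python
import json
from concurrent.futures import ThreadPoolExecutor

# stand-ins for the ld+json payloads each page fetch would yield
pages = [
    '{"@type": "Restaurant", "review": [{"reviewRating": {"ratingValue": 5}, "description": "ok"}]}',
    '{"@type": "Restaurant", "review": [{"reviewRating": {"ratingValue": 3}, "description": "meh"}]}',
]

def parse_page(payload):
    # real code: requests.get the page, locate the ld+json <script>, then json.loads
    data = json.loads(payload)
    return data.get("review", [])

with ThreadPoolExecutor(max_workers=8) as ex:
    results = list(ex.map(parse_page, pages))  # map preserves input order

# flatten the per-page lists into one list of reviews
all_reviews = [review for page in results for review in page]
```

Threads suit this workload because the time is dominated by waiting on network I/O, not CPU.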

Needless to say, if Yelp changes their JSON format, this code breaks. Another option that might be more reliable is to drive a live browser with Pyppeteer or Selenium (but then if the CSS selectors change, things break too, so pick your poison).
