How to scrape Yelp reviews and star ratings into a CSV with Python and BeautifulSoup

Question

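I've been trying to scrape Yelp reviews and ratings with Python, but I've hit a dead end.

I was able to get a clean list of Yelp reviews by using a for loop and appending them to a list, but when I try the same thing for the ratings, I keep getting only the first rating. Here is what I tried: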

for i in range(0,10):
    url = "YELP LINK GOES HERE"
    ourUrl = urllib.request.urlopen(url)
    soup = BeautifulSoup(ourUrl,'html.parser')
    review = soup.find_all('p',{'class':'comment__373c0'})

for i in range(0,len(review)):
    rating = soup.find("div",{"class":"i-stars__373c0"}).get('aria-label')
    print(rating)

Output:

3 star rating

soup.find("div",{"class":"i-stars__373c0___sZu0"}).get('aria-label')

Output:

'3 star rating'

print(soup.select('[aria-label*=rating]')[0]['aria-label'])

Output:

3 star rating

soup.find_all('div',{'class':'i-stars__373c0___sZu0'})

The output here gives me every star rating in the code, but it looks like this:

aria-label="3 star rating" class="i-stars__373c0___sZu0
aria-label="3 star rating" class="i-stars__373c0___sZu0
aria-label="1 star rating" class="i-stars__373c0___sZu0
aria-label="2.5 star rating" class="i-stars__373c0___sZu0
# ... etc

My expected output is:

1. Review: text here

   Rating: x star rating

I am using Jupyter Notebook.

Answer 1

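Yelp is JavaScript-driven. Each page has a JSON payload containing the reviews in a script tag, but there are no reviews in the page's HTML until the JS runs and injects them dynamically. If you disable JS before visiting the URL in your browser, or view the page source, you'll see the static HTML you have to work with in the response to your urllib.request.urlopen(url) call.

Here is the data we want, elided for clarity: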

<script type="application/ld+json">{"@context":"https://schema.org","@type":"Restaurant","name":"Code Red Restaurant &amp; Lounge" ... [more JSON] ...}</script>

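One way to access this data is described in this answer in the canonical thread "Web-scraping JavaScript page with Python". The strategy is to parse the JSON (or JS object) data out of the <script> tag. In this case it is well-formed JSON, so you can use json.loads to create a dictionary without any preprocessing: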

import json
import requests
from bs4 import BeautifulSoup

url = "https://www.yelp.com/biz/code-red-restaurant-and-lounge-bronx-2?osq=code+red"
res = requests.get(url)
res.raise_for_status()
soup = BeautifulSoup(res.text, "lxml")
jsons = (
    json.loads(x.contents[0])
    for x in soup.select('script[type="application/ld+json"]')
)
restaurant_data = next(
    x for x in jsons if "@type" in x and x["@type"] == "Restaurant"
)
print(json.dumps(restaurant_data, indent=2))
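The generator expression above assumes that every ld+json block parses cleanly and that "@type" is a plain string. As a hedged sketch, the same lookup could be wrapped in a helper that skips malformed blocks and returns None rather than raising StopIteration. The name find_ld_json and the sample strings are illustrative assumptions, not part of Yelp's pages or the original answer:

```python
import json


def find_ld_json(raw_blocks, wanted_type):
    """Return the first JSON-LD object whose "@type" matches, else None.

    raw_blocks: iterable of strings, e.g. the text of each
    <script type="application/ld+json"> tag (hypothetical input shape).
    """
    for raw in raw_blocks:
        try:
            data = json.loads(raw)
        except (TypeError, ValueError):
            continue  # skip empty (None) or malformed blocks
        if not isinstance(data, dict):
            continue  # e.g. a top-level JSON array
        types = data.get("@type")
        # JSON-LD allows "@type" to be a string or a list of strings.
        if types == wanted_type or (
            isinstance(types, list) and wanted_type in types
        ):
            return data
    return None


# Made-up sample data standing in for the script tag contents.
sample_blocks = [
    "not json",
    '{"@type": "BreadcrumbList"}',
    '{"@type": "Restaurant", "name": "Example Cafe", "review": []}',
]
restaurant = find_ld_json(sample_blocks, "Restaurant")
print(restaurant["name"])
```

With the soup from the code above, the input could be something like (x.string for x in soup.select('script[type="application/ld+json"]')).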

The restaurant_data dictionary has the following structure:

$ py scrape_yelp_reviews.py | jq 'keys'
[
  "@context",
  "@type",
  "address",
  "aggregateRating",
  "image",
  "name",
  "priceRange",
  "review",
  "servesCuisine",
  "telephone"
]

"review" is a list of reviews, each of which has the rating and description you're interested in:

$ py scrape_yelp_reviews.py | jq '.review[0] | keys'
[
  "author",
  "datePublished",
  "description",
  "reviewRating"
]

A single review looks like this:

{
  "author": "Nova R.",
  "datePublished": "2021-04-04",
  "reviewRating": {
    "ratingValue": 1
  },
  "description": "Once upon a time , ... [rest of review text] ... "
}
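To get from this structure to the numbered "Review / Rating" layout asked for in the question, something along these lines might work. It is a minimal sketch: format_reviews and the sample reviews are made up for illustration, and it assumes every review carries both "description" and "reviewRating" as in the example above.

```python
def format_reviews(reviews):
    """Build numbered "Review / Rating" lines from JSON-LD review dicts."""
    lines = []
    for number, review in enumerate(reviews, start=1):
        rating = review["reviewRating"]["ratingValue"]
        lines.append(f"{number}. Review: {review['description']}")
        lines.append(f"   Rating: {rating} star rating")
    return "\n".join(lines)


# Made-up reviews that mirror the structure shown above.
sample_reviews = [
    {"description": "Great food.", "reviewRating": {"ratingValue": 5}},
    {"description": "Slow service.", "reviewRating": {"ratingValue": 2}},
]
print(format_reviews(sample_reviews))
```

Because the rating and the text live in the same dictionary, there is no need to match up two separately scraped lists.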

If you want to get all of the reviews, you can step through the pages in increments of 10 using the start=n query parameter:

import itertools
import json
import requests
from bs4 import BeautifulSoup

all_reviews = []

for start in itertools.count(0, 10):
    url = ("https://www.yelp.com/biz/code-red-restaurant-"
           "and-lounge-bronx-2?osq=code+red&start=%s") % start
    res = requests.get(url)

    if not res.ok:
        break

    soup = BeautifulSoup(res.text, "lxml")
    jsons = (
        json.loads(x.contents[0])
        for x in soup.select('script[type="application/ld+json"]')
    )

    try:
        restaurant_data = next(
            x for x in jsons if "@type" in x and x["@type"] == "Restaurant"
        )
    except StopIteration:
        break

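    reviews = restaurant_data.get("review", [])

    if not reviews:  # stop once a page has no reviews, to avoid looping forever
        break

    all_reviews.extend(reviews)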

print(json.dumps(all_reviews, indent=2))
print(len(all_reviews))
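Since the goal in the question title is a CSV file, here is one hedged way to write a list shaped like all_reviews out with the standard csv module. The function name, the column names and the sample review are assumptions chosen for illustration; only the dictionary keys come from the structure shown earlier.

```python
import csv
import io


def write_reviews_csv(reviews, fileobj):
    """Write one CSV row per review: author, date, rating, text."""
    writer = csv.writer(fileobj)
    writer.writerow(["author", "date", "rating", "review"])
    for review in reviews:
        writer.writerow([
            review.get("author", ""),
            review.get("datePublished", ""),
            review.get("reviewRating", {}).get("ratingValue", ""),
            review.get("description", ""),
        ])


# Made-up review in the same shape as the entries in all_reviews.
sample_reviews = [
    {
        "author": "Jane D.",
        "datePublished": "2021-01-01",
        "reviewRating": {"ratingValue": 4},
        "description": 'Tasty, but "busy" on weekends.',
    }
]

buffer = io.StringIO()  # in-memory stand-in for a real file
write_reviews_csv(sample_reviews, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, open a file with open("reviews.csv", "w", newline="", encoding="utf-8") (the filename is just an example) and pass it in place of the StringIO buffer. The csv module takes care of quoting commas and quotation marks inside the review text.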

If you want to speed up the page fetching, you can use asyncio or threads, as described in "Python download multiple files from links on pages".
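As a rough sketch of the threaded variant, the standard library's ThreadPoolExecutor can fetch several pages at once. This assumes you already know roughly how many reviews exist, so the page URLs can be built up front. The business URL, the helper names and fake_fetch are all hypothetical; fake_fetch stands in for a real request so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode


def page_urls(base_url, total_reviews, page_size=10):
    """Build one URL per page of reviews using the start=n parameter."""
    return [
        f"{base_url}?{urlencode({'start': start})}"
        for start in range(0, total_reviews, page_size)
    ]


def fetch_all(urls, fetch, max_workers=4):
    """Call fetch(url) for each URL in a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Stand-in for something like `lambda u: requests.get(u).text`.
    return f"<html>{url}</html>"


urls = page_urls("https://www.yelp.com/biz/some-business", 25)
pages = fetch_all(urls, fake_fetch)
print(len(pages))
```

Each returned page can then be parsed the same way as in the sequential loop. Keeping max_workers small is a deliberate choice here, since many parallel requests are more likely to be rate-limited or blocked.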

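It should go without saying, but if Yelp changes its JSON format, this code will break. Another, possibly more reliable, option is to use a live-scraping tool like Pyppeteer or Selenium (but then things break if the CSS selectors change, so pick your poison).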

Answer 1
0 2021-11-02 20:31:54