Scraping Google Play reviews

I'm new to programming, and I recently tried to scrape Google Play reviews with Python using the following program:

from bs4 import BeautifulSoup
import urllib.request

url = input("Enter URL: ")
open_url = urllib.request.urlopen(url)

soup = BeautifulSoup(open_url, "html.parser")

reviews = []
for i in soup.find_all("div", {"jscontroller" : "X"}, {"class" : "X"}):
    per_review = i.find("X")
    reviews.append(per_review)

print(reviews)  

The problem is in this section:

for i in soup.find_all("div", {"jscontroller" : "X"}, {"class" : "X"}):
    per_review = i.find("X")
    reviews.append(per_review) 

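I have tried many parent nodes as well as the nodes that directly contain the reviews, but the output is always an empty list. Could someone demonstrate how to achieve what I intend? Thank you.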

Edit

For example, if I use the URL for Super Mario Run, the parameters are as follows:

reviews = []
for i in soup.find_all("div", {"jscontroller" : "LVJlx"}, {"class" : "UD7Dzf"}):
    per_review = i.find("span")
    reviews.append(per_review)

print(reviews)    

The output is an empty list.

The jscontroller and class values will not be consistent across different URLs. You could try something like:

soup.find_all('div', {'jscontroller': True}) 

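But this will not give you all the reviews, because they are added dynamically as you scroll down the page.

This means you need to scrape the page with an actual browser, or you can try to reverse engineer the API calls using the dev tools.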
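A minimal sketch of the attribute filters discussed above, run against a made-up HTML snippet (the markup, attribute values, and review texts are invented for illustration and are not taken from Google Play). It also shows that a second dict passed positionally to find_all lands in its recursive parameter instead of acting as a class filter, so both attributes belong in a single attrs dict.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration only; real Google Play pages differ.
html = """
<div jscontroller="abc" class="r1"><span>Great game</span></div>
<div jscontroller="xyz" class="r2"><span>Too many ads</span></div>
<div class="r1"><span>not a review</span></div>
"""
soup = BeautifulSoup(html, "html.parser")


def span_texts(divs):
    """Return the text of the first <span> inside each matched <div>."""
    return [div.find("span").get_text() for div in divs]


# True matches any <div> that has the attribute, whatever its value.
any_controller = span_texts(soup.find_all("div", {"jscontroller": True}))

# To require two attributes, put both into one attrs dict.
both_attrs = span_texts(
    soup.find_all("div", attrs={"jscontroller": "abc", "class": "r1"})
)

# The third positional argument of find_all is `recursive`, so this second
# dict is only read as a truthy flag and the class filter is never applied.
class_ignored = soup.find_all("div", {"jscontroller": "abc"}, {"class": "no-such-class"})

print(any_controller)
print(both_attrs)
print(len(class_ignored))
```

Even with a correct filter, this only sees what is present in the initial HTML, so the dynamic-loading caveat above still applies.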

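You can extract the review data from the inline <script> tags with a regular expression, then use further regular expressions to parse the user name, avatar, comment, and so on.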
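As a rough illustration of pulling data out of an inline <script> tag, the sketch below captures an array with one regular expression and decodes it with json.loads. This is a variation on the per-field regex approach, and it assumes the captured text is valid JSON, which may not hold for the real page. The page source, variable name, and field order are all invented for the example.

```python
import json
import re

# Invented, heavily simplified page source. The real payload is much larger
# and nested differently.
page_source = """
<html><body>
<script>var reviewData = [["gp:AAA", ["Alice"], 5, "Nice"], ["gp:BBB", ["Bob"], 2, "Meh"]];</script>
</body></html>
"""

# Capture from the first '[["gp:' up to the ']]' that closes the array
# right before ';</script>'.
match = re.search(r'(\[\["gp:.*?\]\]);</script>', page_source, re.DOTALL)
rows = json.loads(match.group(1)) if match else []

# Field positions are an assumption of this sketch, not Google's real layout.
reviews = [
    {"id": row[0], "name": row[1][0], "rating": row[2], "comment": row[3]}
    for row in rows
]

print(json.dumps(reviews, indent=2))
```

When the embedded data is not valid JSON, per-field regular expressions are the fallback.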

Pagination is done the same way: parse the next page token with a regular expression and make a POST request instead of a GET.
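The token-following loop could be sketched roughly as below. The token pattern, the fake responses, and the helper names (iter_pages, fake_fetch) are hypothetical. The real endpoint, POST payload, and token format are not given here and have to be read from the Network tab, so the actual request is replaced by a stand-in function that keeps the sketch runnable offline.

```python
import re
from typing import Callable, Iterator, List, Optional

# Assumed token shape, for this sketch only. Copy the real pattern from the
# Network tab or page source of the page you are scraping.
TOKEN_RE = re.compile(r'"next_page_token":"([^"]+)"')


def iter_pages(fetch: Callable[[Optional[str]], str], max_pages: int = 10) -> Iterator[str]:
    """Yield response bodies, following the token found in each one."""
    token: Optional[str] = None
    for _ in range(max_pages):        # hard cap so a bad regex cannot loop forever
        body = fetch(token)
        yield body
        match = TOKEN_RE.search(body)
        if match is None:             # no token means this was the last page
            break
        token = match.group(1)


# Stand-in for the real POST request, so the sketch runs offline.
FAKE_RESPONSES = {
    None: 'page 1 reviews "next_page_token":"t2"',
    "t2": 'page 2 reviews "next_page_token":"t3"',
    "t3": "page 3 reviews",
}
requested: List[Optional[str]] = []


def fake_fetch(token: Optional[str]) -> str:
    requested.append(token)
    return FAKE_RESPONSES[token]


pages = list(iter_pages(fake_fetch))
print(len(pages), requested)
```

In real use, fetch would wrap a requests.post call whose URL and body are copied from the request seen in the dev tools.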

The page token shows up in the Network tab of the dev tools, and the same token appears somewhere in a <script> tag in the page source.


Code, with an example in the online IDE, to scrape the first 40 reviews:

# everything here could be refactored to look simpler

from bs4 import BeautifulSoup
import requests, lxml, re, json
from datetime import datetime

# user-agent headers to act as a "real" user visit
headers = {
    "user-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
}

# search query params
params = {
    "id": "com.nintendo.zara",  # app name
    "gl": "ES"                  # country
}


html = requests.get("https://play.google.com/store/apps/details", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

# temporary store user comments
app_user_comments = []

# https://regex101.com/r/SrP5DS/1
app_user_reviews_data = re.findall(r"(\[\"gp.*?);</script>",
                                str(soup.select("script")), re.DOTALL)

for review in app_user_reviews_data:
    # https://regex101.com/r/M24tiM/1
    user_name = re.findall(r"\"gp:.*?\",\s?\[\"(.*?)\",", str(review))
    
    # https://regex101.com/r/TGgR45/1
    user_avatar = [avatar.replace('"', "") for avatar in re.findall(r"\"gp:.*?\"(https.*?\")", str(review))]

    # replace single/double quotes at the start/end of a string
    # https://regex101.com/r/iHPOrI/1
    user_comments = [comment.replace('"', "").replace("'", "") for comment in
                    re.findall(r"gp:.*?https:.*?]]],\s?\d+?,.*?,\s?(.*?),\s?\[\d+,", str(review))]

    # https://regex101.com/r/Z7vFqa/1
    user_comment_app_rating = re.findall(r"\"gp.*?https.*?\],(.*?)?,", str(review))
    
    # https://regex101.com/r/jRaaQg/1
    user_comment_likes = re.findall(r",?\d+\],?(\d+),?", str(review))
    
    # comment utc timestamp
    # use datetime.utcfromtimestamp(int(date)).date() to have only a date
    user_comment_date = [str(datetime.utcfromtimestamp(int(date))) for date in re.findall(r"\[(\d+),", str(review))]
    
    # https://regex101.com/r/GrbH9A/1
    user_comment_id = [ids.replace('"', "") for ids in re.findall(r"\[\"(gp.*?),", str(review))]
    
    for index, (name, avatar, comment, date, comment_id, likes, user_app_rating) in enumerate(zip(
        user_name,
        user_avatar,
        user_comments,
        user_comment_date,
        user_comment_id,
        user_comment_likes,
        user_comment_app_rating), start=1):

        app_user_comments.append({
            "position": index,
            "user_name": name,
            "user_avatar": avatar,
            "comment": comment,
            "app_rating": user_app_rating,
            "comment_likes": likes,
            "comment_published_at": date,
            "comment_id": comment_id
        })
        
print(json.dumps(app_user_comments, indent=2, ensure_ascii=False))

Part of the output:

[
  {
    "position": 1,
    "user_name": "cohen rigg",
    "user_avatar": "https://play-lh.googleusercontent.com/a-/AOh14Gj8OIwh0fMd13LXsIhaULJtqe39k3HwjH7AIZjTdQ",
    "comment": "This game is a good game. Its fun to simply pick up and play. However I have two major problems that make me never want to play this game again. 1. Microtransactions. You have to buy the entire game/story mode and wait for tickets to play different modes like remix 10 or toad rally. But you could just pay up and not wait. 2. Wifi problems. Not sure if this is my problem but the game never wants to work. At the end of the day there are better games on mobile worth more of your time.",
    "app_rating": "2",
    "comment_likes": "22",
    "comment_published_at": "2022-04-16 09:01:36",
    "comment_id": "gp:AOqpTOGljqOaIofAehHYtGN2ay-hnYEigfYD4hgzPoLseth5l-BzPn-RaIShuKzakplra0V1E3KJIu-AfsG5mA"
  }, ... other results
  {
    "position": 40,
    "user_name": "Claire Barrett",
    "user_avatar": "https://play-lh.googleusercontent.com/a-/AOh14GgeG3YaXc7tvjnl7kom2vYaTm4lXwS8UEDiZZV4BA",
    "comment": "After purchasing this is a super fun game, the game modes are fun and is a well executed idea. That is if you can get a \\good\\ network connection. Consistently when playing remix 10 I will get error after error and need to close the game, swipe it out of my recents, open it and wait for it to load all over again. A minor detail on top of this is the game is incredibly loud, I cant listen with music because the sound effects completely cover any music I put on.",
    "app_rating": "3",
    "comment_likes": "28",
    "comment_published_at": "2019-01-05 07:01:35",
    "comment_id": "gp:AOqpTOEOm_ilgrrHynfDLHvEusMHgvXtlwjSY-7SHBxH1Z-jgQQF62TRcFU4TQBQsFBaN1hNid3-yufUOV4IcQ"
  }
]
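One note on the timestamp handling in the script above: datetime.utcfromtimestamp is deprecated as of Python 3.12. A timezone-aware conversion is a possible replacement, sketched below. The helper name and the sample timestamp are chosen for illustration only.

```python
from datetime import datetime, timezone


def to_utc_string(timestamp) -> str:
    """Convert a Unix timestamp (int or numeric string) to a UTC date-time string."""
    moment = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


# Sample value for illustration; the regex in the script yields such
# timestamps as strings.
formatted = to_utc_string("1650099696")
print(formatted)
```

The format string keeps the same "YYYY-MM-DD HH:MM:SS" shape that str() produces for the naive datetime in the original script.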

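Alternatively, you can use the Google Play Product API from SerpApi. It is a paid API with a free plan.

The difference is that you don't have to figure out how to parse the data and maintain the parser over time, how to bypass blocks from Google, or how to implement pagination. Check out the playground.

Example code to integrate and parse the first 40 results: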

from serpapi import GoogleSearch
import json

params = {
  "api_key": "API_KEY",                # your serpapi api key
  "engine": "google_play_product",     # search engine
  "store": "apps",                     
  "gl": "es",                          # country to search from: Spain
  "product_id": "com.nintendo.zara",   # app ID
  "all_reviews": "true"                # show all reviews
}

search = GoogleSearch(params)          # where data extraction happens
results = search.get_dict()            # JSON -> Python dict

for review in results["reviews"]:
    print(json.dumps(review, indent=2))

Part of the output:

{
  "title": "cohen rigg",
  "avatar": "https://play-lh.googleusercontent.com/a-/AOh14Gj8OIwh0fMd13LXsIhaULJtqe39k3HwjH7AIZjTdQ",
  "rating": 2.0,
  "snippet": "This game is a good game. It's fun to simply pick up and play. However I have two major problems that make me never want to play this game again. 1. Microtransactions. You have to buy the entire game/story mode and wait for tickets to play different modes like remix 10 or toad rally. But you could just pay up and not wait. 2. Wifi problems. Not sure if this is my problem but the game never wants to work. At the end of the day there are better games on mobile worth more of your time.",
  "likes": 22,
  "date": "April 16, 2022"
} ... other results
{
  "title": "Claire Barrett",
  "avatar": "https://play-lh.googleusercontent.com/a-/AOh14GgeG3YaXc7tvjnl7kom2vYaTm4lXwS8UEDiZZV4BA",
  "rating": 3.0,
  "snippet": "After purchasing this is a super fun game, the game modes are fun and is a well executed idea. That is if you can get a \"good\" network connection. Consistently when playing remix 10 I will get error after error and need to close the game, swipe it out of my recents, open it and wait for it to load all over again. A minor detail on top of this is the game is incredibly loud, I can't listen with music because the sound effects completely cover any music I put on.",
  "likes": 28,
}
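If the reviews are going to be stored or sorted, it may help to normalize them first. The sketch below is one possible way to do it, assuming the field names shown in the output above and an English date string such as "April 16, 2022". The normalize_review helper and its output keys are hypothetical, and dates returned in another language or format would need a different parser.

```python
from datetime import datetime


def normalize_review(review: dict) -> dict:
    """Map one review dict to a flat record with an ISO date (or None)."""
    raw_date = review.get("date")     # may be missing, as in the second sample above
    iso_date = (
        datetime.strptime(raw_date, "%B %d, %Y").date().isoformat()
        if raw_date
        else None
    )
    return {
        "author": review.get("title"),
        "rating": review.get("rating"),
        "likes": review.get("likes", 0),
        "date": iso_date,
    }


# Sample dicts trimmed from the output shown above.
with_date = normalize_review(
    {"title": "cohen rigg", "rating": 2.0, "likes": 22, "date": "April 16, 2022"}
)
without_date = normalize_review({"title": "Claire Barrett", "rating": 3.0, "likes": 28})

print(with_date)
print(without_date)
```

Using .get() throughout keeps the helper from raising KeyError when a field is absent.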

To implement pagination, you can do this:

from serpapi import GoogleSearch
import json
from urllib.parse import urlsplit, parse_qsl


params = {
  "api_key": "API_KEY",                # your serpapi api key
  "engine": "google_play_product",     # search engine
  "store": "apps",                     
  "gl": "es",                          # country to search from: Spain
  "product_id": "com.nintendo.zara",   # app ID
  "all_reviews": "true"                # show all reviews
}

search = GoogleSearch(params)

# just to track what page is currently parsed
index = 0

reviews_is_present = True
while reviews_is_present:
    results = search.get_dict()        # JSON -> Python dict

    # update page number
    index += 1
    for review in results.get("reviews", []):
        
        print(f"\npage #: {index}\n")
        print(json.dumps(review, indent=2))
        
    if "next" in results.get("serpapi_pagination", []):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next")).query)))
    else:
        reviews_is_present = False

Part of the output:


page #: 1

{
  "title": "cohen rigg",
  "avatar": "https://play-lh.googleusercontent.com/a-/AOh14Gj8OIwh0fMd13LXsIhaULJtqe39k3HwjH7AIZjTdQ",
  "rating": 2.0,
  "snippet": "This game is a good game. It's fun to simply pick up and play. However I have two major problems that make me never want to play this game again. 1. Microtransactions. You have to buy the entire game/story mode and wait for tickets to play different modes like remix 10 or toad rally. But you could just pay up and not wait. 2. Wifi problems. Not sure if this is my problem but the game never wants to work. At the end of the day there are better games on mobile worth more of your time.",
  "likes": 22,
  "date": "April 16, 2022"
} ... other results

page #: 3

{
  "title": "Abbas Katebi",
  "avatar": "https://play-lh.googleusercontent.com/a/AATXAJx8y5Om_FMp3cpzCcQFlgSE7BYngAM6xtyZDuME=mo",
  "rating": 1.0,
  "snippet": "I purchased the game but the restore purchase button doesn't work and it says you have no content can be restored I have been trying to play world 2 for 8 days but still can't access to the full game I have been sending inquiries for 8 days but every time I sent inquiries they said restart the app how many times should I say I restarted the app for many times and it doesn't work solve my problem or give my money back I wonder why I bought the game it's support doesn't care about its customers",
  "likes": 29,
  "date": "March 10, 2022"
} ... other results
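The densest line in the pagination loop is the one that turns the "next" URL back into request parameters. Here is that step in isolation, using only the standard library. The URL and its token value are made up for illustration, and the query parameter names are assumptions rather than a documented response.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical "next" URL; the real one comes from results["serpapi_pagination"]["next"].
next_url = (
    "https://serpapi.com/search.json"
    "?engine=google_play_product&product_id=com.nintendo.zara&next_page_token=TOKEN123"
)

# urlsplit isolates the query string, parse_qsl splits it into (key, value)
# pairs, and dict() turns those pairs into a mapping.
next_params = dict(parse_qsl(urlsplit(next_url).query))

# Merging into the existing params keeps keys such as api_key that the
# "next" URL does not carry.
params = {"api_key": "API_KEY", "engine": "google_play_product", "all_reviews": "true"}
params.update(next_params)

print(next_params)
print(params)
```

This is why the loop calls update on the existing parameters instead of replacing them.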

Disclaimer: I work for SerpApi.
