
Python 3.6, web crawler using Selenium, cannot grab the dates and ratings of user reviews from Google Play Store

I am working on a web crawler project using Selenium with Python 3.6 in a Jupyter Notebook.

The goal is to grab an app's reviews together with their corresponding dates and ratings.

The target webpage is

https://play.google.com/store/apps/details?id=io.silvrr.installment&hl=en_US&gl=US&showAllReviews=true


I can get the reviews, but I have failed to grab their dates and ratings.

The code I used to grab the reviews is shown below:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

url = 'https://play.google.com/store/apps/details?id=io.silvrr.installment&hl=en_US&gl=US&showAllReviews=true' 
path = r"D:\Term 1\Cloud Computing\Session 4-5\CC_ETL\chromedriver"
browser = webdriver.Chrome(path)  # Open a Chrome browser
browser.get(url)

content = browser.find_elements_by_class_name('UD7Dzf')
User_Review = []  #Create an empty list to store the user reviews

for i, element in enumerate(content[:40], 1):  # at most the first 40 reviews, no IndexError if fewer are loaded
    first = {'Review -{}'.format(i): element.text}
    User_Review.append(first)
User_Review

I then tried to grab the dates with the class name "p2TkOb":

date = browser.find_elements_by_class_name('p2TkOb')

but it did not return the dates themselves. Instead, it returned a list of web elements, as shown below:

[<selenium.webdriver.remote.webelement.WebElement (session="b0e4c4f85a982c03a64302931a3474d1", element="dd95dfb2-c8da-47fc-8d5e-16a0ce010db7")>,
 <selenium.webdriver.remote.webelement.WebElement (session="b0e4c4f85a982c03a64302931a3474d1", element="7fd1ab9a-b654-4ec0-b6c7-e8bd5260d0db")>,

Also, there are two kinds of dates, one for the user's review and another for the developer's reply, and both have the same class name. However, I only want to grab the dates of the user reviews.


I also had trouble locating the rating element, for example, div aria-label="Rated 1 stars out of five stars".

Can anybody please help me? Thanks a lot!

It is not a good idea to search for the content and the date separately.

Instead, get the div that holds both the content and the date, and then use a relative find_element_by_... call to get the content and the date inside that div. This way you get the date that belongs to that content, and you stay in control of the pairing.

I use 'd15Mdf bAhLNe' to get the divs that hold both content and date (each div groups everything for one review). Then I search for the content and the date inside each div separately, using item.find... instead of browser.find..., so I get a single content and a single date, and I can be sure that this date belongs to this content.


Code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

url = 'https://play.google.com/store/apps/details?id=io.silvrr.installment&hl=en_US&gl=US&showAllReviews=true' 

path = r"D:\Term 1\Cloud Computing\Session 4-5\CC_ETL\chromedriver"
#browser = webdriver.Chrome(path)  # Open a Chrome browser
browser = webdriver.Firefox()

browser.get(url)

cards = browser.find_elements_by_class_name('d15Mdf.bAhLNe')  # use a dot, not a space, between the two class names

user_reviews = []  # Create an empty list to store the user reviews

for number, item in enumerate(cards, 1):  
    review = item.find_element_by_class_name('UD7Dzf').text
    date = item.find_element_by_class_name('p2TkOb').text
    user = item.find_element_by_class_name('X43Kjb').text
    rated = item.find_element_by_xpath('.//div[@class="pf5lIe"]/div').get_attribute('aria-label').split(' ')[1]

    user_reviews.append({
               'number': number, 
               'review': review, 
               'date': date, 
               'user': user,
               'rated': rated,
    })

for item in user_reviews:  
    print('---', item['number'], '---')
    print('date:', item['date'])
    print('user:', item['user'])
    print('rated:', item['rated'])
    print('review:', item['review'][:50], '...')

Result:

--- 1 ---
date: January 9, 2021
user: NICASIO NICOSIA
rated: 1
review: I used to love this app because of the payment ter ...
--- 2 ---
date: February 3, 2021
user: Shitta Soewarno
rated: 1
review: After I updated the application I could no longer  ...
--- 3 ---
date: January 25, 2021
user: Jennifer Jones
rated: 2
review: Used to love this alot and i always pay back on ti ...
--- 4 ---
date: February 5, 2021
user: Noy D Junior
rated: 3
review: Applications that helped out so much. But now I'm  ...
--- 5 ---
date: January 31, 2021
user: Syuhada Husni
rated: 1
review: it suddenly logged me out and now i cannot log in  ...
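
The date and rated values in the result above are plain strings. If they are needed as real numbers and dates, they could be converted afterwards. The following is only a sketch: parse_rating and parse_date are hypothetical helper names, and it assumes the page is served in English (hl=en_US), so that labels look like "Rated 1 stars out of five stars" and dates look like "January 9, 2021".

```python
import re
from datetime import datetime


def parse_rating(aria_label):
    """Return the first integer found in a label such as
    'Rated 1 stars out of five stars', or None if there is none."""
    match = re.search(r'\d+', aria_label)
    return int(match.group()) if match else None


def parse_date(text):
    """Parse a date such as 'January 9, 2021'.
    Assumes English month names (the default C/English locale)."""
    return datetime.strptime(text, '%B %d, %Y').date()


rating = parse_rating('Rated 1 stars out of five stars')
review_date = parse_date('January 9, 2021')
print(rating, review_date.isoformat())
```

Using a regular expression instead of split(' ')[1] means a label without a number yields None rather than raising an IndexError.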

BTW: Selenium converts find_element_by_class_name('name') to a CSS selector with a dot at the beginning, .name, but it has a problem with multiple names, as in find_element_by_class_name('name1 name2'). It should put a dot before every name and create .name1.name2, but it adds a dot only at the beginning, giving .name1 name2, so I manually add a dot between the names in

find_elements_by_class_name('d15Mdf.bAhLNe')
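
An alternative to adding the dot by hand is to build the full CSS selector from the class attribute and search by CSS selector instead. This is a small sketch; class_names_to_css is a hypothetical helper, not part of Selenium.

```python
def class_names_to_css(class_names):
    """Turn a space-separated class attribute such as 'name1 name2'
    into the CSS selector '.name1.name2'."""
    return ''.join('.' + name for name in class_names.split())


selector = class_names_to_css('d15Mdf bAhLNe')
print(selector)

# With a live browser (not started here) the selector could then be used as:
# cards = browser.find_elements_by_css_selector(selector)
```

Note that in Selenium 4 the find_element(s)_by_... helpers were deprecated and later removed, so there the equivalent call would be browser.find_elements(By.CSS_SELECTOR, selector) with By imported from selenium.webdriver.common.by.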

BTW: It is not a good idea to create unique keys such as Review-1, Review-2, because later it is hard to get a review back: you have to know which number to use in Review-{}. It is better to use the same key, review, in all items.
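
A short illustration of the difference, using a few values copied from the result above. The texts 'first text' and 'second text' are placeholders, not real reviews.

```python
# Same keys in every item: filtering and looping need no key arithmetic.
user_reviews = [
    {'number': 1, 'user': 'NICASIO NICOSIA', 'rated': '1'},
    {'number': 2, 'user': 'Shitta Soewarno', 'rated': '1'},
    {'number': 3, 'user': 'Jennifer Jones', 'rated': '2'},
]
one_star_users = [item['user'] for item in user_reviews if item['rated'] == '1']
print(one_star_users)

# Numbered keys: every lookup has to rebuild the key from the position.
numbered = [{'Review -1': 'first text'}, {'Review -2': 'second text'}]
texts = [item['Review -{}'.format(i)] for i, item in enumerate(numbered, 1)]
print(texts)
```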


BTW: The xpath starts with a dot ( .//div ) to create a relative xpath, which searches only inside item.
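
The effect of the leading dot can be tried without a browser. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Selenium (it supports only a small subset of XPath), and the markup is a made-up, heavily simplified version of two review cards, so it only illustrates the idea of searching relative to each card.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: two review cards with one rating each.
markup = '''
<root>
  <div class="card">
    <div class="pf5lIe"><div aria-label="Rated 1 stars out of five stars"/></div>
  </div>
  <div class="card">
    <div class="pf5lIe"><div aria-label="Rated 3 stars out of five stars"/></div>
  </div>
</root>
'''.strip()

root = ET.fromstring(markup)
cards = root.findall('./div')  # the two cards, direct children of root

ratings = []
for card in cards:
    # The path is relative to `card`, so only this card's subtree is searched.
    star = card.find('.//div[@class="pf5lIe"]/div')
    ratings.append(star.get('aria-label').split(' ')[1])

# The same path evaluated from the root matches the ratings of all cards.
all_stars = root.findall('.//div[@class="pf5lIe"]/div')

print(ratings, len(all_stars))
```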


I put it on GitHub furas/python-examples
