
How to scrape star ratings using Selenium or BeautifulSoup in Python?

I am trying to scrape ratings that are displayed as stars. The stars are rendered in different colors, so they can be told apart visually in Chrome, but the tags for all of the stars look the same in the HTML. Is there a way to scrape the rating for each sub-category based on the color of the stars? For example, Work/Life Balance should have a rating of 3.

The webpage can be found here: https://www.glassdoor.ca/Reviews/Employee-Review-AAR-RVW40036525.htm


The ratings can be told apart by class name: each star rating uses a different class on its element. The mapping below lists the class name for each rating (the value is the class name). This should be enough to get you started.

{
"one_star" : "css-152xdkl",
"two_star" : "css-19o85uz",
"three_star" : "css-1ihykkv",
"four_star" : "css-1c07csa",
"five_star" : "css-1dc0bv4",
}
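
As a minimal sketch of how the mapping above could be used, the lookup can be inverted so that a class name read from a star element gives the numeric rating. The helper name `rating_from_classes` is hypothetical, and the class names are the ones listed above; they are generated by the site and may change at any time.

```python
# Class names from the mapping above, inverted: class name -> star count.
# These are generated by the site and may change without notice.
STAR_CLASSES = {
    "css-152xdkl": 1,
    "css-19o85uz": 2,
    "css-1ihykkv": 3,
    "css-1c07csa": 4,
    "css-1dc0bv4": 5,
}


def rating_from_classes(class_names):
    """Return the star rating for the first known class name, or None."""
    for name in class_names:
        if name in STAR_CLASSES:
            return STAR_CLASSES[name]
    return None


# A star element whose class attribute is "css-1ihykkv" is a 3-star rating.
print(rating_from_classes(["css-1ihykkv"]))      # 3
print(rating_from_classes(["unrelated-class"]))  # None
```

Returning `None` for an unknown class, rather than raising, makes it easier to notice when the site has changed its class names without the whole scrape failing.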

This is what I did. I used mostly BeautifulSoup because I'm more comfortable with it.

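from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

# Open the review page from the question
# (assumes Chrome and a matching driver are installed)
driver = webdriver.Chrome()
driver.get("https://www.glassdoor.ca/Reviews/Employee-Review-AAR-RVW40036525.htm")

# Map each star class name to its rating
star_dict = {"css-152xdkl": 1, "css-19o85uz": 2, "css-1ihykkv": 3,
             "css-1c07csa": 4, "css-1dc0bv4": 5}

# Find all the reviews on the page
# (find_elements_by_class_name was removed in Selenium 4; use find_elements with By)
reviews = driver.find_elements(By.CLASS_NAME, 'gdReview')

# I used BeautifulSoup to collect the ratings
all_sub_ratings = []  # one dict of sub-ratings per review
for review in reviews:
    # Convert the Selenium element for a review into a BeautifulSoup object
    review_source = review.get_attribute('innerHTML')
    soup = BeautifulSoup(review_source, 'lxml')

    # Find the sub-ratings tag; skip reviews that have none
    sub_ratings_tag = soup.find("div", {"class": "tooltipContainer"})
    if sub_ratings_tag is None:
        continue
    # Find all the "li" tags
    li_tags = sub_ratings_tag.find_all("li")

    # Loop over each "li" tag and collect the ratings
    sub_rating_dict = {}
    for li_tag in li_tags:
        sub_cat, div_class = None, None
        for div_tag in li_tag.find_all("div"):
            # A div with a class holds the stars; a div without one holds the rating name
            if div_tag.has_attr("class"):
                div_class = div_tag["class"][0]
            else:
                sub_cat = div_tag.text.strip()
        # Skip entries with a missing name or an unrecognised class name
        if sub_cat is not None and div_class in star_dict:
            sub_rating_dict[sub_cat] = star_dict[div_class]
    all_sub_ratings.append(sub_rating_dict)

driver.quit()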
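
To try the parsing step without a browser, the same logic can be run against a static HTML string. The following is only a sketch: `SAMPLE_HTML` is hypothetical markup shaped the way the loop above expects (one unclassed `div` for the name and one classed `div` for the stars in each `li`), not a copy of the real page, and `parse_sub_ratings` is an illustrative helper name. It uses the built-in `html.parser` instead of `lxml` so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

STAR_CLASSES = {"css-152xdkl": 1, "css-19o85uz": 2, "css-1ihykkv": 3,
                "css-1c07csa": 4, "css-1dc0bv4": 5}

# Hypothetical markup, shaped the way the loop above expects it.
SAMPLE_HTML = """
<div class="tooltipContainer">
  <ul>
    <li><div>Work/Life Balance</div><div class="css-1ihykkv"></div></li>
    <li><div>Culture &amp; Values</div><div class="css-1dc0bv4"></div></li>
    <li><div>Unknown</div><div class="css-zzzzzzz"></div></li>
  </ul>
</div>
"""


def parse_sub_ratings(html):
    """Return {sub-category name: stars} for one review's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="tooltipContainer")
    if container is None:
        return {}  # no sub-ratings in this review
    ratings = {}
    for li in container.find_all("li"):
        name, stars = None, None
        for div in li.find_all("div"):
            classes = div.get("class")
            if classes:
                # Use the first class name that is a known star class, if any
                stars = next((STAR_CLASSES[c] for c in classes
                              if c in STAR_CLASSES), stars)
            else:
                name = div.get_text(strip=True)
        # Keep only entries with both a name and a recognised rating
        if name and stars is not None:
            ratings[name] = stars
    return ratings


print(parse_sub_ratings(SAMPLE_HTML))
```

With Selenium, the string passed in would be `review.get_attribute('innerHTML')` for each review element. Entries whose class name is not in the mapping are skipped rather than raising a `KeyError`.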
