Why does my scraping code work with all pages except this one?

Question

I am scraping reviews from a website using Python and BeautifulSoup. The code below works for every company in my sample except McDonald's. When I run it for that page, I get len_reviews = 0.

Any idea what might cause the problem?

Thanks!

# -*- coding: utf-8 -*-

#Python3.x
import urllib
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup
import csv, re, sys, unicodedata

csvfile=open('indeed_scrape.csv', 'w', encoding='utf-8', errors='replace')
writer=csv.writer(csvfile)

list_url= ["https://www.indeed.com/cmp/McDonald's/reviews?fcountry=US"]


for url in list_url:
    base_url_parts = urllib.parse.urlparse(url)
    while True:
        raw_html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(raw_html, "lxml")

        review_tag = {'class': re.compile("cmp-Review-content")}
        reviews = soup.find_all(attrs=review_tag)
        len_reviews = len(reviews)

Answer 1

What are some of the other samples in your set that you do get hits from?

I looked at raw_html, and the string "cmp-Review-content" does not appear anywhere in it, which is why BeautifulSoup can't find any matching tags.

In [12]: 'cmp-Review-content' in str(raw_html)
Out[12]: False

The raw_html appears to hold the reviews as one giant JSON dict under reviewsList, so you might have to parse them out of that instead.
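
A minimal sketch of that idea, assuming the page embeds its data as inline JSON with a "reviewsList" key somewhere in the HTML. The helper name extract_embedded_json, the sample markup, and the "items" and "text" field names are all made up for illustration; inspect the real raw_html to see what the structure actually looks like.

```python
import json
import re


def extract_embedded_json(html_text, key):
    """Return the JSON value stored under `key` in inline JSON, or None."""
    # Find `"key":` in the page text, then let the decoder read exactly
    # one JSON value starting right after the colon.
    match = re.search(r'"%s"\s*:\s*' % re.escape(key), html_text)
    if match is None:
        return None
    try:
        value, _end = json.JSONDecoder().raw_decode(html_text, match.end())
    except json.JSONDecodeError:
        return None
    return value


# Hypothetical stand-in for raw_html.decode("utf-8"); the real page layout
# and field names are assumptions and will likely differ.
sample_html = (
    '<script>window._initialData = {"reviewsList": {"items": ['
    '{"text": "Great place"}, {"text": "Busy shifts"}]}};</script>'
)

reviews_list = extract_embedded_json(sample_html, "reviewsList")
texts = [item["text"] for item in reviews_list["items"]]
print(len(texts))
```

With the real page you would pass raw_html.decode("utf-8") instead of sample_html. raw_decode is used here because it stops at the end of the first complete JSON value, so the rest of the HTML after the dict does not need to be trimmed off first.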
