
Why does my scraping code work with all pages except this one?

I am scraping reviews from a website using Python and BeautifulSoup. The code below works for every company in my sample except McDonald's. When I run it against the McDonald's page, I get len_reviews = 0.

Any idea what might cause the problem?

Thanks!

# -*- coding: utf-8 -*-

# Python 3.x
import csv
import re
import sys
import unicodedata
import urllib
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup

csvfile = open('indeed_scrape.csv', 'w', encoding='utf-8', errors='replace')
writer = csv.writer(csvfile)

list_url = ["https://www.indeed.com/cmp/McDonald's/reviews?fcountry=US"]

for url in list_url:
    base_url_parts = urllib.parse.urlparse(url)
    while True:
        # Download the page and parse it with the lxml parser
        raw_html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(raw_html, "lxml")

        # Match every element whose class contains "cmp-Review-content"
        review_tag = {'class': re.compile("cmp-Review-content")}
        reviews = soup.find_all(attrs=review_tag)
        len_reviews = len(reviews)

What are some of the other samples in your set that you do get hits from?

I looked at raw_html, and the string "cmp-Review-content" does not appear anywhere in it, which is why BS4 can't find any matching elements.

In [12]: 'cmp-Review-content' in str(raw_html)
Out[12]: False

The raw_html appears to contain a giant JSON dict under reviewsList, so you might have to parse the reviews out of that instead.
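As a rough sketch of that idea, the snippet below locates a key such as reviewsList in the page text and decodes the JSON value that follows it with the standard library's `json.JSONDecoder.raw_decode`. The helper name `extract_json_value`, the sample HTML, and the `items` / `text` field names are all made up for illustration; the real page layout is not shown in this thread and may differ (for example, the JSON could be escaped inside a string, in which case this approach would not work as written).

```python
import json


def extract_json_value(html_text, key):
    """Return the decoded JSON value that follows '"key":' in html_text, or None.

    Hypothetical helper: assumes the JSON is embedded unescaped in the page.
    """
    marker = '"%s":' % key
    start = html_text.find(marker)
    if start == -1:
        return None

    # Move past the key and any whitespace before the value starts
    pos = start + len(marker)
    while pos < len(html_text) and html_text[pos].isspace():
        pos += 1

    try:
        # raw_decode parses one JSON value starting at pos and ignores what follows
        value, _end = json.JSONDecoder().raw_decode(html_text, pos)
    except json.JSONDecodeError:
        return None
    return value


# Made-up stand-in for the real page; the actual structure is an assumption
sample_html = (
    '<script>window._initialData = {"reviewsList": '
    '{"items": [{"text": "Great place"}, {"text": "Busy shifts"}]}};</script>'
)

reviews_list = extract_json_value(sample_html, "reviewsList")
review_texts = [item["text"] for item in reviews_list["items"]]
```

With the real page you would pass `raw_html.decode("utf-8", errors="replace")` instead of `sample_html`, then inspect the returned dict to find where the review text actually lives.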
