Why does my scraping code work for every page except this one?

Question

I am scraping reviews from a website using Python and BeautifulSoup. The code below works for every company in my sample except McDonald's. When I run it on the McDonald's page, I get len_reviews = 0.

Any idea what might cause the problem?

Thanks!

# -*- coding: utf-8 -*-

#Python3.x
import urllib
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup
import csv, re, sys, unicodedata

# newline='' is what the csv module recommends for files passed to csv.writer
csvfile = open('indeed_scrape.csv', 'w', encoding='utf-8', errors='replace', newline='')
writer = csv.writer(csvfile)

list_url= ["https://www.indeed.com/cmp/McDonald's/reviews?fcountry=US"]


for url in list_url:
    base_url_parts = urllib.parse.urlparse(url)
    while True:
        # Download the page and parse it with the lxml backend.
        raw_html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(raw_html, "lxml")

        # Match any element whose class contains "cmp-Review-content".
        review_tag = {'class': re.compile("cmp-Review-content")}
        reviews = soup.find_all(attrs=review_tag)
        len_reviews = len(reviews)

Answer 1

Which of the other samples in your set do return hits?

I looked at raw_html and the string "cmp-Review-content" does not appear anywhere in the HTML, which is why BS4 can't find any matching elements.

In [12]: 'cmp-Review-content' in str(raw_html)
Out[12]: False
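
The same check can be run for several markers at once. This is only a minimal sketch: the helper name find_markers and the sample bytes are made up for illustration, and it decodes the bytes first rather than calling str() on them, so the search runs on the page text rather than on the bytes repr.

```python
def find_markers(raw_html: bytes, markers):
    """Report which marker strings occur in the downloaded page."""
    # Decode the response bytes; undecodable bytes are replaced, not fatal.
    text = raw_html.decode("utf-8", errors="replace")
    return {marker: (marker in text) for marker in markers}


# Hypothetical stand-in for the bytes returned by urlopen(url).read().
sample = b'<html><script>{"reviewsList": {}}</script></html>'
result = find_markers(sample, ["cmp-Review-content", "reviewsList"])
```

If a marker your selector relies on comes back False, the selector cannot match no matter how BeautifulSoup is called.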

The raw_html looks like one giant JSON dict under reviewsList, so you might have to parse the reviews out of that instead.
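
One way to pull such an embedded JSON value out of the page is sketched below, using only the standard library. It rests on assumptions: the key is taken to appear literally as "reviewsList" followed by a JSON value, and the sample HTML, the helper name extract_embedded_json, and the "items" / "text" fields are invented for illustration. The real page's structure has to be inspected before relying on any of them.

```python
import json
import re


def extract_embedded_json(html, key="reviewsList"):
    """Return the JSON value stored under `key` inside `html`, or None."""
    decoder = json.JSONDecoder()
    # Locate '"key":' and start decoding right after the colon.
    pattern = re.compile(r'"%s"\s*:\s*' % re.escape(key))
    for match in pattern.finditer(html):
        try:
            # raw_decode parses one JSON value and ignores trailing text.
            value, _end = decoder.raw_decode(html, match.end())
        except json.JSONDecodeError:
            continue  # not valid JSON at this occurrence; try the next one
        return value
    return None


# Hypothetical page fragment; the field names are assumptions.
sample_html = (
    '<html><script>window._initialData = {"reviewsList": {"items": '
    '[{"text": "Great place"}, {"text": "Busy shifts"}]}};</script></html>'
)
reviews = extract_embedded_json(sample_html)
review_texts = [item["text"] for item in reviews["items"]]
```

With the real page you would pass raw_html.decode("utf-8") instead of sample_html and adjust the field names to whatever the decoded dict actually contains.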

Answer 1
0 · Accepted · 2020-12-18 13:51:19