Why does my scraping code work with all pages except this one?
I am scraping reviews from a website using Python and BeautifulSoup. The code below works for every company in my sample except McDonald's.
When I run it for McDonald's, I get len_reviews = 0.
Any idea what might cause the problem?
Thanks!
# -*- coding: utf-8 -*-
# Python 3.x
import urllib
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup
import csv, re, sys, unicodedata

csvfile = open('indeed_scrape.csv', 'w', encoding='utf-8', errors='replace')
writer = csv.writer(csvfile)
list_url = ["https://www.indeed.com/cmp/McDonald's/reviews?fcountry=US"]

for url in list_url:
    base_url_parts = urllib.parse.urlparse(url)
    while True:
        raw_html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(raw_html, "lxml")
        review_tag = {'class': re.compile("cmp-Review-content")}
        reviews = soup.find_all(attrs=review_tag)
        len_reviews = len(reviews)
What are some of the other samples in your set that you do get hits from?
I looked at raw_html and there is no "cmp-Review-content" in the HTML, which is why BS4 can't find any.
In [12]: 'cmp-Review-content' in str(raw_html)
Out[12]: False
The raw_html looks to be a giant JSON dict under reviewsList, so you might have to parse the reviews out of that instead.
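As a rough sketch of that idea, the snippet below pulls the JSON value that follows a "reviewsList" key out of a page string and decodes it with the standard json module. The helper name extract_json_value, the sample_html string, and the "items" / "text" fields are all made up for illustration; the real page may nest or escape its data differently, so inspect raw_html first. Since raw_html from urlopen(...).read() is bytes, it would need something like raw_html.decode("utf-8") before being passed in.

```python
import json
import re


def extract_json_value(html, key):
    """Return the decoded JSON value that follows `"key":` in html, or None.

    Hypothetical helper: it assumes the data is embedded as plain JSON,
    not as an escaped string inside JavaScript.
    """
    match = re.search(r'"%s"\s*:\s*' % re.escape(key), html)
    if match is None:
        return None
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever text comes after it.
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


# Made-up stand-in for the real page; the field names are assumptions.
sample_html = (
    '<html><body><script>window._initialData = '
    '{"reviewsList": {"items": [{"text": "Good place to start"}, '
    '{"text": "Busy shifts"}]}};</script></body></html>'
)

reviews_list = extract_json_value(sample_html, "reviewsList")
review_texts = [item["text"] for item in reviews_list["items"]]
print(len(review_texts))
```

Scanning for the key and letting raw_decode find the end of the value avoids writing a regex that has to balance nested braces, but it will pick up only the first occurrence of the key.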