BeautifulSoup is not scraping all the data

Question

I want to scrape all the French reviews from every page (807 pages) of this site: https://fr.trustpilot.com/review/www.gammvert.fr

There are 16,121 reviews in total (in French).

Here is my script:

import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np


root_url = 'https://fr.trustpilot.com/review/www.gammvert.fr'
urls = [ '{root}?page={i}'.format(root=root_url, i=i) for i in range(1,808) ]

comms = []
notes = []

for url in urls: 
    results = requests.get(url)

    soup = BeautifulSoup(results.text, "html.parser")

    commentary = soup.find_all('section', class_='review__content')

    for container in commentary:

        try:
            comm  = container.find('p', class_ = 'review-content__text').text.strip()

        except:
            comm = container.find('a', class_ = 'link link--large link--dark').text.strip()

        comms.append(comm)

        note = container.find('div', class_ = 'star-rating star-rating--medium').find('img')['alt']
        notes.append(note)

data = pd.DataFrame({
    'comms' : comms,
    'notes' : notes
    })

data['comms'] = data['comms'].str.replace('\n', '')


#print(data.head())
data.to_csv('file.csv', sep=';', index=False)

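Unfortunately, this script only gives me 7,261 reviews, as you can see here: output

I don't understand why I can't get all of the reviews. The script doesn't raise any errors, so I'm a bit lost.

Any ideas?

Thanks.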

Answer 1

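You are probably being rate-limited by the site: after 100 or so requests from the same IP address, it starts blocking you and stops sending back any data. Your program doesn't notice because

for container in commentary:
    # all the rest

does nothing once commentary is an empty list ([]), so the loop body never runs and no error is raised. You can check this by printing len(commentary) for each page.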
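
As a rough sketch of that check, the helper below logs one line per page and flags pages that came back empty or with a non-200 status. The name report_page and its parameters are hypothetical, not part of the original script; it only assumes you pass in the page URL, the response's status code, and len(commentary).

```python
def report_page(url, status_code, n_reviews, log=print):
    """Log one line per page; return True if the page looks blocked or empty.

    Hypothetical helper: `n_reviews` is meant to be len(commentary) and
    `status_code` is meant to be results.status_code from the scraping loop.
    """
    suspicious = status_code != 200 or n_reviews == 0
    message = f"{url}: HTTP {status_code}, {n_reviews} review blocks"
    if suspicious:
        message += "  <-- check this page"
    log(message)
    return suspicious
```

Inside the question's loop this could be called as report_page(url, results.status_code, len(commentary)) right after the find_all line, which should show at which page the results start coming back empty.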

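You can look up the site's rate limit and add a matching time.sleep() inside the loop. Alternatively, check whether results.status_code == 200 (comparing results with the string '<Response [200]>' is always False, because results is a Response object rather than a string), and if it is not, call time.sleep() for several minutes before sending the next request.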
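
The wait-and-retry idea could look something like the sketch below. The function name fetch_with_retry, the retry count, and the wait time are all assumptions chosen for illustration; the site's real limits are not known here, so the numbers would need tuning. The `get` argument is any callable that returns an object with a status_code attribute, such as requests.get.

```python
import time


def fetch_with_retry(url, get, max_tries=5, wait_seconds=60, sleep=time.sleep):
    """Call get(url) until it returns HTTP 200, pausing between attempts.

    Returns the successful response, or None if every attempt failed.
    All defaults are illustrative guesses, not values published by the site.
    """
    for attempt in range(1, max_tries + 1):
        response = get(url)
        if response.status_code == 200:
            return response
        # Blocked or rate-limited (for example HTTP 429): wait longer
        # after each failed attempt before trying again.
        if attempt < max_tries:
            sleep(wait_seconds * attempt)
    return None
```

In the question's loop, results = requests.get(url) would become results = fetch_with_retry(url, requests.get), followed by a check that results is not None before parsing. Passing `get` and `sleep` as parameters is only a convenience so the function can be tried out without touching the network.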
