
BeautifulSoup doesn't scrape all the data

I want to scrape all the French comments on this website, across all 807 pages: https://fr.trustpilot.com/review/www.gammvert.fr

There are 16,121 comments in total (in French).

Here's my script:

import requests
from bs4 import BeautifulSoup
import pandas as pd


root_url = 'https://fr.trustpilot.com/review/www.gammvert.fr'
urls = [ '{root}?page={i}'.format(root=root_url, i=i) for i in range(1,808) ]

comms = []
notes = []

for url in urls: 
    results = requests.get(url)

    soup = BeautifulSoup(results.text, "html.parser")

    commentary = soup.find_all('section', class_='review__content')

    for container in commentary:

        try:
            comm = container.find('p', class_='review-content__text').text.strip()

        except AttributeError:
            # Some reviews have no body text, only a title link.
            comm = container.find('a', class_='link link--large link--dark').text.strip()

        comms.append(comm)

        note = container.find('div', class_ = 'star-rating star-rating--medium').find('img')['alt']
        notes.append(note)

data = pd.DataFrame({
    'comms' : comms,
    'notes' : notes
    })

data['comms'] = data['comms'].str.replace('\n', '')


#print(data.head())
data.to_csv('file.csv', sep=';', index=False)

But unfortunately, this script only got me 7,261 comments.

I don't see why I can't obtain all the comments. The script doesn't raise any error whatsoever, so I'm kind of lost.

Any ideas?

Thanks.

You are likely being rate-limited by the website: after 100+ calls from the same IP address, it starts blocking you and stops sending data back. Your program doesn't notice, because

for container in commentary:
    # all the rest

does nothing, since commentary is an empty list at that point. You can check this by printing len(commentary) for each page.
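To see what that check looks like, here is a minimal sketch of the parsing step run against a small inline HTML string (a hypothetical stand-in for results.text; a blocked response would simply contain no review sections):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for results.text. On a blocked or
# empty response, find_all would return [] and the inner loop would
# silently do nothing.
html = '<section class="review__content"><p>ok</p></section>'
soup = BeautifulSoup(html, "html.parser")
commentary = soup.find_all('section', class_='review__content')
print(len(commentary))  # 0 here would mean no reviews were parsed
```

Printing results.status_code alongside this count per page will show exactly where the site starts refusing you.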

You can check on the website what the rate limit is and add a time.sleep() in your loop accordingly. Alternatively, check that results.status_code == 200 (comparing results against the string '<Response [200]>' will not work), and if it isn't, time.sleep() for several minutes before the next request call.
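A minimal sketch of that retry idea, assuming a hypothetical helper fetch_page and an arbitrary pause of 60 seconds (tune both to the site's actual rate limit):

```python
import time
import requests

def fetch_page(url, max_retries=3, pause=60):
    # Retry with a pause whenever the server answers with anything
    # other than HTTP 200 (e.g. 429 when rate-limited).
    # Returns the page HTML, or None after max_retries failures.
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        time.sleep(pause)
    return None
```

In the scraping loop you would then call html = fetch_page(url), skip the page (or log it for a later pass) when it returns None, and feed html to BeautifulSoup as before.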
