I want to scrape all the French comments on this website, across all 807 pages: https://fr.trustpilot.com/review/www.gammvert.fr
There are 16,121 comments in total (in French).
Here's my script:
import requests
from bs4 import BeautifulSoup
import pandas as pd

root_url = 'https://fr.trustpilot.com/review/www.gammvert.fr'
urls = ['{root}?page={i}'.format(root=root_url, i=i) for i in range(1, 808)]

comms = []
notes = []

for url in urls:
    results = requests.get(url)
    soup = BeautifulSoup(results.text, "html.parser")
    commentary = soup.find_all('section', class_='review__content')
    for container in commentary:
        # Some reviews have no body text, only a title link.
        try:
            comm = container.find('p', class_='review-content__text').text.strip()
        except AttributeError:
            comm = container.find('a', class_='link link--large link--dark').text.strip()
        comms.append(comm)
        note = container.find('div', class_='star-rating star-rating--medium').find('img')['alt']
        notes.append(note)

data = pd.DataFrame({
    'comms': comms,
    'notes': notes
})
data['comms'] = data['comms'].str.replace('\n', '')
#print(data.head())
data.to_csv('file.csv', sep=';', index=False)
But unfortunately, this script only got me 7,261 comments.
I don't see why I can't obtain all the comments. The script doesn't give me any errors whatsoever, so I'm kind of lost.
Any ideas?
Thanks.
You are likely being rate-limited by the website: after 100+ calls from the same IP address, they start blocking you and stop sending back any data. Your program doesn't notice, because

for container in commentary:
    # all the rest

does nothing when commentary is an empty list at that point. You can verify this by printing len(commentary) for each page.

Check on the website what the rate limit is and add a time.sleep() to your loop accordingly. Alternatively, check that results.status_code == 200 (comparing the response object to the string '<Response [200]>' will never be true; use the status_code attribute instead) and, if it isn't, time.sleep() for several minutes before retrying the request.
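Here is a minimal sketch of that idea, reusing the urls list from the question. The max_retries, backoff, and per-page delay values are illustrative assumptions, not Trustpilot's documented limits:

import time
import requests
from bs4 import BeautifulSoup

root_url = 'https://fr.trustpilot.com/review/www.gammvert.fr'
urls = ['{root}?page={i}'.format(root=root_url, i=i) for i in range(1, 808)]

def fetch_page(url, max_retries=3, backoff_seconds=300):
    """Fetch one page, sleeping and retrying if the server stops responding."""
    for attempt in range(max_retries):
        results = requests.get(url)
        if results.status_code == 200:
            return BeautifulSoup(results.text, 'html.parser')
        # Assumed backoff: wait a few minutes before retrying the same page.
        time.sleep(backoff_seconds)
    return None

for url in urls:
    soup = fetch_page(url)
    if soup is None:
        print('giving up on', url)
        continue
    commentary = soup.find_all('section', class_='review__content')
    print(url, len(commentary))  # 0 here means you are still being blocked
    # ... extract comms and notes from each container as in the question ...
    time.sleep(1)  # assumed polite delay between pages; tune to the real limit

With this pattern nothing fails silently: every non-200 response triggers a visible retry, and a page that still fails after max_retries is reported rather than dropped.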