I want to scrape a website and its sub-pages, but it is taking too long. How can I optimize the requests, or is there an alternative solution?
Below is the code I am using. It takes 10 s just to load the Google home page, so it clearly won't scale if I give it 280 links.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time

# prepare the options for the Chrome driver
options = webdriver.ChromeOptions()
options.add_argument('--headless')

# start the Chrome browser
# (the positional driver path and chrome_options= are deprecated in Selenium 4;
# use service= and options= instead)
browser = webdriver.Chrome(
    service=Service('/usr/lib/chromium-browser/chromedriver'),
    options=options,
)

start = time.time()
browser.get('http://www.google.com/xhtml')
print(time.time() - start)
browser.quit()
Use the Python requests and BeautifulSoup modules:
import requests
from bs4 import BeautifulSoup

url = "https://tajinequiparle.com/dictionnaire-francais-arabe-marocain/"
url1 = "https://tajinequiparle.com/dictionnaire-francais-arabe-marocain/{}/"

# the letter A lives at the base URL
req = requests.get(url, verify=False)
soup = BeautifulSoup(req.text, 'html.parser')
print("Letters : A")
print([item['href'] for item in soup.select('.columns-list a[href]')])

# every other letter has its own sub-page
letters = ['B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
           'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
for letter in letters:
    req = requests.get(url1.format(letter), verify=False)
    soup = BeautifulSoup(req.text, 'html.parser')
    print('Letters : ' + letter)
    print([item['href'] for item in soup.select('.columns-list a[href]')])
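The per-letter requests above still run one after another, so total time is the sum of 26 round trips. A sketch of the same scrape with a shared `Session` (connection reuse) and a thread pool so the requests overlap; `letter_urls`, `scrape_links`, and `scrape_all_letters` are hypothetical helper names, not part of any library:

```python
from concurrent.futures import ThreadPoolExecutor
import string
import requests
from bs4 import BeautifulSoup

BASE = "https://tajinequiparle.com/dictionnaire-francais-arabe-marocain/"

def letter_urls(base, letters=string.ascii_uppercase):
    # 'A' lives at the base URL itself; every other letter has its own sub-page
    return [base if letter == "A" else base + letter + "/" for letter in letters]

def scrape_links(url, session):
    # the shared Session reuses TCP connections instead of opening one per page
    resp = session.get(url, verify=False, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    return [a["href"] for a in soup.select(".columns-list a[href]")]

def scrape_all_letters(base=BASE, workers=8):
    # fetch the 26 letter pages concurrently instead of sequentially
    with requests.Session() as session, ThreadPoolExecutor(max_workers=workers) as pool:
        urls = letter_urls(base)
        return dict(zip(urls, pool.map(lambda u: scrape_links(u, session), urls)))
```

Calling `scrape_all_letters()` returns a dict mapping each letter page to its links. With 8 workers the 26 fetches overlap, so wall-clock time approaches the slowest few responses rather than the sum of all of them; note that sharing one `Session` across threads is common practice but not formally guaranteed thread-safe by requests.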
For speed, you can use a multi-threaded crawler; it is faster than all of the approaches above:
https://edmundmartin.com/multi-threaded-crawler-in-python/
After that, you must change this part of the code:
# also needs: import pandas as pd  /  from queue import Empty
def run_scraper(self):
    with open("francais-arabe-marocain.csv", 'a') as file:
        file.write("url\n")
        for i in range(50000):
            try:
                target_url = self.to_crawl.get(timeout=600)
                if target_url not in self.scraped_pages and "francais-arabe-marocain" in target_url:
                    self.scraped_pages.add(target_url)
                    job = self.pool.submit(self.scrape_page, target_url)
                    job.add_done_callback(self.post_scrape_callback)
                    df = pd.DataFrame([{'url': target_url}])
                    df.to_csv(file, index=False, header=False)
                    print(target_url)
            except Empty:
                return
            except Exception as e:
                print(e)
                continue
If a URL includes "francais-arabe-marocain", it is saved to a CSV file.
After that, you can scrape those URLs in a single for loop, reading the CSV line by line in the same way.
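That second pass could look like this. A minimal sketch: `read_urls` and `scrape_all` are hypothetical helper names, and the CSV layout is the one written by `run_scraper` above (a `url` header, then one URL per row).

```python
import csv
import requests
from bs4 import BeautifulSoup

def read_urls(csv_path):
    # CSV written by run_scraper: a "url" header row, then one URL per row
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)          # skip the header row
        return [row[0] for row in reader if row]

def scrape_all(csv_path):
    # fetch every saved URL with one shared Session (connection reuse)
    with requests.Session() as session:
        for url in read_urls(csv_path):
            resp = session.get(url, verify=False, timeout=10)
            soup = BeautifulSoup(resp.text, "html.parser")
            yield url, [a["href"] for a in soup.select(".columns-list a[href]")]
```

`scrape_all` is a generator, so results stream out as each page finishes instead of accumulating everything in memory first.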
Try urllib, like this:
import urllib.request
import time

start = time.time()
page = urllib.request.urlopen("https://google.com/xhtml")
print(time.time() - start)
It took only 2 s. However, it also depends on the quality of your connection.