
Scraping a webpage taking too long using Selenium, BeautifulSoup

I want to scrape a website and its sub-pages, but it is taking too long. How can I optimize the requests, or is there an alternative approach?

Below is the code I am using. It takes 10 s just to load the Google home page, so it's clearly not scalable if I were to give it 280 links.

from selenium import webdriver
import time

# prepare the options for the chrome driver
options = webdriver.ChromeOptions()
options.add_argument('headless')

# start chrome browser
browser = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver", chrome_options=options)

# time a single page load
start = time.time()
browser.get('http://www.google.com/xhtml')
print(time.time() - start)
browser.quit()

Use the Python requests and Beautiful Soup modules.

import requests
from bs4 import BeautifulSoup

url = "https://tajinequiparle.com/dictionnaire-francais-arabe-marocain/"
url1 = "https://tajinequiparle.com/dictionnaire-francais-arabe-marocain/{}/"

# the "A" entries live on the main dictionary page
req = requests.get(url, verify=False)
soup = BeautifulSoup(req.text, 'html.parser')
print("Letters : A")
print([item['href'] for item in soup.select('.columns-list a[href]')])

# the remaining letters each have their own sub-page
letters = ['B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
           'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']

for letter in letters:
    req = requests.get(url1.format(letter), verify=False)
    soup = BeautifulSoup(req.text, 'html.parser')
    print('Letters : ' + letter)
    print([item['href'] for item in soup.select('.columns-list a[href]')])
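
If you have to fetch many pages (for example the 280 links mentioned in the question), it also helps to reuse a single requests.Session and fetch the letter pages concurrently. A minimal sketch along those lines, assuming the same page structure and .columns-list selector as above (the worker count is just an example):

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

url1 = "https://tajinequiparle.com/dictionnaire-francais-arabe-marocain/{}/"
letters = ['B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
           'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']

session = requests.Session()  # reuse one connection pool instead of reconnecting for every request

def fetch_links(letter):
    # fetch one letter page and return the links found on it
    req = session.get(url1.format(letter), verify=False)
    soup = BeautifulSoup(req.text, 'html.parser')
    return letter, [item['href'] for item in soup.select('.columns-list a[href]')]

# fetch several letter pages in parallel
with ThreadPoolExecutor(max_workers=8) as pool:
    for letter, links in pool.map(fetch_links, letters):
        print('Letters : ' + letter)
        print(links)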

You can use the script below for speed; a multi-threaded crawler beats everything above:

https://edmundmartin.com/multi-threaded-crawler-in-python/

After that, you must change the run_scraper method like this:

def run_scraper(self):
    # append every matching URL to a CSV file while the crawl runs
    with open("francais-arabe-marocain.csv", 'a') as file:
        file.write("url\n")
        for i in range(50000):
            try:
                target_url = self.to_crawl.get(timeout=600)
                if target_url not in self.scraped_pages and "francais-arabe-marocain" in target_url:
                    self.scraped_pages.add(target_url)
                    job = self.pool.submit(self.scrape_page, target_url)
                    job.add_done_callback(self.post_scrape_callback)
                    # record the URL so it can be re-read and scraped later
                    df = pd.DataFrame([{'url': target_url}])
                    df.to_csv(file, index=False, header=False)
                    print(target_url)
            except Empty:
                return
            except Exception as e:
                print(e)
                continue
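
Note that run_scraper is a method of the multi-threaded crawler class from the linked article, so it relies on pieces defined elsewhere in that class: self.to_crawl (a queue of URLs), self.pool (a thread pool), self.scraped_pages (a set), the scrape_page / post_scrape_callback methods, plus `from queue import Empty` and `import pandas as pd`. A rough skeleton of that surrounding class, written here only as an assumption of how the linked crawler is laid out:

from queue import Queue, Empty
from concurrent.futures import ThreadPoolExecutor
import pandas as pd
import requests

class MultiThreadScraper:

    def __init__(self, base_url):
        self.base_url = base_url
        self.to_crawl = Queue()                        # URLs waiting to be crawled
        self.to_crawl.put(base_url)
        self.scraped_pages = set()                     # URLs already submitted
        self.pool = ThreadPoolExecutor(max_workers=20)

    def scrape_page(self, url):
        # download one page; runs inside the thread pool
        return requests.get(url, timeout=30)

    def post_scrape_callback(self, res):
        # parse the finished response and put newly found links on self.to_crawl
        ...

    # run_scraper (shown above) goes here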

If the URL contains "francais-arabe-marocain", the URLs are saved to a CSV file. (Screenshot of the resulting CSV file.)

After that, you can scrape those URLs in a single for loop, reading the CSV line by line in the same way.
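
A minimal sketch of that second pass, assuming the francais-arabe-marocain.csv file written above (a "url" header followed by one URL per line):

import csv
import requests
from bs4 import BeautifulSoup

with open("francais-arabe-marocain.csv", newline='') as file:
    reader = csv.reader(file)
    next(reader)  # skip the "url" header row
    for row in reader:
        if not row:
            continue
        target_url = row[0]
        req = requests.get(target_url, verify=False)
        soup = BeautifulSoup(req.text, 'html.parser')
        # extract whatever you need from each entry page here
        print(target_url, len(req.text))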

Try using urllib, like this:

import urllib.request
import time

# time a single page fetch without starting a browser
start = time.time()
page = urllib.request.urlopen("https://google.com/xhtml")
print(time.time() - start)

It took only 2 s. However, it also depends on the quality of the connection you have.
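
The gain comes from the fact that urllib (like requests) only downloads the HTML and does not start a browser or execute JavaScript, so it is only suitable for pages whose content is present in the raw HTML. If that is enough, the response can be fed straight into BeautifulSoup, reusing the same selector as in the answer above (a sketch, not tested against the site):

import urllib.request
from bs4 import BeautifulSoup

page = urllib.request.urlopen("https://tajinequiparle.com/dictionnaire-francais-arabe-marocain/")
soup = BeautifulSoup(page.read(), 'html.parser')
print([item['href'] for item in soup.select('.columns-list a[href]')])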
