
Scraping a webpage taking too long using Selenium, BeautifulSoup

I want to scrape a website and its sub-pages, but it is taking too long. How can I optimize the requests, or is there an alternative approach?

Below is the code I am using. It takes 10 s just to load the Google home page, so it's clearly not scalable if I were to give it 280 links.

from selenium import webdriver
import time

# prepare the options for the chrome driver
options = webdriver.ChromeOptions()
options.add_argument('headless')

# start chrome browser
browser = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver", chrome_options=options)

# time a single page load
start = time.time()
browser.get('http://www.google.com/xhtml')
print(time.time() - start)
browser.quit()

Use the Python requests and Beautiful Soup modules.

import requests
from bs4 import BeautifulSoup

url = "https://tajinequiparle.com/dictionnaire-francais-arabe-marocain/"
url1 = "https://tajinequiparle.com/dictionnaire-francais-arabe-marocain/{}/"

# the "A" entries live on the main dictionary page
req = requests.get(url, verify=False)
soup = BeautifulSoup(req.text, 'html.parser')
print("Letters : A")
print([item['href'] for item in soup.select('.columns-list a[href]')])

# the remaining letters each have their own sub-page
letters = ['B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
           'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']

for letter in letters:
    req = requests.get(url1.format(letter), verify=False)
    soup = BeautifulSoup(req.text, 'html.parser')
    print('Letters : ' + letter)
    print([item['href'] for item in soup.select('.columns-list a[href]')])
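
If you have to fetch many pages (for example the 280 links mentioned in the question), it also helps to reuse a single requests.Session and fetch the letter pages concurrently. A minimal sketch along those lines, assuming the same page structure and .columns-list selector as above (the worker count is just an example):

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

url1 = "https://tajinequiparle.com/dictionnaire-francais-arabe-marocain/{}/"
letters = ['B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
           'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']

session = requests.Session()  # reuse one connection pool instead of reconnecting for every request

def fetch_links(letter):
    # fetch one letter page and return the links found on it
    req = session.get(url1.format(letter), verify=False)
    soup = BeautifulSoup(req.text, 'html.parser')
    return letter, [item['href'] for item in soup.select('.columns-list a[href]')]

# fetch several letter pages in parallel
with ThreadPoolExecutor(max_workers=8) as pool:
    for letter, links in pool.map(fetch_links, letters):
        print('Letters : ' + letter)
        print(links)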

You can use the script below for speed; a multi-threaded crawler beats everything above:

https://edmundmartin.com/multi-threaded-crawler-in-python/

After that, you must change the run_scraper method like this:

def run_scraper(self):
    # append every matching URL to a CSV file while the crawl runs
    with open("francais-arabe-marocain.csv", 'a') as file:
        file.write("url\n")
        for i in range(50000):
            try:
                target_url = self.to_crawl.get(timeout=600)
                if target_url not in self.scraped_pages and "francais-arabe-marocain" in target_url:
                    self.scraped_pages.add(target_url)
                    job = self.pool.submit(self.scrape_page, target_url)
                    job.add_done_callback(self.post_scrape_callback)
                    # record the URL so it can be re-read and scraped later
                    df = pd.DataFrame([{'url': target_url}])
                    df.to_csv(file, index=False, header=False)
                    print(target_url)
            except Empty:
                return
            except Exception as e:
                print(e)
                continue
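
Note that run_scraper is a method of the multi-threaded crawler class from the linked article, so it relies on pieces defined elsewhere in that class: self.to_crawl (a queue of URLs), self.pool (a thread pool), self.scraped_pages (a set), the scrape_page / post_scrape_callback methods, plus `from queue import Empty` and `import pandas as pd`. A rough skeleton of that surrounding class, written here only as an assumption of how the linked crawler is laid out:

from queue import Queue, Empty
from concurrent.futures import ThreadPoolExecutor
import pandas as pd
import requests

class MultiThreadScraper:

    def __init__(self, base_url):
        self.base_url = base_url
        self.to_crawl = Queue()                        # URLs waiting to be crawled
        self.to_crawl.put(base_url)
        self.scraped_pages = set()                     # URLs already submitted
        self.pool = ThreadPoolExecutor(max_workers=20)

    def scrape_page(self, url):
        # download one page; runs inside the thread pool
        return requests.get(url, timeout=30)

    def post_scrape_callback(self, res):
        # parse the finished response and put newly found links on self.to_crawl
        ...

    # run_scraper (shown above) goes here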

If the URL contains "francais-arabe-marocain", the URLs are saved to a CSV file. (Screenshot of the resulting CSV file.)

After that, you can scrape those URLs in a single for loop, reading the CSV line by line in the same way.
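
A minimal sketch of that second pass, assuming the francais-arabe-marocain.csv file written above (a "url" header followed by one URL per line):

import csv
import requests
from bs4 import BeautifulSoup

with open("francais-arabe-marocain.csv", newline='') as file:
    reader = csv.reader(file)
    next(reader)  # skip the "url" header row
    for row in reader:
        if not row:
            continue
        target_url = row[0]
        req = requests.get(target_url, verify=False)
        soup = BeautifulSoup(req.text, 'html.parser')
        # extract whatever you need from each entry page here
        print(target_url, len(req.text))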

Try using urllib, like this:

import urllib.request
import time

# time a single page fetch without starting a browser
start = time.time()
page = urllib.request.urlopen("https://google.com/xhtml")
print(time.time() - start)

It took only 2 s. However, it also depends on the quality of the connection you have.
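
The gain comes from the fact that urllib (like requests) only downloads the HTML and does not start a browser or execute JavaScript, so it is only suitable for pages whose content is present in the raw HTML. If that is enough, the response can be fed straight into BeautifulSoup, reusing the same selector as in the answer above (a sketch, not tested against the site):

import urllib.request
from bs4 import BeautifulSoup

page = urllib.request.urlopen("https://tajinequiparle.com/dictionnaire-francais-arabe-marocain/")
soup = BeautifulSoup(page.read(), 'html.parser')
print([item['href'] for item in soup.select('.columns-list a[href]')])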
