
BeautifulSoup taking too much time to execute in the code

I am trying to scrape the website https://media.info/newspapers/titles, which has a list of newspapers from A to Z. I first have to scrape all the newspaper URLs and then scrape more information from each newspaper's page.

Below is my code to scrape the URLs of all the newspapers, from A to Z:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()  # assumes a Chrome setup; any Selenium WebDriver works
driver.get('https://media.info/newspapers/titles')
time.sleep(2)

# collect the links to the A-Z letter pages
page_title = []
pages = driver.find_elements(By.XPATH, "//div[@class='pages']//a")
for page in pages:
    page_title.append(page.get_attribute("href"))

# visit each letter page and collect every newspaper's URL
names = []
for letter_url in page_title:
    driver.get(letter_url)
    time.sleep(1)

    anchors = driver.find_elements(By.XPATH, "//div[@class='info thumbBlock']//a")
    for a in anchors:
        names.append(a.get_attribute("href"))

len(names)
1688

names[0:5]
['https://media.info/newspapers/titles/abergavenny-chronicle',
 'https://media.info/newspapers/titles/abergavenny-free-press',
 'https://media.info/newspapers/titles/abergavenny-gazette-diary',
 'https://media.info/newspapers/titles/the-abingdon-herald',
 'https://media.info/newspapers/titles/academies-week']

Moving further, I need to scrape information such as the owner, postal address, email, etc., so I wrote the code below.

import requests
from bs4 import BeautifulSoup

test = []
c = 0
for url in names:
    driver.get(url)
    time.sleep(2)

    # the same page is fetched a second time with requests so BeautifulSoup can parse it
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    
    try:
        name = driver.find_element(By.XPATH,"//*[@id='mainpage']/article/div[3]/h1").text

        try:
            twitter = driver.find_element(By.XPATH,"//*[@id='mainpage']/article/table[3]/tbody/tr/td[1]/a").text
        except:
            twitter = None

        try:
            twitter_followers = driver.find_element(By.XPATH,"//*[@id='mainpage']/article/table[3]/tbody/tr/td[1]/small").text.replace(' followers','').lstrip('(').rstrip(')')
        except:
            twitter_followers = None
            
        people = []
        try:
            persons = driver.find_elements(By.XPATH,"//div[@class='columns']")
            for person in persons:
                people.append(person.text)
        except:
            people.append(None)

        try:
            owner = soup.select_one('th:contains("Owner") + td').text
        except:
            owner = None

        try:
            postal_address = soup.select_one('th:contains("Postal address") + td').text
        except:
            postal_address = None

        try:
            Telephone = soup.select_one('th:contains("Telephone") + td').text
        except:
            Telephone = None

        try:
            company_website = soup.select_one('th:contains("Official website") + td > a').get('href')
        except:
            company_website = None

        try:
            main_email = soup.select_one('th:contains("Main email") + td').text
        except:
            main_email = None

        try:
            personal_email = soup.select_one('th:contains("Personal email") + td').text
        except:
            personal_email = None

        # guard the whole lookup: company_website may be None, and the page may lack the meta tag
        try:
            r2 = requests.get(company_website)
            soup2 = BeautifulSoup(r2.content, 'lxml')
            is_wordpress = soup2.find("meta", {"name": "generator"}).get('content')
        except:
            is_wordpress = None

        news_Data = {
                    "Name": name,
                    "Owner": owner,
                    "Postal Address": postal_address,
                    "main Email":main_email,
                    "Telephone": Telephone, 
                    "Personal Email": personal_email,
                    "Company Wesbite": company_website,
                    "Twitter_Handle": twitter,
                    "Twitter_Followers": twitter_followers,
                    "People":people,
                    "Is Wordpress?":is_wordpress
                    }

        test.append(news_Data)
        c += 1
        print("completed", c)

    except Exception:
        print(f"There is an exception with {url}")

I am using both Selenium and BeautifulSoup (with requests) to scrape the data. The code fulfills the requirements.

  1. Firstly, is it good practice to use Selenium and BeautifulSoup together in the same code like this?
  2. Secondly, the code is taking too much time. Is there an alternative way to reduce its runtime?

BeautifulSoup is not slow: making requests and waiting for responses is slow.

You do not necessarily need a selenium/chromedriver setup for this task; it's doable with requests (or another Python HTTP library).

Yes, there are ways to speed it up, but keep in mind you are making requests to a server, which might become overwhelmed if you send too many requests at once, or which might block you.
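Even single-threaded, reusing one requests.Session (so the TCP connection is kept alive instead of being re-established per page) plus a short pause between requests keeps the load on the server predictable. A minimal sketch, where urls stands in for whatever list of pages you are fetching and the half-second delay is an arbitrary polite choice:

import time
import requests

s = requests.Session()   # one connection reused across all requests
for url in urls:         # hypothetical list of pages to fetch
    r = s.get(url)
    # ... parse r.text with BeautifulSoup here ...
    time.sleep(0.5)      # arbitrary polite delay between requests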

Here is an example without selenium which will accomplish what you're after:

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
from tqdm import tqdm

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
    }

s = requests.Session()
s.headers.update(headers)

r = s.get('https://media.info/newspapers/titles')
soup = bs(r.text, 'lxml')  # name the parser explicitly instead of letting bs4 guess
letter_links = [x.get('href') for x in soup.select_one('div.pages').select('a')]
newspaper_links = []
for x in tqdm(letter_links):
    soup = bs(s.get(x).text, 'lxml')
    ns_links = soup.select_one('div.columns').select('a')
    for n in ns_links:
        newspaper_links.append((n.get_text(strip=True), 'https://media.info/' + n.get('href')))

detailed_infos = []
for x in tqdm(newspaper_links[:50]):
    soup = bs(s.get(x[1]).text, 'lxml')
    # look the header cell up once, then read its sibling cell if it exists
    owner_th = soup.select_one('th:contains("Owner")')
    owner = owner_th.next_sibling.select_one('a').get_text(strip=True) if owner_th else None
    website_th = soup.select_one('th:contains("Official website")')
    website = website_th.next_sibling.select_one('a').get_text(strip=True) if website_th else None

    detailed_infos.append((x[0], x[1], owner, website))
df = pd.DataFrame(detailed_infos, columns = ['Newspaper', 'Info Url', 'Owner', 'Official website'])
print(df)

Result in terminal:

Newspaper   Info Url    Owner   Official website
0   Abergavenny Chronicle   https://media.info//newspapers/titles/abergavenny-chronicle Tindle Newspapers   abergavenny-chronicle-today.co.uk
1   Abergavenny Free Press  https://media.info//newspapers/titles/abergavenny-free-press    Newsquest Media Group   freepressseries.co.uk
2   Abergavenny Gazette & Diary https://media.info//newspapers/titles/abergavenny-gazette-diary Tindle Newspapers   abergavenny-chronicle-today.co.uk/tn/index.cfm
3   The Abingdon Herald https://media.info//newspapers/titles/the-abingdon-herald   Newsquest Media Group   abingdonherald.co.uk
4   Academies Week  https://media.info//newspapers/titles/academies-week    None    academiesweek.co.uk
5   Accrington Observer https://media.info//newspapers/titles/accrington-observer   Reach plc   accringtonobserver.co.uk
6   Addlestone and Byfleet Review   https://media.info//newspapers/titles/addlestone-and-byfleet-review Reach plc   woking.co.uk
7   Admart & North Devon Diary  https://media.info//newspapers/titles/admart-north-devon-diary  Tindle Newspapers   admart.me.uk
8   AdNews Willenhall, Wednesbury and Darlaston https://media.info//newspapers/titles/adnews-willenhall-wednesbury-and-darlaston    Reach plc   reachplc.com
9   The Advertiser  https://media.info//newspapers/titles/the-advertiser    DMGT    dmgt.co.uk
10  Aintree and Maghull Champion    https://media.info//newspapers/titles/aintree-and-maghull-champion  Champion Media group    champnews.com
11  Airdrie & Coatbridge World  https://media.info//newspapers/titles/airdrie-coatbridge-world  Reach plc   icLanarkshire.co.uk
12  Airdrie and Coatbridge Advertiser   https://media.info//newspapers/titles/airdrie-and-coatbridge-advertiser Reach plc   acadvertiser.co.uk
13  Aire Valley Target  https://media.info//newspapers/titles/aire-valley-target    Newsquest Media Group   thisisbradford.co.uk
14  Alcester Chronicle  https://media.info//newspapers/titles/alcester-chronicle    Newsquest Media Group   redditchadvertiser.co.uk/news/alcester
15  Alcester Standard   https://media.info//newspapers/titles/alcester-standard Bullivant Media redditchstandard.co.uk
16  Aldershot Courier   https://media.info//newspapers/titles/aldershot-courier Guardian Media Group    aldershot.co.uk
17  Aldershot Mail  https://media.info//newspapers/titles/aldershot-mail    Guardian Media Group    aldershot.co.uk
18  Aldershot News & Mail   https://media.info//newspapers/titles/aldershot-news-mail   Reach plc   gethampshire.co.uk/aldershot
19  Alford Standard https://media.info//newspapers/titles/alford-standard   JPI Media   skegnessstandard.co.uk
20  Alford Target   https://media.info//newspapers/titles/alford-target DMGT    dmgt.co.uk
21  Alfreton and Ripley Echo    https://media.info//newspapers/titles/alfreton-and-ripley-echo  JPI Media   jpimedia.co.uk
22  Alfreton Chad   https://media.info//newspapers/titles/alfreton-chad JPI Media   chad.co.uk
23  All at Sea  https://media.info//newspapers/titles/all-at-sea    None    allatsea.co.uk
24  Allanwater News https://media.info//newspapers/titles/allanwater-news   HUB Media   allanwaternews.co.uk
25  Alloa & Hillfoots Shopper   https://media.info//newspapers/titles/alloa-hillfoots-shopper   Reach plc   reachplc.com
26  Alloa & Hillfoots Advertiser    https://media.info//newspapers/titles/alloa-hillfoots-advertiser    Dunfermline Press Group alloaadvertiser.com
27  Alloa and Hillfoots Wee County News https://media.info//newspapers/titles/alloa-and-hillfoots-wee-county-news   HUB Media   wee-county-news.co.uk
28  Alton Diary https://media.info//newspapers/titles/alton-diary   Tindle Newspapers   tindlenews.co.uk
29  Andersonstown News  https://media.info//newspapers/titles/andersonstown-news    Belfast Media Group irelandclick.com
30  Andover Advertiser  https://media.info//newspapers/titles/andover-advertiser    Newsquest Media Group   andoveradvertiser.co.uk
31  Anfield and Walton Star https://media.info//newspapers/titles/anfield-and-walton-star   Reach plc   icliverpool.co.uk
32  The Anglo-Celt  https://media.info//newspapers/titles/the-anglo-celt    None    anglocelt.ie
33  Annandale Herald    https://media.info//newspapers/titles/annandale-herald  Dumfriesshire Newspaper Group   dng24.co.uk
34  Annandale Observer  https://media.info//newspapers/titles/annandale-observer    Dumfriesshire Newspaper Group   dng24.co.uk
35  Antrim Times    https://media.info//newspapers/titles/antrim-times  JPI Media   antrimtoday.co.uk
36  Arbroath Herald https://media.info//newspapers/titles/arbroath-herald   JPI Media   arbroathherald.com
37  The Arden Observer  https://media.info//newspapers/titles/the-arden-observer    Bullivant Media ardenobserver.co.uk
38  Ardrossan & Saltcoats Herald    https://media.info//newspapers/titles/ardrossan-saltcoats-herald    Newsquest Media Group   ardrossanherald.com
39  The Argus   https://media.info//newspapers/titles/the-argus Newsquest Media Group   theargus.co.uk
40  Argyllshire Advertiser  https://media.info//newspapers/titles/argyllshire-advertiser    Oban Times Group    argyllshireadvertiser.co.uk
41  Armthorpe Community Newsletter  https://media.info//newspapers/titles/armthorpe-community-newsletter    JPI Media   jpimedia.co.uk
42  The Arran Banner    https://media.info//newspapers/titles/the-arran-banner  Oban Times Group    arranbanner.co.uk
43  The Arran Voice https://media.info//newspapers/titles/the-arran-voice   Independent News Ltd    voiceforarran.com
44  The Art Newspaper   https://media.info//newspapers/titles/the-art-newspaper None    theartnewspaper.com
45  Ashbourne News Telegraph    https://media.info//newspapers/titles/ashbourne-news-telegraph  Reach plc   ashbournenewstelegraph.co.uk
46  Ashby Echo  https://media.info//newspapers/titles/ashby-echo    Reach plc   reachplc.com
47  Ashby Mail  https://media.info//newspapers/titles/ashby-mail    DMGT    thisisleicestershire.co.uk
48  Ashfield Chad   https://media.info//newspapers/titles/ashfield-chad JPI Media   chad.co.uk
49  Ashford Adscene https://media.info//newspapers/titles/ashford-adscene   DMGT    thisiskent.co.uk

You can extract more information for each newspaper as you wish - the above is just an example, going through the first 50 newspapers. Now if you want a multithreaded/async solution, I recommend you read the following and apply it to your own scenario: BeautifulSoup getting href of a list - need to simplify the script - replace multiprocessing
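For illustration, a minimal multithreaded sketch of the detail-fetching loop above, using only the standard library's concurrent.futures. It reuses newspaper_links, the bs alias, and the session s from the example; fetch_detail is a hypothetical helper, and max_workers is kept small on purpose so the server is not flooded:

from concurrent.futures import ThreadPoolExecutor

def fetch_detail(link):
    # note: requests.Session is not formally thread-safe; plain GETs usually work,
    # but use one session per thread if you want to be strict
    name, url = link
    soup = bs(s.get(url).text, 'lxml')
    owner_th = soup.select_one('th:contains("Owner")')
    owner = owner_th.next_sibling.select_one('a').get_text(strip=True) if owner_th else None
    return (name, url, owner)

with ThreadPoolExecutor(max_workers=5) as executor:
    detailed_infos = list(executor.map(fetch_detail, newspaper_links[:50]))

executor.map preserves input order, so the results line up with newspaper_links exactly as in the sequential loop.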

Lastly, the Requests docs can be found here: https://requests.readthedocs.io/en/latest/

BeautifulSoup docs: https://beautiful-soup-4.readthedocs.io/en/latest/index.html

For tqdm: https://pypi.org/project/tqdm/

Another option is to skip the index page and build the per-letter URLs directly, since they follow the pattern /newspapers/titles/starting-with/<letter>:

import string

import requests
from bs4 import BeautifulSoup

names = []
for letter in string.ascii_lowercase:
    page = requests.get("https://media.info/newspapers/titles/starting-with/{}".format(letter))
    soup = BeautifulSoup(page.content, "html.parser")
    for a in soup.find_all("a", href=True):
        # keep only links to individual newspaper pages, not the letter navigation
        if a['href'].startswith("/newspapers/titles/") and "starting-with" not in a['href']:
            names.append(a['href'])
