[英]Beautifulsoup taking too much time to execute in the code
I am trying to scrape a website:- https://media.info/newspapers/titles This website has a list of newspapers from A to Z. I first have to scrape all the URLs and then scrape some more information from each newspaper.我正在尝试抓取一个网站:- https://media.info/newspapers/titles该网站有一个从 A 到 Z 的报纸列表。我首先必须抓取所有 URL,然后从每份报纸上抓取更多信息。
Below is my code to scrape the URLs of all the newspapers starting from A to Z:-以下是我从 A 到 Z 抓取所有报纸的 URL 的代码:-
driver.get('https://media.info/newspapers/titles')
time.sleep(2)
page_title = []
pages = driver.find_elements(By.XPATH,"//div[@class='pages']//a")
for i in pages:
page_title.append(i.get_attribute("href"))
names = []
for i in page_title:
driver.get(i)
time.sleep(1)
name = driver.find_elements(By.XPATH,"//div[@class='info thumbBlock']//a")
for i in name:
names.append(i.get_attribute("href"))
len(names):-> 1688 len(名称):-> 1688
names[0:5]
['https://media.info/newspapers/titles/abergavenny-chronicle',
'https://media.info/newspapers/titles/abergavenny-free-press',
'https://media.info/newspapers/titles/abergavenny-gazette-diary',
'https://media.info/newspapers/titles/the-abingdon-herald',
'https://media.info/newspapers/titles/academies-week']
moving further I need to scrape some information like owner, postal_Address, email, etc and I wrote the below code.进一步移动我需要抓取一些信息,如所有者、postal_Address、email 等,我编写了以下代码。
test = []
c = 0
for i in names:
driver.get(i)
time.sleep(2)
r = requests.get(i)
soup = BeautifulSoup(r.content,'lxml')
try:
name = driver.find_element(By.XPATH,"//*[@id='mainpage']/article/div[3]/h1").text
try:
twitter = driver.find_element(By.XPATH,"//*[@id='mainpage']/article/table[3]/tbody/tr/td[1]/a").text
except:
twitter = None
try:
twitter_followers = driver.find_element(By.XPATH,"//*[@id='mainpage']/article/table[3]/tbody/tr/td[1]/small").text.replace(' followers','').lstrip('(').rstrip(')')
except:
twitter_followers = None
people = []
try:
persons = driver.find_elements(By.XPATH,"//div[@class='columns']")
for i in persons:
people.append(i.text)
except:
people.append(None)
try:
owner = soup.select_one('th:contains("Owner") + td').text
except:
owner = None
try:
postal_address = soup.select_one('th:contains("Postal address") + td').text
except:
postal_address = None
try:
Telephone = soup.select_one('th:contains("Telephone") + td').text
except:
Telephone = None
try:
company_website = soup.select_one('th:contains("Official website") + td > a').get('href')
except:
company_website = None
try:
main_email = soup.select_one('th:contains("Main email") + td').text
except:
main_email = None
try:
personal_email = soup.select_one('th:contains("Personal email") + td').text
except:
personal_email = None
r2 = requests.get(company_website)
soup2 = BeautifulSoup(r2.content,'lxml')
try:
is_wordpress = soup2.find("meta",{"name":"generator"}).get('content')
except:
is_wordpress = None
news_Data = {
"Name": name,
"Owner": owner,
"Postal Address": postal_address,
"main Email":main_email,
"Telephone": Telephone,
"Personal Email": personal_email,
"Company Wesbite": company_website,
"Twitter_Handle": twitter,
"Twitter_Followers": twitter_followers,
"People":people,
"Is Wordpress?":is_wordpress
}
test.append(news_Data)
c=c+1
print("completed",c)
except Exception as Argument:
print(f"There is an exception with {i}")
pass
I am using both Selenium and BesutifulSoup with requests to scrape the data.我同时使用 Selenium 和 BesutifulSoup 请求抓取数据。 The code is fulfilling the requirements.
代码满足要求。
BeautifulSoup is not slow: making requests and waiting for responses is slow. BeautifulSoup 并不慢:发出请求和等待响应很慢。
You do not necessarily need selenium/chromedriver setup for this task, it's doable with requests (or other python library).您不一定需要为此任务设置 selenium/chromedriver,它可以通过请求(或其他 python 库)来完成。
Yes, there are ways to speed it up, however keep in mind you are making requests to a server, which might become overwhelmed if you send too many requests at once, or it might block you.是的,有一些方法可以加快速度,但是请记住,您正在向服务器发出请求,如果您一次发送太多请求,服务器可能会变得不堪重负,或者它可能会阻止您。
Here is an example without selenium, which will accomplish what you're after:这是一个没有 selenium 的示例,它将完成您所追求的:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
from tqdm import tqdm
headers = {
'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}
s = requests.Session()
s.headers.update(headers)
r = s.get('https://media.info/newspapers/titles')
soup = bs(r.text)
letter_links = [x.get('href') for x in soup.select_one('div.pages').select('a')]
newspaper_links = []
for x in tqdm(letter_links):
soup = bs(s.get(x).text)
ns_links = soup.select_one('div.columns').select('a')
for n in ns_links:
newspaper_links.append((n.get_text(strip=True), 'https://media.info/' + n.get('href')))
detailed_infos = []
for x in tqdm(newspaper_links[:50]):
soup = bs(s.get(x[1]).text)
owner = soup.select_one('th:contains("Owner")').next_sibling.select_one('a').get_text(strip=True) if soup.select_one('th:contains("Owner")') else None
website = soup.select_one('th:contains("Official website")').next_sibling.select_one('a').get_text(strip=True) if soup.select_one('th:contains("Official website")') else None
detailed_infos.append((x[0], x[1], owner, website))
df = pd.DataFrame(detailed_infos, columns = ['Newspaper', 'Info Url', 'Owner', 'Official website'])
print(df)
Result in terminal:结果在终端:
Newspaper Info Url Owner Official website
0 Abergavenny Chronicle https://media.info//newspapers/titles/abergavenny-chronicle Tindle Newspapers abergavenny-chronicle-today.co.uk
1 Abergavenny Free Press https://media.info//newspapers/titles/abergavenny-free-press Newsquest Media Group freepressseries.co.uk
2 Abergavenny Gazette & Diary https://media.info//newspapers/titles/abergavenny-gazette-diary Tindle Newspapers abergavenny-chronicle-today.co.uk/tn/index.cfm
3 The Abingdon Herald https://media.info//newspapers/titles/the-abingdon-herald Newsquest Media Group abingdonherald.co.uk
4 Academies Week https://media.info//newspapers/titles/academies-week None academiesweek.co.uk
5 Accrington Observer https://media.info//newspapers/titles/accrington-observer Reach plc accringtonobserver.co.uk
6 Addlestone and Byfleet Review https://media.info//newspapers/titles/addlestone-and-byfleet-review Reach plc woking.co.uk
7 Admart & North Devon Diary https://media.info//newspapers/titles/admart-north-devon-diary Tindle Newspapers admart.me.uk
8 AdNews Willenhall, Wednesbury and Darlaston https://media.info//newspapers/titles/adnews-willenhall-wednesbury-and-darlaston Reach plc reachplc.com
9 The Advertiser https://media.info//newspapers/titles/the-advertiser DMGT dmgt.co.uk
10 Aintree and Maghull Champion https://media.info//newspapers/titles/aintree-and-maghull-champion Champion Media group champnews.com
11 Airdrie & Coatbridge World https://media.info//newspapers/titles/airdrie-coatbridge-world Reach plc icLanarkshire.co.uk
12 Airdrie and Coatbridge Advertiser https://media.info//newspapers/titles/airdrie-and-coatbridge-advertiser Reach plc acadvertiser.co.uk
13 Aire Valley Target https://media.info//newspapers/titles/aire-valley-target Newsquest Media Group thisisbradford.co.uk
14 Alcester Chronicle https://media.info//newspapers/titles/alcester-chronicle Newsquest Media Group redditchadvertiser.co.uk/news/alcester
15 Alcester Standard https://media.info//newspapers/titles/alcester-standard Bullivant Media redditchstandard.co.uk
16 Aldershot Courier https://media.info//newspapers/titles/aldershot-courier Guardian Media Group aldershot.co.uk
17 Aldershot Mail https://media.info//newspapers/titles/aldershot-mail Guardian Media Group aldershot.co.uk
18 Aldershot News & Mail https://media.info//newspapers/titles/aldershot-news-mail Reach plc gethampshire.co.uk/aldershot
19 Alford Standard https://media.info//newspapers/titles/alford-standard JPI Media skegnessstandard.co.uk
20 Alford Target https://media.info//newspapers/titles/alford-target DMGT dmgt.co.uk
21 Alfreton and Ripley Echo https://media.info//newspapers/titles/alfreton-and-ripley-echo JPI Media jpimedia.co.uk
22 Alfreton Chad https://media.info//newspapers/titles/alfreton-chad JPI Media chad.co.uk
23 All at Sea https://media.info//newspapers/titles/all-at-sea None allatsea.co.uk
24 Allanwater News https://media.info//newspapers/titles/allanwater-news HUB Media allanwaternews.co.uk
25 Alloa & Hillfoots Shopper https://media.info//newspapers/titles/alloa-hillfoots-shopper Reach plc reachplc.com
26 Alloa & Hillfoots Advertiser https://media.info//newspapers/titles/alloa-hillfoots-advertiser Dunfermline Press Group alloaadvertiser.com
27 Alloa and Hillfoots Wee County News https://media.info//newspapers/titles/alloa-and-hillfoots-wee-county-news HUB Media wee-county-news.co.uk
28 Alton Diary https://media.info//newspapers/titles/alton-diary Tindle Newspapers tindlenews.co.uk
29 Andersonstown News https://media.info//newspapers/titles/andersonstown-news Belfast Media Group irelandclick.com
30 Andover Advertiser https://media.info//newspapers/titles/andover-advertiser Newsquest Media Group andoveradvertiser.co.uk
31 Anfield and Walton Star https://media.info//newspapers/titles/anfield-and-walton-star Reach plc icliverpool.co.uk
32 The Anglo-Celt https://media.info//newspapers/titles/the-anglo-celt None anglocelt.ie
33 Annandale Herald https://media.info//newspapers/titles/annandale-herald Dumfriesshire Newspaper Group dng24.co.uk
34 Annandale Observer https://media.info//newspapers/titles/annandale-observer Dumfriesshire Newspaper Group dng24.co.uk
35 Antrim Times https://media.info//newspapers/titles/antrim-times JPI Media antrimtoday.co.uk
36 Arbroath Herald https://media.info//newspapers/titles/arbroath-herald JPI Media arbroathherald.com
37 The Arden Observer https://media.info//newspapers/titles/the-arden-observer Bullivant Media ardenobserver.co.uk
38 Ardrossan & Saltcoats Herald https://media.info//newspapers/titles/ardrossan-saltcoats-herald Newsquest Media Group ardrossanherald.com
39 The Argus https://media.info//newspapers/titles/the-argus Newsquest Media Group theargus.co.uk
40 Argyllshire Advertiser https://media.info//newspapers/titles/argyllshire-advertiser Oban Times Group argyllshireadvertiser.co.uk
41 Armthorpe Community Newsletter https://media.info//newspapers/titles/armthorpe-community-newsletter JPI Media jpimedia.co.uk
42 The Arran Banner https://media.info//newspapers/titles/the-arran-banner Oban Times Group arranbanner.co.uk
43 The Arran Voice https://media.info//newspapers/titles/the-arran-voice Independent News Ltd voiceforarran.com
44 The Art Newspaper https://media.info//newspapers/titles/the-art-newspaper None theartnewspaper.com
45 Ashbourne News Telegraph https://media.info//newspapers/titles/ashbourne-news-telegraph Reach plc ashbournenewstelegraph.co.uk
46 Ashby Echo https://media.info//newspapers/titles/ashby-echo Reach plc reachplc.com
47 Ashby Mail https://media.info//newspapers/titles/ashby-mail DMGT thisisleicestershire.co.uk
48 Ashfield Chad https://media.info//newspapers/titles/ashfield-chad JPI Media chad.co.uk
49 Ashford Adscene https://media.info//newspapers/titles/ashford-adscene DMGT thisiskent.co.uk
You can extract more information for each newspaper, as you wish - the above is just an example, going through the first 50 newspapers.您可以根据需要为每份报纸提取更多信息 - 以上只是一个示例,通过前 50 份报纸。 Now if you want a multithreaded/async solution, I recommend you read the following, and apply it to your own scenario: BeautifulSoup getting href of a list - need to simplify the script - replace multiprocessing
现在,如果您想要一个多线程/异步解决方案,我建议您阅读以下内容,并将其应用于您自己的场景: BeautifulSoup 获取列表的 href - 需要简化脚本 - 替换多处理
Lastly, Requests docs can be found here: https://requests.readthedocs.io/en/latest/最后,可以在此处找到请求文档: https://requests.readthedocs.io/en/latest/
BeautifulSoup docs: https://beautiful-soup-4.readthedocs.io/en/latest/index.html BeautifulSoup 文档: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
For TQDM: https://pypi.org/project/tqdm/对于 TQDM: https://pypi.org/project/tqdm/
names = []
for letter in string.ascii_lowercase:
page = requests.get("https://media.info/newspapers/titles/starting-with/{}".format(letter))
soup = BeautifulSoup(page.content, "html.parser")
for i in soup.find_all("a"):
if i['href'].startswith("/newspapers/titles/"):
names.append(i['href'])
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.