Download PDF using Selenium in Python + save each PDF with an assigned name
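Hi all,
The goal is to 1) scrape a batch of PDFs for a batch of companies and 2) save each one under the corresponding company name, all from https://www1.hkexnews.hk/app/appyearlyindex.html?lang=en&board=mainBoard .
My code already downloads the PDFs, thanks to this snippet that makes Chrome download them automatically: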
chrome_options = Options()
chrome_options.add_experimental_option('prefs', {
"download.default_directory": "/Users/XXX/Downloads", #Change default directory for downloads
"download.prompt_for_download": False, #To auto download the file
"download.directory_upgrade": True,
"plugins.always_open_pdf_externally": True #It will not show PDF directly in chrome
})
I still need to save each PDF under its company's name rather than under an arbitrary PDF filename. The company names can be scraped with:
all_names = driver.find_elements_by_xpath("//div[@class='applicant-name']")
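But how can the full code below be modified to include a loop that saves each PDF under its company name (rather than an arbitrary filename)? Any help would be appreciated: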
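from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

chrome_options = Options()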
chrome_options.add_experimental_option('prefs', {
"download.default_directory": "/Users/XXX/Downloads", #Change default directory for downloads
"download.prompt_for_download": False, #To auto download the file
"download.directory_upgrade": True,
"plugins.always_open_pdf_externally": True #It will not show PDF directly in chrome
})
year = str(input("Please enter the year for which you want to download the Application Proofs: "))
link = "https://www1.hkexnews.hk/app/appyearlyindex.html?lang=en&board=mainBoard&year=" + year
print("Now loading: ", link)
print("Found the following companies: ")
driver = webdriver.Chrome('/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/chromedriver',options=chrome_options)
wait = WebDriverWait(driver,10)
driver.get(link)
all_proofs = driver.find_elements_by_xpath("//tr[@class='record-ap-phip']//a[contains(.,'Full Version')]")
all_names = driver.find_elements_by_xpath("//div[@class='applicant-name']")
for i in all_names:
    print('---> ', i.text)
print("\nTotal number of proofs in year ",year,": ",len(all_proofs))
Y = 0
N = 0
for proof in all_proofs:
    try:
        proof.click()
        wait.until(EC.presence_of_element_located((By.XPATH, "//*[@id='warning-statement-dialog']//label[@for='warning-statement-accept']"))).click()
        wait.until(EC.presence_of_element_located((By.XPATH, "//*[@id='warning-statement-dialog']//a[contains(@class,'btn-ok')]"))).click()
        Y += 1
    except Exception as exc:
        exception = f'An exception occurred.'
        N += 1
print("Number of application proofs downloaded: ", Y)
print("Number of exceptions: ", N)
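One possible way to fit the renaming into a loop like the one above is to snapshot the download folder before each click, wait for a new finished PDF to appear, and then rename it. The sketch below is only an illustration under assumptions: the helper names `wait_for_new_pdf` and `rename_download` are hypothetical, it assumes Chrome keeps an in-progress download under a temporary `.crdownload` name so that a new `*.pdf` means the download has finished, and it assumes downloads complete one at a time.

```python
import re
import time
from pathlib import Path


def wait_for_new_pdf(download_dir, seen, timeout=60.0, poll=0.5):
    """Return the path of a finished PDF in download_dir whose name is not in `seen`.

    Only *.pdf files are considered, so a partial download that still has a
    temporary name (assumed to end in .crdownload) is ignored.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = {p.name for p in Path(download_dir).glob("*.pdf")}
        new_files = current - set(seen)
        if new_files:
            return Path(download_dir) / sorted(new_files)[0]
        time.sleep(poll)
    raise TimeoutError(f"no new PDF appeared in {download_dir} within {timeout}s")


def rename_download(pdf_path, company_name):
    """Rename pdf_path to '<sanitized company name>.pdf' in the same folder."""
    # Replace anything that is not a word character, hyphen, or space.
    safe = re.sub(r"[^\w\- ]", "_", company_name).strip()
    target = pdf_path.with_name(safe + ".pdf")
    pdf_path.rename(target)
    return target


# Hypothetical use inside the click loop (DOWNLOAD_DIR is the folder set in
# "download.default_directory"; pairing names with links via zip() assumes the
# two lists line up one-to-one, which should be verified on the page):
#
# for name_el, proof in zip(all_names, all_proofs):
#     before = {p.name for p in Path(DOWNLOAD_DIR).glob("*.pdf")}
#     proof.click()
#     ...accept the warning dialog as in the loop above...
#     rename_download(wait_for_new_pdf(DOWNLOAD_DIR, before), name_el.text)
```

Note that `Path.rename` replaces an existing file of the same name on Unix-like systems and raises `FileExistsError` on Windows, so two proofs for the same company would need distinct target names.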
As RJ Adriaansen pointed out, there is a JSON file under Developer Tools - Network - Fetch/XHR that can be scraped easily without Selenium:
import requests
import re
data_url = 'https://www1.hkexnews.hk/ncms/json/eds/app_2022_sehk_e.json?_=1641899494829' #found in the Developer Tools - Network - fetch/XHR
data = requests.get(data_url).json()
for company in data['app']:
    filename = re.sub(r'[^\w\-_ ]', '_', company['a']) + '.pdf' #company name remove bad characters for filename
    pdf_url = 'https://www1.hkexnews.hk/app/' + company['ls'][0]['u1']
    pdf_data = requests.get(pdf_url)
    print(f'Saving {filename}')
    with open(filename, 'wb') as file:
        file.write(pdf_data.content)
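One thing the loop above does not handle is two entries that sanitize to the same filename, in which case the second download would overwrite the first. Whether the feed actually contains such duplicates is not confirmed here; the sketch below just shows one way to guard against it. The helper name `unique_pdf_name` and the " (2)" suffix style are assumptions made for illustration.

```python
import re


def unique_pdf_name(company_name, used):
    """Return a sanitized '<name>.pdf' that has not been handed out before.

    `used` is a set of lowercased filenames already taken; it is updated in
    place. Comparison is case-insensitive because some filesystems are.
    """
    # Same idea as the regex above: keep word characters, hyphens, and spaces.
    base = re.sub(r"[^\w\- ]", "_", company_name).strip() or "unnamed"
    candidate = f"{base}.pdf"
    counter = 2
    while candidate.lower() in used:
        candidate = f"{base} ({counter}).pdf"
        counter += 1
    used.add(candidate.lower())
    return candidate


# Hypothetical use in the download loop:
# used = set()
# for company in data['app']:
#     filename = unique_pdf_name(company['a'], used)
```

Checking `pdf_data.status_code` before writing the file would be a natural companion to this, so that an error page is not saved under a `.pdf` name.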