Web scraping - saving files to nested folders
I am downloading PDF files from different URLs using a built-in API.
My end result should be to download the files for each unique link (identified as links in the code below) into its own folder (under folder_location in the code) on the desktop.
I am quite puzzled about how to arrange the code to do this, as I am still a novice. So far I have tried the following.
import os
import requests
from glob import glob
import time
from urllib.parse import urljoin
from bs4 import BeautifulSoup

links = ["P167897", "P173997", "P166309"]
folder_location = "/pdf/"

for link, folder in zip(links, folder_location):
    time.sleep(10)
    end_point = f"https://search.worldbank.org/api/v2/wds?" \
                f"format=json&includepublicdocs=1&" \
                f"fl=docna,lang,docty,repnb,docdt,doc_authr,available_in&" \
                f"os=0&rows=20&proid={link}&apilang=en"
    documents = requests.get(end_point).json()["documents"]
    for document_data in documents.values():
        try:
            pdf_url = document_data["pdfurl"]
            filename = os.path.join(folder, pdf_url.split('/')[-1])
            with open(filename, 'wb') as f:
                f.write(requests.get(pdf_url).content)
        except KeyError:
            # Some entries have no "pdfurl" key; skip them.
            continue
EDIT: To clarify, the objects in links are IDs, which the API uses to identify the links to the PDF files.
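To see why the attempt above does not produce one folder per ID, it may help to look at what zip(links, folder_location) actually yields. The following is a small sketch that uses only the two variables from the attempt. Because folder_location is a string, zip pairs each ID with a single character of it rather than with the whole path.

```python
links = ["P167897", "P173997", "P166309"]
folder_location = "/pdf/"

# zip() walks the string one character at a time, so each ID is paired
# with one character of "/pdf/" instead of with a folder path.
pairs = list(zip(links, folder_location))
print(pairs)
# [('P167897', '/'), ('P173997', 'p'), ('P166309', 'd')]
```

So folder ends up as "/", "p" and "d" in turn, which is why the files do not land in per-ID folders.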
You could try using the pathlib module. Here's how:
import time
from pathlib import Path

import requests

links = ["P167897", "P173997", "P166309"]

for link in links:
    end_point = f"https://search.worldbank.org/api/v2/wds?" \
                f"format=json&includepublicdocs=1&" \
                f"fl=docna,lang,docty,repnb,docdt,doc_authr,available_in&" \
                f"os=0&rows=20&proid={link}&apilang=en"
    documents = requests.get(end_point).json()["documents"]
    for document_data in documents.values():
        try:
            pdf_url = document_data["pdfurl"]
            # One folder per ID: pdf/<ID>/<file name from the URL>
            file_path = Path(f"pdf/{link}/{pdf_url.rsplit('/')[-1]}")
            # Create pdf/<ID>/ if it does not exist yet.
            file_path.parent.mkdir(parents=True, exist_ok=True)
            with file_path.open("wb") as f:
                f.write(requests.get(pdf_url).content)
            # Pause between downloads.
            time.sleep(10)
        except KeyError:
            # Entries without a "pdfurl" key are skipped.
            continue
This outputs files to:
pdf/
└── P167897
    ├── Official-Documents-First-Restatement-to-the-Disbursement-Letter-for-Grant-D6810-SL-and-for-Additional-Financing-Grant-TF0B4694.pdf
    └── Official-Documents-Grant-Agreement-for-Additional-Financing-Grant-TF0B4694.pdf
...
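The folder logic can also be tried without touching the network. The sketch below is one possible way to split it out. The helper names target_path and save_bytes, the example.org URL and the placeholder bytes are all made up for illustration and are not part of the code above. It assumes the file name is the last segment of the URL path, which urlsplit isolates so that a query string would not end up in the name.

```python
import tempfile
from pathlib import Path, PurePosixPath
from urllib.parse import urlsplit


def target_path(base_dir: Path, project_id: str, pdf_url: str) -> Path:
    """Return base_dir/<project_id>/<file name taken from the URL path>."""
    # Hypothetical helper: keeps only the last path segment of the URL.
    file_name = PurePosixPath(urlsplit(pdf_url).path).name
    return base_dir / project_id / file_name


def save_bytes(path: Path, content: bytes) -> None:
    """Create the parent folders if needed, then write the file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)


# Demonstration in a temporary directory with a made-up URL and payload.
with tempfile.TemporaryDirectory() as tmp:
    url = "https://example.org/docs/Sample-Report.pdf?download=1"
    path = target_path(Path(tmp) / "pdf", "P167897", url)
    save_bytes(path, b"%PDF-1.4 placeholder")
    relative = path.relative_to(tmp).as_posix()
    saved = path.read_bytes()
    print(relative)  # pdf/P167897/Sample-Report.pdf
```

In the download loop, the bytes passed to save_bytes would be requests.get(pdf_url).content, and base_dir could point at a folder on the desktop instead of a temporary directory.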