I am trying to scrape the titles of the PDFs on this website. However, I get both the titles and the links. Why does this happen, and how can I fix it?
import numpy as np
import requests
from bs4 import BeautifulSoup

publications = []
text = []
for i in np.arange(12, 19):
    response = requests.get('https://occ.ca/our-publications/page/{}/'.format(i), headers={'User-Agent': 'Mozilla'})
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'lxml')
        pdfs = soup.findAll('div', {"class": "publicationoverlay"})
        links = [pdf.find('a').attrs['href'] for pdf in pdfs]
        publications.extend(links)
        text.extend(pdfs)
Any help would be much appreciated.
You want .text, but split on \t (to exclude the child a text) and strip the result. I use Session for efficiency.
import requests
from bs4 import BeautifulSoup
import numpy as np

publications = []
text = []
# Reuse one connection for all page requests.
with requests.Session() as s:
    for i in np.arange(12, 19):
        response = s.get('https://occ.ca/our-publications/page/{}/'.format(i), headers={'User-Agent': 'Mozilla'})
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'lxml')
            pdfs = soup.findAll('div', {"class": "publicationoverlay"})
            # Keep only the part before the first tab, which drops the child <a> text.
            text.extend([pdf.text.strip().split('\t')[0] for pdf in pdfs])
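To see why the split on \t works, here is a minimal offline sketch. The HTML below is hypothetical: it only assumes that the title inside each publicationoverlay div is separated from the link text by tab characters, which is what the answer above implies about the real page. The title and URL are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not copied from the real site):
# the title is followed by tabs and then the child <a> element.
html = (
    '<div class="publicationoverlay">\n'
    '\t\tAnnual Report 2018\t\t\t'
    '<a href="https://example.com/report.pdf">Download PDF</a>\n'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"class": "publicationoverlay"})

# .text concatenates every string inside the div, including the link text.
raw = div.text

# Strip the outer whitespace, then keep everything before the first tab.
title = raw.strip().split("\t")[0]

print(repr(raw))
print(title)
```

If the real page separates the title and link text with something other than tabs, this split would return the whole string, so it is worth checking repr(pdf.text) on one element first.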
You can also use decompose to remove the child a tags after getting the href and before getting the parent's .text.
import requests
from bs4 import BeautifulSoup
import numpy as np

publications = []
text = []
links = []
with requests.Session() as s:
    for i in np.arange(12, 19):
        response = s.get('https://occ.ca/our-publications/page/{}/'.format(i), headers={'User-Agent': 'Mozilla'})
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'lxml')
            for a in soup.select('.publicationoverlay a'):
                # Save the link, then remove the <a> tag from the tree.
                links.append(a['href'])
                a.decompose()
            # With the links removed, .text contains only the title.
            pdfs = soup.findAll('div', {"class": "publicationoverlay"})
            text.extend([pdf.text.strip() for pdf in pdfs])
print(list(zip(links, text)))
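The decompose approach can be tried without network access. This is a minimal sketch on hypothetical markup (the titles and URLs are invented). As a variation on the code above, it handles one div at a time, so a div without a link cannot shift the links and titles out of step.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not copied from the real site).
html = (
    '<div class="publicationoverlay">Annual Report 2018 '
    '<a href="https://example.com/a.pdf">Download PDF</a></div>'
    '<div class="publicationoverlay">Budget Submission '
    '<a href="https://example.com/b.pdf">Download PDF</a></div>'
)

soup = BeautifulSoup(html, "html.parser")
links = []
titles = []

for div in soup.select(".publicationoverlay"):
    a = div.find("a")
    if a is not None:
        links.append(a["href"])
        # Remove the <a> tag so its text no longer appears in the div.
        a.decompose()
    else:
        # Keep both lists the same length when a div has no link.
        links.append(None)
    titles.append(div.get_text(strip=True))

pairs = list(zip(links, titles))
print(pairs)
```

Note that decompose modifies the parsed tree in place, so the href has to be read before the tag is removed.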