
I am trying to scrape the titles from the PDFs on this website. However, I get the titles and the links. Why, and how can I fix this?

I want to scrape the titles of the PDFs on this website. However, I get the titles and the links. How can I fix this?

import requests
from bs4 import BeautifulSoup
import numpy as np

publications=[]
text=[]
for i in np.arange(12,19):
    response=requests.get('https://occ.ca/our-publications/page/{}/'.format(i), headers={'User-Agent': 'Mozilla'})

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'lxml')
        pdfs = soup.findAll('div', {"class": "publicationoverlay"})

        links = [pdf.find('a').attrs['href'] for pdf in pdfs]
        publications.extend(links)
        text.extend(pdfs)

Any help would be much appreciated.

You want the .text, but split on \t (to exclude the child a text) and stripped. I use Session for efficiency.

import requests
from bs4 import BeautifulSoup 
import numpy as np

publications=[]
text=[]

with requests.Session() as s:

    for i in np.arange(12,19):

        response= s.get('https://occ.ca/our-publications/page/{}/'.format(i), headers={'User-Agent': 'Mozilla'})

        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'lxml')
            pdfs = soup.findAll('div', {"class": "publicationoverlay"})
            text.extend([pdf.text.strip().split('\t')[0] for pdf in pdfs])
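To see why the split is needed, here is a minimal, network-free sketch using a hypothetical snippet that mimics the page's markup (the tag text and href are made up for illustration): a div's .text concatenates its own text with the text of any child a tags, so splitting on the tab that separates them isolates the title.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the publicationoverlay structure.
html = ('<div class="publicationoverlay">Annual Report 2019\t'
        '<a href="/report.pdf">Download PDF</a></div>')
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', {'class': 'publicationoverlay'})

print(div.text)  # includes the child link's text: 'Annual Report 2019\tDownload PDF'

# Split on the tab and keep only the leading part: the title.
title = div.text.strip().split('\t')[0]
print(title)
```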

You could also use decompose to remove the child a tags after getting the href and before taking the .text of the parent.

import requests
from bs4 import BeautifulSoup 
import numpy as np

publications=[]
text=[]
links = []

with requests.Session() as s:

    for i in np.arange(12,19):

        response= s.get('https://occ.ca/our-publications/page/{}/'.format(i), headers={'User-Agent': 'Mozilla'})

        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'lxml')
            for a in soup.select('.publicationoverlay a'):
                links.extend([a['href']])
                a.decompose()
            pdfs = soup.findAll('div', {"class": "publicationoverlay"})
            text.extend([pdf.text.strip() for pdf in pdfs])

print(list(zip(links, text)))
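The decompose approach can also be checked offline. This sketch uses the same hypothetical snippet as before (tag text and href invented for illustration): after recording the href and calling decompose(), the a tag and its text are gone from the tree, so the parent's .text is clean without any splitting.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the publicationoverlay structure.
html = ('<div class="publicationoverlay">Annual Report 2019 '
        '<a href="/report.pdf">Download PDF</a></div>')
soup = BeautifulSoup(html, 'html.parser')

links = []
for a in soup.select('.publicationoverlay a'):
    links.append(a['href'])
    a.decompose()  # removes the tag and its text from the parse tree

div = soup.find('div', {'class': 'publicationoverlay'})
print(links)             # ['/report.pdf']
print(div.text.strip())  # 'Annual Report 2019'
```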
