
How to get a JS redirected pdf linked from a web page

I am using requests to get web pages, for example as follows.

import requests
from bs4 import BeautifulSoup
url = "http://www.ofsted.gov.uk/inspection-reports/find-inspection-report/provider/CARE/EY298883"
r = requests.get(url)
soup = BeautifulSoup(r.text)

For each one of these pages I would like to get the first PDF that is pointed to in the section titled "Latest reports". How can I do this with Beautiful Soup?

The relevant part of the HTML is

 <tbody>
   <tr>
     <th scope="col">Latest reports</th>
     <th scope="col" class="date">Inspection <br/>date</th>
     <th scope="col" class="date">First<br/>publication<br/>date</th>
   </tr>
   <tr>
     <td><a href="/provider/files/1266031/urn/106428.pdf"><span class="icon pdf">pdf</span> Early years inspection report </a></td>
     <td class="date">12 Mar 2009</td>
     <td class="date">4 Apr 2009</td>
   </tr>
 </tbody>

The following code looks like it should work but does not.

 ofstedbase = "http://www.ofsted.gov.uk"
 for col_header in soup.findAll('th'):
     if not col_header.contents[0] == "Latest reports": continue
     for link in col_header.parent.parent.findAll('a'):
         if 'href' in link.attrs and link['href'].endswith('pdf'): break
     else:
         print '"Latest reports" PDF not found'
         break
     print '"Latest reports" PDF points at', link['href']
     p = requests.get(ofstedbase + link['href'])
     print p.content
     break

The problem is that p contains another web page rather than the PDF it should. Is there some way to get the actual PDF?
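One way to tell the two cases apart before parsing anything is to look at the response's Content-Type header, or at the PDF magic bytes. A minimal sketch; `is_pdf_response` is a hypothetical helper, not part of requests:

```python
def is_pdf_response(headers, body):
    """Heuristic check: a real PDF response advertises application/pdf,
    and PDF files start with the magic bytes %PDF-."""
    ctype = headers.get("Content-Type", "")
    return "application/pdf" in ctype or body[:5] == b"%PDF-"

# With requests this would be used as:
#   r = requests.get(url)
#   if not is_pdf_response(r.headers, r.content):
#       ...  # we got an HTML interstitial page, not the file
```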


Update:

I got it to work with one more pass of BeautifulSoup:

 import re
 souppage = BeautifulSoup(p.text)
 line = souppage.findAll('a', text=re.compile("requested"))[0]
 pdf = requests.get(ofstedbase + line['href'])

Any better/nicer solutions gratefully received.

It's not the cleanest solution, but you can iterate through the column headers until you find "Latest reports", then search that table for the first link that points at a PDF file.

for col_header in soup.findAll('th'):
    if not col_header.contents[0] == "Latest reports": continue
    for link in col_header.parent.parent.findAll('a'):
        if 'href' in link.attrs and link['href'].endswith('pdf'): break
    else:
        print '"Latest reports" PDF not found'
        break
    print '"Latest reports" PDF points at', link['href']
    break
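A slightly more direct variant of the same search is to locate the header cell by its text, walk up to the enclosing table, and take the first link ending in `.pdf`. A Python 3 sketch against the HTML fragment from the question (the exact-text match assumes the cell contains no extra whitespace):

```python
from bs4 import BeautifulSoup

html = """
<table><tbody><tr>
  <th scope="col">Latest reports</th>
  <th scope="col" class="date">Inspection <br/>date</th>
</tr><tr>
  <td><a href="/provider/files/1266031/urn/106428.pdf"><span class="icon pdf">pdf</span> Early years inspection report</a></td>
  <td class="date">12 Mar 2009</td>
</tr></tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
header = soup.find("th", string="Latest reports")  # exact text match
first_pdf = None
if header is not None:
    table = header.find_parent("table")
    # first href in the same table that ends in .pdf
    first_pdf = next(
        (a["href"] for a in table.find_all("a", href=True)
         if a["href"].endswith(".pdf")),
        None,
    )
print(first_pdf)
```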

You might try Selenium WebDriver ( python -m "easy_install" selenium ) to automatically instruct Firefox to download the file. This requires Firefox:

from selenium import webdriver
from bs4 import BeautifulSoup

profile = webdriver.FirefoxProfile()
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', ('application/pdf'))
profile.set_preference("pdfjs.previousHandler.alwaysAskBeforeHandling", False)
profile.set_preference("browser.helperApps.alwaysAsk.force", False)
profile.set_preference("browser.download.manager.showWhenStarting", False)

driver = webdriver.Firefox(firefox_profile = profile)
base_url = "http://www.ofsted.gov.uk"
driver.get(base_url + "/inspection-reports/find-inspection-report/provider/CARE/EY298883")
soup = BeautifulSoup(driver.page_source)

for col_header in soup.findAll('th'):
    if not col_header.contents[0] == "Latest reports": continue
    for link in col_header.parent.parent.findAll('a'):
        if 'href' in link.attrs and link['href'].endswith('pdf'): break
    else:
        print '"Latest reports" PDF not found'
        break
    print '"Latest reports" PDF points at', link['href']
    driver.get(base_url + link['href'])

This solution is very powerful because it can do everything a human user can, but it has drawbacks. For example, I've tried to address the issue of Firefox prompting for the download, but it doesn't work for me. Results may vary depending on your installed add-ons and Firefox version.

I got it to work with one more pass of BeautifulSoup:

 import re
 souppage = BeautifulSoup(p.text)
 line = souppage.findAll('a', text=re.compile("requested"))[0]
 pdf = requests.get(ofstedbase + line['href'])
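The two requests can be folded into one helper. A sketch with the page fetching injected as a callable, so the control flow can be exercised without hitting the live site; `resolve_pdf` and the `(content_type, body)` fetch signature are assumptions of this sketch, while the "requested" link text is the one observed on the interstitial page in the update above:

```python
import re
from bs4 import BeautifulSoup

def resolve_pdf(url, fetch, base="http://www.ofsted.gov.uk"):
    """Fetch url; if it is already a PDF, return its bytes. Otherwise
    parse the interstitial HTML page and follow the 'requested' link.
    fetch(url) must return a (content_type, body) pair."""
    ctype, body = fetch(url)
    if "application/pdf" in ctype:
        return body
    soup = BeautifulSoup(body, "html.parser")
    link = soup.find("a", string=re.compile("requested"))
    if link is None:
        return None  # no redirect link found on the page
    _, body = fetch(base + link["href"])
    return body
```

With requests, `fetch` would be a small wrapper that issues one GET and returns the Content-Type header together with `r.content`.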
