Download a pdf from a JS redirected link

Is there some way to download the following pdf from the command line?

http://www.ofsted.gov.uk/provider/files/1295389/urn/EY298883.pdf   

A simple wget http://www.ofsted.gov.uk/provider/files/1295389/urn/EY298883.pdf returns a web page. However, if you open the URL in Firefox you get a pdf.

Related to "How to get a JS redirected pdf linked from a web page", where I tried to find a Python solution.

If you don't need a universal answer that simulates a web browser and runs the JS (which is what a truly universal solution would require), but are fine with finding the download link yourself in the HTML you get back, then you can:

  1. wget the page (wget follows HTTP redirects, so this gives you the target HTML containing the JS that triggers the download)
  2. parse the HTML and find the link you're looking for
  3. wget that link
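
The three steps above can be sketched in Python using only the standard library. This is a minimal illustration, not the code from the scripts mentioned in this answer: the names `LinkTextFinder`, `find_link_by_text` and `download_by_link_text` are made up for the example, and it assumes the download link is a plain `<a href=...>` element present in the HTML the server returns.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkTextFinder(HTMLParser):
    """Collect the href of every <a> whose visible text contains `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.matches = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace before comparing.
            text = " ".join("".join(self._text).split())
            if self.wanted in text:
                self.matches.append(self._href)
            self._href = None


def find_link_by_text(html, base_url, wanted):
    """Step 2: return the absolute URL of the first matching link, or None."""
    finder = LinkTextFinder(wanted)
    finder.feed(html)
    return urljoin(base_url, finder.matches[0]) if finder.matches else None


def download_by_link_text(page_url, wanted, save_as):
    """Hypothetical helper tying the three steps together."""
    # Step 1: fetch the page; urlopen follows HTTP redirects, as wget does.
    with urlopen(page_url) as resp:
        final_url = resp.geturl()
        html = resp.read().decode("utf-8", errors="replace")
    # Step 2: locate the link in the returned HTML.
    target = find_link_by_text(html, final_url, wanted)
    if target is None:
        raise ValueError("no link with text %r found" % wanted)
    # Step 3: fetch the link itself and save it.
    with urlopen(target) as resp, open(save_as, "wb") as out:
        out.write(resp.read())
    return target
```

Only `download_by_link_text` touches the network; `find_link_by_text` can be tried on a saved copy of the page first.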

I wrote some simple scripts to do steps 2 and 3 for you at https://github.com/pjump/wgetbyCss. To use them, you need:

  • ruby
  • the mechanize gem ( gem install mechanize )

Then you can do:

 ./wget_by_link_text 'http://www.ofsted.gov.uk/filedownloading/?id=1295389&type=1&refer=1' "Please download the requested file here"

i.e.:

   ./wget_by_link_text url link_text [save_as]

This gets the link by its text. Alternatively, you can use the wget_by_css script and get the link by its .auto_click class, or by some other CSS selector.
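
For comparison, selecting the link by class rather than by text might look like the following in Python. This is a sketch under assumptions, not what wget_by_css does internally: `ClassLinkFinder` and `find_links_by_class` are illustrative names, and only a single class name is matched, not a full CSS selector.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ClassLinkFinder(HTMLParser):
    """Collect hrefs of <a> elements carrying a given class name."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # The class attribute is a whitespace-separated list of names.
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes and attrs.get("href"):
            self.hrefs.append(attrs["href"])


def find_links_by_class(html, base_url, css_class):
    """Return absolute URLs of all <a class="... css_class ..."> links."""
    finder = ClassLinkFinder(css_class)
    finder.feed(html)
    return [urljoin(base_url, href) for href in finder.hrefs]
```

Matching on a class is more robust than matching on link text if the wording of the page changes, but it breaks if the site renames the class.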

In short: you can't do it with wget / curl alone.

You could use curl -L, which tells curl to follow redirects:

 curl -L http://www.ofsted.gov.uk/provider/files/1295389/urn/EY298883.pdf

But it doesn't work here, as the curl FAQ explains:

4.14 Redirects work in browser but not with curl!

curl supports HTTP redirects fine (see item 3.8). Browsers generally support at least two other ways to perform redirects that curl does not:

Meta tags. You can write a HTML tag that will cause the browser to redirect to another given URL after a certain time.

Javascript. You can write a Javascript program embedded in a HTML page that redirects the browser to another given URL.

There is no way to make curl follow these redirects. You must either manually figure out what the page is set to do, or you write a script that parses the results and fetches the new URL.

So I think it's bad news: you will have to do it yourself in a script. See your other question for reference: "How to get a JS redirected pdf linked from a web page".
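
As a rough idea of what such a script could look for, here is a Python sketch that searches a page for the two kinds of client-side redirect the FAQ mentions. It is a heuristic under assumptions: `find_client_side_redirect` is an illustrative name, the meta tag is assumed to use the usual `content="N; url=..."` form, and the JavaScript is assumed to assign a string literal to `location`. Redirects built in any other way (for example a script that clicks a link) will not be found.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshFinder(HTMLParser):
    """Remember the URL of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag != "meta" or self.target is not None:
            return
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "5; url=/next/page".
        match = re.search(r"url\s*=\s*['\"]?([^'\";]+)",
                          attrs.get("content") or "", re.I)
        if match:
            self.target = match.group(1).strip()


# Matches e.g. window.location = "...", location.href = '...'
JS_LOCATION = re.compile(
    r"(?:window\.|document\.)?location(?:\.href)?\s*=\s*['\"]([^'\"]+)['\"]"
)


def find_client_side_redirect(html, base_url):
    """Return the absolute redirect target found in `html`, or None."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    target = finder.target
    if target is None:
        match = JS_LOCATION.search(html)
        target = match.group(1) if match else None
    return urljoin(base_url, target) if target else None
```

The returned URL can then be fetched with wget or curl -L as usual.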


Consider using seleniumhq; the queen's website seems to be a hard nut for crawlers.
