How to test if a link target is a pdf file if the link does not contain .pdf

I'm using Selenium to scrape a bunch of files that are provided in a mix of formats and styles, and I'm trying to handle both HTML and PDF. I've run into an issue when the target of a link is a PDF file but the link itself does not contain '.pdf'. This is the case for two of the links I need to handle (note that one downloads automatically and the other just displays the file, at least in Chrome, so there may also need to be a test for two different types of PDF targets?)

Is there a way to tell programmatically whether the target of a link is a PDF that is more intelligent than just checking if it ends in .pdf?

I can't just download the file regardless of its content, because I have distinct handling for HTML files: there I want to follow secondary links and see if I can find PDFs, which won't work if the target is itself a PDF.

ETA: The accepted answer worked perfectly. The linked potential duplicate is about testing a file on the file system, not a download target, so I don't think it applies, and the answer below is certainly a better fit for this situation.

Selenium (or Chrome) checks the 'Content-Type' header and chooses what to do. You can also check the 'Content-Type' of a URL yourself using requests, like below:

>>> import pprint
>>> import requests
>>> r = requests.head('https://resus.org.au/?wpfb_dl=17')
>>> pprint.pprint(dict(r.headers))
{'Accept-Ranges': 'bytes',
 'Age': '8518',
 'Cache-Control': 'no-cache, must-revalidate, max-age=0',
 'Connection': 'keep-alive',
 'Content-Description': 'File Transfer',
 'Content-Disposition': 'attachment; '
                        'filename="anzcor-guideline-6-compressions-apr-2021.pdf"',
 'Content-Length': '535677',
 'Content-Md5': '90AUQUZu0vFGJ7cBPvRxcg==',
 'Content-Security-Policy': 'upgrade-insecure-requests',
 'Content-Type': 'application/pdf',
 'Date': 'Wed, 19 Jan 2022 11:20:06 GMT',
 'Expires': 'Wed, 11 Jan 1984 05:00:00 GMT',
 'Last-Modified': 'Wed, 19 Jan 2022 08:58:08 GMT',
 'Pragma': 'no-cache',
 'Server': 'openresty',
 'Strict-Transport-Security': 'max-age=300, max-age=31536000; '
                              'includeSubDomains',
 'Vary': 'User-Agent',
 'X-Backend': 'local',
 'X-Cache': 'cached',
 'X-Cache-Hit': 'HIT',
 'X-Cacheable': 'YES:Forced',
 'X-Content-Type-Options': 'nosniff',
 'X-Xss-Protection': '1; mode=block'}

As you can see, the 'Content-Type' is 'application/pdf' for both of your links:

>>> r.headers['Content-Type']
'application/pdf'

So you can just check the value of requests.head(link).headers['Content-Type'] and do whatever you need.
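A minimal sketch of that check wrapped in helpers, in case it is useful. The function names (is_pdf_headers, is_forced_download, link_is_pdf) and the 10-second timeout are made up for illustration and are not part of requests. The header value is normalised before comparing, because a server may append parameters after a semicolon or vary the case. The Content-Disposition check is one plausible way to tell the "downloads automatically" case from the "displays in the browser" case mentioned in the question, assuming the server sets that header.

```python
def is_pdf_headers(headers):
    """Return True if the response headers declare a PDF body."""
    # Content-Type may carry parameters ("application/pdf; charset=binary")
    # and media types are case-insensitive, so normalise before comparing.
    content_type = headers.get("Content-Type", "")
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/pdf"


def is_forced_download(headers):
    """Return True if the server asks the browser to save the file."""
    # "attachment" requests a download; "inline" (or no header at all)
    # leaves the browser free to display the file itself.
    disposition = headers.get("Content-Disposition", "")
    return disposition.split(";", 1)[0].strip().lower() == "attachment"


def link_is_pdf(url, timeout=10):
    """Send a HEAD request (following redirects) and inspect the headers."""
    import requests  # third-party; imported here so the helpers above work without it

    # requests does not follow redirects on HEAD unless asked to.
    response = requests.head(url, allow_redirects=True, timeout=timeout)
    return is_pdf_headers(response.headers)
```

With this, the scraper could branch on link_is_pdf(link): download the file when it returns True, otherwise load the page in Selenium and follow its secondary links. The helpers take any mapping with a get method; response.headers from requests is case-insensitive, whereas a plain dict needs the exact key spelling.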


As of now (Jan 19 2022), the first link in your question redirects me to a 404 page. The second one is still accessible, but it has to be requested over HTTPS, by changing the start of the link from http:// to https:// .

In any case, as long as the URL doesn't redirect you to another page, this answer stays valid. If it does redirect, check status_code, and if it is 301, request the new URL instead:

>>> r = requests.head('http://resus.org.au/?wpfb_dl=17')
>>> r.status_code
301
>>> r = requests.head('https://resus.org.au/?wpfb_dl=17')
>>> r.status_code
200
>>>
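The redirect handling above could be sketched roughly as follows. The names redirect_target and final_head, the max_hops limit of 5, and the timeout are assumptions for illustration only. The sketch treats the other common redirect codes the same way as 301 and resolves a relative Location header against the current URL.

```python
from urllib.parse import urljoin

# Status codes that point to another URL via the Location header.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_target(current_url, status_code, headers):
    """Return the absolute URL to retry, or None if this is not a redirect."""
    location = headers.get("Location")
    if status_code in REDIRECT_CODES and location:
        # Location may be relative, so resolve it against the current URL.
        return urljoin(current_url, location)
    return None


def final_head(url, max_hops=5, timeout=10):
    """Follow redirects by hand and return the last HEAD response."""
    import requests  # third-party; imported lazily so redirect_target works without it

    response = requests.head(url, timeout=timeout)
    for _ in range(max_hops):
        target = redirect_target(response.url, response.status_code, response.headers)
        if target is None:
            break
        response = requests.head(target, timeout=timeout)
    return response
```

In practice, requests.head(url, allow_redirects=True) performs an equivalent loop for you, and the intermediate responses are then available in r.history. The manual version is mainly useful if you want to log or limit each hop yourself.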
