How to test whether a link target is a PDF file if the link does not contain .pdf

Question

I'm using Selenium to scrape a bunch of files that are provided in a mix of formats and styles, and I'm trying to handle both HTML and PDF. I've come across an issue when the target of a link is a PDF file but the link itself does not contain '.pdf', as with two example links I have. (Note that one downloads automatically and the other just displays the file, at least in Chrome, so there may need to be a test for two different types of PDF targets as well?)

Is there a way to tell programmatically whether the target of a link is a PDF that is more intelligent than just checking whether it ends in .pdf?

I can't just download the file regardless of its content, because I handle HTML files differently: there I want to follow secondary links and see whether I can find PDFs, which won't work if the target is itself a PDF.

ETA: The accepted answer worked perfectly. The linked potential dupe is about testing a file on the file system, not a download, so I don't think it applies, and the answer below is certainly better for this situation.

Answer 1

Selenium (or Chrome) checks the 'Content-Type' header and chooses what to do. You can also check the 'Content-Type' of a URL yourself using requests, like below:

>>> import pprint
>>> import requests
>>> r = requests.head('https://resus.org.au/?wpfb_dl=17')
>>> pprint.pprint(dict(r.headers))
{'Accept-Ranges': 'bytes',
 'Age': '8518',
 'Cache-Control': 'no-cache, must-revalidate, max-age=0',
 'Connection': 'keep-alive',
 'Content-Description': 'File Transfer',
 'Content-Disposition': 'attachment; '
                        'filename="anzcor-guideline-6-compressions-apr-2021.pdf"',
 'Content-Length': '535677',
 'Content-Md5': '90AUQUZu0vFGJ7cBPvRxcg==',
 'Content-Security-Policy': 'upgrade-insecure-requests',
 'Content-Type': 'application/pdf',
 'Date': 'Wed, 19 Jan 2022 11:20:06 GMT',
 'Expires': 'Wed, 11 Jan 1984 05:00:00 GMT',
 'Last-Modified': 'Wed, 19 Jan 2022 08:58:08 GMT',
 'Pragma': 'no-cache',
 'Server': 'openresty',
 'Strict-Transport-Security': 'max-age=300, max-age=31536000; '
                              'includeSubDomains',
 'Vary': 'User-Agent',
 'X-Backend': 'local',
 'X-Cache': 'cached',
 'X-Cache-Hit': 'HIT',
 'X-Cacheable': 'YES:Forced',
 'X-Content-Type-Options': 'nosniff',
 'X-Xss-Protection': '1; mode=block'}

As you can see, the 'Content-Type' of both of your links is 'application/pdf':

>>> r.headers['Content-Type']
'application/pdf'

So you can just check the value of requests.head(link).headers['Content-Type'] and handle the link accordingly.
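To make the check reusable, here is a minimal sketch of how the header test could be wrapped up so that HTML and PDF targets are routed to different handling. The function names (media_type, classify, is_attachment, classify_link) are hypothetical and not part of requests. The sketch assumes the server reports an accurate Content-Type; some servers label PDFs with a generic type such as application/octet-stream, in which case this would return 'other'. The Content-Disposition check is one possible way to tell the "downloads automatically" case from the "displays in the browser" case mentioned in the question.

```python
from typing import Mapping


def media_type(content_type: str) -> str:
    """Return the bare media type, dropping parameters such as charset."""
    return content_type.split(";", 1)[0].strip().lower()


def classify(headers: Mapping[str, str]) -> str:
    """Classify a response as 'pdf', 'html', or 'other' from its headers.

    With a plain dict the key lookup is case-sensitive; the headers object
    returned by requests is case-insensitive.
    """
    kind = media_type(headers.get("Content-Type", ""))
    if kind == "application/pdf":
        return "pdf"
    if kind in ("text/html", "application/xhtml+xml"):
        return "html"
    return "other"


def is_attachment(headers: Mapping[str, str]) -> bool:
    """True when Content-Disposition asks the browser to download the file."""
    disposition = headers.get("Content-Disposition", "")
    return disposition.split(";", 1)[0].strip().lower() == "attachment"


def classify_link(url: str, timeout: float = 10.0) -> str:
    """Send a HEAD request and classify the target (needs network access)."""
    import requests  # imported here so the helpers above work without it

    response = requests.head(url, timeout=timeout)
    return classify(response.headers)
```

In a scraper, the result of classify_link(link) could decide whether to save the file directly ('pdf') or to load the page in Selenium and follow its secondary links ('html').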

As of now (Jan 19 2022), the first link in your question redirects me to a 404 page. The second one is still accessible, but only over HTTPS: change the start of the link from http:// to https://.

In any case, as long as the URL does not redirect you to another page, this answer still applies. If it does redirect, check whether status_code is 301 and request the new URL instead:

>>> r = requests.head('http://resus.org.au/?wpfb_dl=17')
>>> r.status_code
301
>>> r = requests.head('https://resus.org.au/?wpfb_dl=17')
>>> r.status_code
200
>>>
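The 301 above appears because requests.head() does not follow redirects unless asked to. As an alternative to editing the URL by hand, passing allow_redirects=True lets requests follow the redirect chain itself. The sketch below shows one way to do that; resolve_target is a hypothetical helper name, and its head parameter exists only so the function can be exercised without network access.

```python
from typing import Callable, Optional, Tuple


def resolve_target(
    url: str,
    head: Optional[Callable] = None,
    timeout: float = 10.0,
) -> Tuple[str, int, str]:
    """Follow redirects with HEAD; return (final_url, status_code, content_type)."""
    if head is None:
        import requests  # imported here so a stand-in can be passed in tests

        head = requests.head
    # HEAD requests in requests stop at the first 3xx response by default,
    # so redirect following has to be enabled explicitly.
    response = head(url, allow_redirects=True, timeout=timeout)
    content_type = response.headers.get("Content-Type", "")
    return response.url, response.status_code, content_type
```

The returned final_url is the address after all redirects, so it can be stored in place of the original link, and status_code still reveals a link that ends in a 404.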

Answer 1
3 | Accepted | 2016-05-20 01:34:19