
Problem with detecting if link is invalid

Is there any way to detect if a link is invalid using webbot? I need to tell the user that the link they provided was unreachable.

The only way to be completely sure that a URL sends you to a valid page is to fetch that page and check that it works. You could try making a request other than GET (e.g. HEAD) to avoid the wasted bandwidth of downloading the page, but not all servers will respond to it: the only way to be absolutely sure is to GET and see what happens. Something like:

import requests
from requests.exceptions import ConnectionError, Timeout

def check_url(url):
    try:
        # fetch the page; give up after 1 second so we don't hang forever
        r = requests.get(url, timeout=1)
        return r.status_code == 200
    except (ConnectionError, Timeout):
        # the server couldn't be reached (or didn't answer in time)
        return False

Is this a good idea? It's only a GET request, and GET is supposed to be idempotent, so you shouldn't cause anybody any harm. On the other hand, what if a user sets up a script to add a new link every second, all pointing to the same website? Then you're DDoSing that website. So when you allow users to cause your server to do things like this, you need to think about how you might protect it. (In this case: you could keep a cache of valid links that expires every n seconds, and only do the lookup if the cache doesn't hold the link.)
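As a rough illustration of that caching idea (a minimal sketch; the in-memory dict and the 60-second TTL are assumptions, not part of the answer):

import time

_cache = {}        # url -> timestamp of the last successful check
CACHE_TTL = 60     # assumed expiry in seconds

def check_url_cached(url):
    now = time.time()
    # serve from the cache if we validated this url recently
    if url in _cache and now - _cache[url] < CACHE_TTL:
        return True
    ok = check_url(url)  # the function defined above
    if ok:
        _cache[url] = now
    return ok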

Note that if you just want to check that the link points to a valid domain, it's a bit easier: you can just do a DNS query. (The same point about caching and avoiding abuse probably applies.)
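A minimal sketch of that DNS-only check using the standard library (socket.gethostbyname is just one way to resolve a name):

import socket
from urllib.parse import urlparse

def domain_resolves(url):
    host = urlparse(url).hostname
    if not host:
        return False
    try:
        socket.gethostbyname(host)  # raises if the name does not resolve
        return True
    except socket.gaierror:
        return False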

Note that I used requests because it is easy, but you likely want to do this in the background, either with requests in a thread, or with one of the asyncio HTTP libraries and an asyncio event loop. Otherwise your code will block for at least timeout seconds.
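For example, a sketch of pushing the check onto a worker thread with the standard-library thread pool (the pool size and the callback-style API are assumptions for illustration):

from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)  # assumed pool size

def check_url_in_background(url, callback):
    # run the blocking check on a worker thread and report the result
    future = executor.submit(check_url, url)
    future.add_done_callback(lambda f: callback(url, f.result()))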

(Another attack: this gets the whole page. What if a user links to a massive page? See this question for a discussion of protecting against oversize responses. For your use case you likely just want to get a few bytes. I've deliberately not complicated the example code here because I wanted to illustrate the principle.)
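One way to avoid pulling the whole body is to stream the response and read only a small prefix (a sketch; the 1024-byte limit is an assumed value):

import requests
from requests.exceptions import RequestException

def check_url_small(url):
    try:
        # stream=True defers downloading the body; we read at most one small chunk
        with requests.get(url, timeout=1, stream=True) as r:
            next(r.iter_content(chunk_size=1024), b'')
            return r.status_code == 200
    except RequestException:
        return False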

Note that this just checks that something is available on that page. What if it's one of the many dead links which redirect to a domain-name seller's website? You could enforce 'no redirects', but then some redirects are valid. (Likewise, you could try to detect redirects up to the main domain or to a blacklist of vendors' domains, but this will always be imperfect.) There is a tradeoff to consider here, which depends on your concrete use case, but it's worth being aware of.
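If you do decide to reject redirects, a sketch of the strict variant (treating any 3xx as a failure, which, as noted above, will also reject some perfectly valid links):

import requests
from requests.exceptions import RequestException

def check_url_no_redirects(url):
    try:
        r = requests.get(url, timeout=1, allow_redirects=False)
        # any 3xx response is treated as a dead or parked link here
        return r.status_code == 200
    except RequestException:
        return False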

You could try sending an HTTP request, opening the result, and keeping a list of known error codes (404, etc.). You can easily implement this in Python, and it is efficient and quick. Be warned that sometimes (quite rarely) a website might detect your scraper and artificially return an error code to confuse you.
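A minimal sketch of that approach using only the standard library (the particular set of codes treated as "dead" is an assumption):

from urllib.request import urlopen
from urllib.error import HTTPError, URLError

DEAD_CODES = {400, 401, 403, 404, 410, 500, 502, 503}  # assumed list of "dead" codes

def link_is_dead(url):
    try:
        urlopen(url, timeout=1)
        return False
    except HTTPError as e:
        # the server answered, but with an error status
        return e.code in DEAD_CODES
    except URLError:
        # the server could not be reached at all
        return True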
