How to modify this script to check for HTTP status (404, 200)

I am currently using the following script to load a list of URLs and then check the source of each one for a list of error strings. If no error string is found in the source, the URL is considered valid and is written to a text file.

How can I modify this script to check the HTTP status instead? A URL that returns 404 should be ignored, and a URL that returns 200 should be written to the text file. Any help would be much appreciated.

import urllib2
import sys

error_strings = ['invalid product number', 'specification not available. please contact   customer services.']

def check_link(url):
    if not url:
        return False
    f = urllib2.urlopen(url)
    html = f.read()
    result = False
    if html:
        result = True
        html = html.lower()
        for s in error_strings:
            if s in html:
                result = False
                break
    return result


if __name__ == '__main__':
    if len(sys.argv) == 1:
        print 'Usage: %s <file_containing_urls>' % sys.argv[0]
    else:
        output = open('valid_links.txt', 'w+')
        for url in open(sys.argv[1]):
            if check_link(url.strip()):
                output.write('%s\n' % url.strip())
        output.flush()
        output.close()

You can alter your call to urlopen slightly:

>>> try:
...     f = urllib2.urlopen(url)
... except urllib2.HTTPError, e:
...     print e.code
...
404

Using e.code, you can check whether the request 404s on you. If you don't hit the except block, you can use f as you currently do.
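For readers on Python 3, where urllib2 was split into urllib.request and urllib.error and `except X, e` became `except X as e`, the same idea might look like the sketch below. The helper name `fetch_status` and its `opener` parameter are hypothetical, added only so the function can be tried without network access.

```python
import urllib.error
import urllib.request


def fetch_status(url, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, or None if no response arrived.

    fetch_status and opener are illustrative names, not part of urllib.
    """
    try:
        response = opener(url)
    except urllib.error.HTTPError as e:
        # Raised for error responses such as 404 or 500; the status is on e.code.
        # This clause must come first because HTTPError subclasses URLError.
        return e.code
    except urllib.error.URLError:
        # DNS failure, refused connection and similar: there is no status code.
        return None
    try:
        return response.getcode()
    finally:
        response.close()
```

Catching URLError as well is a design choice beyond what the answer shows: it keeps one unreachable host from aborting a run over a long list of URLs.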

urllib2.urlopen returns a file-like object with a few extra methods, one of which, getcode(), is what you're looking for. Just add these lines:

if f.getcode() != 200:
    return False

in the relevant place, that is, right after the call to urlopen.

Try this:

def check_link(url):
    if not url:
        return False
    try:
        f = urllib2.urlopen(url)
        code = f.getcode()  # the method is getcode(), all lowercase
    except urllib2.HTTPError as e:
        # urlopen raises HTTPError for error responses such as 404
        code = e.code
    return code == 200

Alternatively, if you just need to maintain a list of invalid status codes (invalid_code_strings below) and check against it, the function would look something like this:

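# Example value (an assumption, not from the original answer):
# the status codes to treat as invalid.
invalid_code_strings = [404]

def check_link(url):
    if not url:
        return False
    try:
        f = urllib2.urlopen(url)
        code = f.getcode()  # the method is getcode(), all lowercase
    except urllib2.HTTPError as e:
        code = e.code

    return code not in invalid_code_strings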
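Putting the pieces together, a Python 3 version of the whole script could be sketched as follows. It assumes urllib.request in place of urllib2; the function `write_valid_links` and the `opener` parameter are hypothetical names introduced for illustration and to allow testing without network access.

```python
import urllib.error
import urllib.request


def check_link(url, opener=urllib.request.urlopen):
    """Return True only when url answers with HTTP status 200."""
    if not url:
        return False
    try:
        response = opener(url)
        code = response.getcode()
        response.close()
    except urllib.error.HTTPError as e:
        code = e.code  # error responses such as 404 land here
    except urllib.error.URLError:
        return False  # no response at all, so treat the URL as invalid
    return code == 200


def write_valid_links(in_path, out_path, opener=urllib.request.urlopen):
    """Copy each URL in in_path that returns 200 to out_path; return the count."""
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            url = line.strip()
            if check_link(url, opener):
                dst.write(url + "\n")
                kept += 1
    return kept
```

To mirror the original command-line behaviour, one would call `write_valid_links(sys.argv[1], 'valid_links.txt')` after importing sys and checking the argument count. The `with` statement closes both files, so the explicit flush and close calls are no longer needed.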
