How to modify this script to check HTTP status (404, 200)

Question

I am currently using the following script to load a list of URLs and then check the source of each one for a list of error strings. If no error string is found in the source, the URL is considered valid and is written to a text file.

How can I modify this script to check the HTTP status instead? If a URL returns a 404 it should be ignored; if it returns 200 the URL should be written to the text file. Any help would be much appreciated.

import urllib2
import sys

error_strings = ['invalid product number', 'specification not available. please contact   customer services.']

def check_link(url):
    if not url:
        return False
    f = urllib2.urlopen(url)
    html = f.read()
    result = False
    if html:
        result = True
        html = html.lower()
        for s in error_strings:
            if s in html:
                result = False
                break
    return result


if __name__ == '__main__':
    if len(sys.argv) == 1:
        print 'Usage: %s <file_containing_urls>' % sys.argv[0]
    else:
        output = open('valid_links.txt', 'w+')
        for url in open(sys.argv[1]):
            if check_link(url.strip()):
                output.write('%s\n' % url.strip())
        output.flush()
        output.close()

Answer 1

You can alter your call to urlopen slightly:

>>> try:
...     f = urllib2.urlopen(url)
... except urllib2.HTTPError, e:
...     print e.code
...
404

Using e.code, you can check whether the request came back as a 404. If you don't hit the except block, you can use f as you currently do.
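For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, the same idea might look like the sketch below. The function name http_status and the opener parameter are hypothetical, introduced only so the logic can be exercised without a network connection. Note that HTTPError must be caught before URLError, because it is a subclass of it.

```python
import urllib.error
import urllib.request


def http_status(url, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, or None if no response arrived.

    `opener` is a hypothetical hook (defaults to urlopen) so the function
    can be tried out with a stand-in instead of a real request.
    """
    try:
        response = opener(url)
    except urllib.error.HTTPError as e:
        # 4xx/5xx responses are raised as HTTPError; the code is on the exception.
        return e.code
    except urllib.error.URLError:
        # DNS failure, refused connection, etc.: there is no status code at all.
        return None
    try:
        return response.getcode()
    finally:
        response.close()
```

A caller would then keep a URL only when http_status(url) == 200. Malformed URLs (for example, a missing scheme) raise ValueError, which this sketch deliberately leaves uncaught.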

Answer 2

urllib2.urlopen returns a file-like object with some additional methods, one of which, getcode(), is what you're looking for. Just add these lines:

if f.getcode() != 200:
    return False

in the relevant place.

Answer 3

Try this:

def check_link(url):
    if not url:
        return False
    code = None
    try:
        f = urllib2.urlopen(url)
        # The method is getcode(), all lowercase
        code = f.getcode()
    except urllib2.HTTPError, e:
        # 4xx/5xx responses raise HTTPError; the status is in e.code
        code = e.code
    result = True
    if code != 200:
        result = False
    return result

Alternatively, if you just need to maintain a list of invalid codes (invalid_code_strings below) and check against it, it would look something like this:

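# invalid_code_strings is assumed to be defined elsewhere as a list of
# integer status codes to reject, e.g. [404]
def check_link(url):
    if not url:
        return False
    code = None
    try:
        f = urllib2.urlopen(url)
        # The method is getcode(), all lowercase
        code = f.getcode()
    except urllib2.HTTPError, e:
        code = e.code

    result = True
    if code in invalid_code_strings:
        result = False

    return result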
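Putting the answers together with the loop from the question, a Python 3 version of the whole check could be sketched as follows. The names is_ok and write_valid_links, and the opener and check parameters, are hypothetical and exist only for illustration. Treating connection failures and malformed URLs the same as a non-200 status is an assumption about what the asker wants.

```python
import urllib.error
import urllib.request


def is_ok(url, opener=urllib.request.urlopen):
    """Return True only when url answers with HTTP 200."""
    if not url:
        return False
    try:
        response = opener(url)
    except (urllib.error.URLError, ValueError):
        # HTTPError (404, 500, ...) is a URLError subclass;
        # ValueError covers malformed URLs such as a missing scheme.
        return False
    try:
        return response.getcode() == 200
    finally:
        response.close()


def write_valid_links(lines, output, check=is_ok):
    """Write each URL in lines that passes check to output; return the count."""
    count = 0
    for line in lines:
        url = line.strip()
        if check(url):
            output.write(url + "\n")
            count += 1
    return count
```

In the question's main block, the inner for loop would then become a single call such as write_valid_links(open(sys.argv[1]), output).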

Solution 1
1 (accepted) 2014-10-17 14:17:30

Solution 2
0 2014-10-17 14:15:39

Solution 3
0 2014-10-17 14:18:13
