How to modify this script to check for HTTP status (404, 200)
I am currently using the following script to load a list of URLs and then check the source of each one for a list of error strings. If no error string is found in the source, the URL is considered valid and written to a text file.
How can I modify this script to check the HTTP status instead? If a URL returns a 404 it should be ignored; if it returns 200 the URL should be written to the text file. Any help would be much appreciated.
import urllib2
import sys

error_strings = ['invalid product number', 'specification not available. please contact customer services.']

def check_link(url):
    if not url:
        return False
    f = urllib2.urlopen(url)
    html = f.read()
    result = False
    if html:
        result = True
        html = html.lower()
        for s in error_strings:
            if s in html:
                result = False
                break
    return result

if __name__ == '__main__':
    if len(sys.argv) == 1:
        print 'Usage: %s <file_containing_urls>' % sys.argv[0]
    else:
        output = open('valid_links.txt', 'w+')
        for url in open(sys.argv[1]):
            if check_link(url.strip()):
                output.write('%s\n' % url.strip())
        output.flush()
        output.close()
You can alter your call to urlopen slightly:
>>> try:
...     f = urllib2.urlopen(url)
... except urllib2.HTTPError, e:
...     print e.code
...
404
Using e.code, you can check whether the request returned a 404. If you don't hit the except block, you can use f as you currently do.
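For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, the same idea can be sketched roughly as below. This is a minimal sketch and not part of the original answer: the name fetch_status, the opener parameter (there only so the function can be exercised without a network), the timeout default, and the choice to return None on connection failures are all assumptions.

```python
import urllib.error
import urllib.request


def fetch_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code for url, or None if no HTTP response arrived.

    fetch_status, opener and timeout are illustrative names, not from the thread.
    """
    try:
        response = opener(url, timeout=timeout)
    except urllib.error.HTTPError as e:
        # Raised for error statuses such as 404; the code is on the exception.
        return e.code
    except urllib.error.URLError:
        # DNS failure, refused connection and similar: no status to report.
        return None
    try:
        # No exception was raised: read the status from the response object.
        return response.getcode()
    finally:
        response.close()
```

HTTPError is a subclass of URLError, so it has to be caught first. A caller could then keep a URL only when fetch_status(url) == 200.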
urllib2.urlopen gives back a file-like object with some other methods, one of which, getcode(), is what you're looking for. Just add a line:
if f.getcode() != 200:
    return False
in the relevant place, that is, right after the call to urlopen.
Try this:
def check_link(url):
    if not url:
        return False
    code = None
    try:
        f = urllib2.urlopen(url)
        # The method is getcode(), all lowercase
        code = f.getcode()
    except urllib2.HTTPError, e:
        # Error statuses such as 404 arrive as an exception
        code = e.code
    result = True
    if code != 200:
        result = False
    return result
Alternatively, if you just need to maintain a list of invalid status codes and check against it, it would look something like this:
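# invalid_code_strings is not defined in this snippet. It is assumed to be
# a list of integer status codes (e.code and getcode() are ints), e.g. [404].
def check_link(url):
    if not url:
        return False
    code = None
    try:
        f = urllib2.urlopen(url)
        # The method is getcode(), all lowercase
        code = f.getcode()
    except urllib2.HTTPError, e:
        code = e.code
    result = True
    if code in invalid_code_strings:
        result = False
    return result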
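The list-based variant can be made a little more concrete with a small filter. The sketch below is only an illustration: INVALID_CODES, is_valid, filter_links and the get_status callable (any function that maps a URL to an integer status code, or None when no response arrived) are hypothetical names, and the particular codes in the set are an arbitrary choice. The codes are integers because that is what e.code and getcode() return, so a list of strings such as '404' would never match.

```python
# Hypothetical set of status codes to reject; adjust to taste.
INVALID_CODES = frozenset([404, 410])


def is_valid(code, invalid_codes=INVALID_CODES):
    """True if a status code was obtained and it is not in the reject list."""
    return code is not None and code not in invalid_codes


def filter_links(lines, get_status, invalid_codes=INVALID_CODES):
    """Return the stripped, non-empty URLs whose status passes is_valid.

    get_status is assumed to be a callable: url -> int status code or None.
    """
    valid = []
    for line in lines:
        url = line.strip()
        if url and is_valid(get_status(url), invalid_codes):
            valid.append(url)
    return valid
```

Passing the status lookup in as a parameter keeps the filtering logic separate from the network call, so the reject list can be changed or tested without touching urlopen.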