Python: AssertionError: proxies must be a mapping
I get this error:
Traceback (most recent call last):
  File "script.py", line 7, in <module>
    proxy = urllib2.ProxyHandler(line)
  File "/usr/lib/python2.7/urllib2.py", line 713, in __init__
    assert hasattr(proxies, 'has_key'), "proxies must be a mapping"
AssertionError: proxies must be a mapping
when I run the following script:
import urllib2

u = open('urls.txt')
p = open('proxies.txt')
for line in p:
    proxy = urllib2.ProxyHandler(line)
    opener = urllib2.build_opener(proxy)
    urllib2.install_opener(opener)
    for url in u:
        urllib.urlopen(url).read()
u.close()
p.close()
my urls.txt file has this:
'www.google.com'
'www.facebook.com'
'www.reddit.com'
and my proxies.txt has this:
{'https': 'https://94.142.27.4:3128'}
{'http': 'http://118.97.95.174:8080'}
{'http':'http://66.62.236.15:8080'}
I found them at hidemyass.com. From the googling I have done, most people who have had this problem had their proxies formatted wrong. Is this the case here?
As the documentation says:

    If proxies is given, it must be a dictionary mapping protocol names to URLs of proxies.
But in your code, it's just a string. In particular, it's one line out of your proxies.txt file:

p = open('proxies.txt')
for line in p:
    proxy = urllib2.ProxyHandler(line)
Looking at the file, it looks like the lines are intended to be something like the repr of a Python dictionary. And, given that all of the keys and values are string literals, that means you could use ast.literal_eval on it to recover the original dicts:

import ast

p = open('proxies.txt')
for line in p:
    d = ast.literal_eval(line)
    proxy = urllib2.ProxyHandler(d)
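As a quick sanity check, here is that parsing step in isolation, using one of the lines from the question's proxies.txt (the trailing newline a file iterator would yield is included; ast.literal_eval tolerates it):

```python
import ast

# One line from proxies.txt, as a file iterator would yield it
line = "{'https': 'https://94.142.27.4:3128'}\n"

d = ast.literal_eval(line)
print(type(d).__name__)  # dict
print(d['https'])        # https://94.142.27.4:3128
```

Unlike eval, ast.literal_eval only accepts Python literals, so it is safe to run on untrusted file contents.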
Of course that won't work for your sample data, because one of the lines is missing a ' character. But if you fix that, it will…
However, it would probably be better to use a format that's actually intended for data interchange. For example, JSON is just as human-readable as what you've got, and not all that different:
{"https": "https://94.142.27.4:3128"}
{"http": "http://118.97.95.174:8080"}
{"http": "http://66.62.236.15:8080"}
The advantage of using JSON is that there are plenty of tools to validate, edit, etc. JSON, and none for your custom format; the rules for what is and isn't valid are obvious, rather than something you have to guess at; and the error messages for invalid data will likely be more helpful (like "Expecting property name at line 1 column 10 (char 10)" as opposed to "unexpected EOF while parsing").
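To illustrate both points at once: a JSON line parses to a dict with json.loads, while the single-quoted repr style from the original file is rejected with a clear message (the exact wording varies between Python versions, so only the error type is shown here):

```python
import json

# A well-formed JSON line parses straight to a dict:
proxy = json.loads('{"http": "http://118.97.95.174:8080"}')
print(proxy['http'])  # http://118.97.95.174:8080

# Python's repr-style single quotes are *not* valid JSON, and the
# parser complains about the property name immediately:
try:
    json.loads("{'http': 'http://118.97.95.174:8080'}")
except ValueError as e:  # json.JSONDecodeError subclasses ValueError
    print(type(e).__name__)
```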
Note that once you solve this problem, you're going to run into another one with the URLs. After all, 'www.google.com'\n is not what you want, it's www.google.com. So you're going to have to strip off the newline and the quotes. Again, you could use ast.literal_eval here. Or you could use JSON as an interchange format.
But really, if you're just trying to store one string per line, why not store the strings as-is, instead of storing a string representation of those strings (with the extra quotes on)?
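A minimal sketch of that cleanup for one line of urls.txt as currently written, showing both the manual-stripping route and the literal_eval route:

```python
import ast

line = "'www.google.com'\n"

# Option 1: strip the newline, then the surrounding quotes
url = line.strip().strip("'")
print(url)  # www.google.com

# Option 2: let ast.literal_eval undo the string repr
print(ast.literal_eval(line.strip()))  # www.google.com
```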
There are still more problems beyond that. Even after you get rid of the excess quotes, www.google.com isn't a URL, it's just a hostname. http://www.google.com is what you want here. Unless you want https://www.google.com, or some other scheme.
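One way to see the difference is to parse both forms: a bare hostname has an empty scheme, which is why urlopen can't use it. (This sketch uses Python 3's urllib.parse; on the Python 2 used in the question, the same urlparse function lives in the urlparse module.)

```python
from urllib.parse import urlparse

# A bare hostname parses with no scheme at all...
print(repr(urlparse('www.google.com').scheme))         # ''

# ...while a real URL carries the scheme urlopen needs:
print(urlparse('http://www.google.com').scheme)        # http
```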
You're trying to loop through 'urls.txt' once for each proxy. That's going to process all of the URLs with just the first proxy installed, and then the remainder (which is nothing, since you already did all of them) with the first two installed, and then the remainder (which is still nothing) with all three installed. Move the url loop outside of the proxy loop.
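The reason the second and third passes see nothing is that a file object is an iterator, and an iterator yields nothing once it's exhausted. A minimal illustration with an in-memory iterator standing in for the open file:

```python
# Any iterator (including an open file object) is used up after one full pass:
lines = iter(['www.google.com\n', 'www.reddit.com\n'])

print(list(lines))  # ['www.google.com\n', 'www.reddit.com\n']
print(list(lines))  # [] -- a second pass yields nothing
```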
Finally, these aren't really a problem, but while we're at it… Using a with statement makes it much easier to write robust code than using manual close calls, and it makes your code shorter and more readable to boot. Also, it's usually better to wait until you need a file before you try to open it. And variable names like u and p are just going to cause more confusion in the long run than they'll save typing in the short run.
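The robustness point is that a with block closes the file on exit no matter how the block ends, even on an exception, with no close() call to forget. A small demonstration using an in-memory StringIO (which supports the same context-manager protocol as a real file):

```python
import io

f = io.StringIO("'www.google.com'\n")
with f:
    print(f.readline().rstrip())  # 'www.google.com'

print(f.closed)  # True -- closed by the with block, no explicit close()
```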
Oh, and just calling urllib.urlopen(url).read() and not doing anything with the result won't have any effect except to waste a few seconds and a bit of network bandwidth, but I assume you already knew that, and just left out the details for the sake of simplicity.
Putting it all together, and assuming you fix the two files as described above:
import json
import urllib2

with open('proxies.txt') as proxies:
    for line in proxies:
        proxy = json.loads(line)
        proxy_handler = urllib2.ProxyHandler(proxy)
        opener = urllib2.build_opener(proxy_handler)
        urllib2.install_opener(opener)
with open('urls.txt') as urls:
    for line in urls:
        url = line.rstrip()
        data = urllib2.urlopen(url).read()
        # do something with data
As it turns out, you want to try all of the URLs through each proxy, not try all of them through all the proxies, or through the first and then the first two and so on. You could do this by indenting the second with and its for under the first for. But it's probably simpler to just read all the URLs at once (and probably more efficient, although I doubt that matters):
with open('urls.txt') as f:
    urls = [line.rstrip() for line in f]
with open('proxies.txt') as proxies:
    for line in proxies:
        proxy = json.loads(line)
        proxy_handler = urllib2.ProxyHandler(proxy)
        opener = urllib2.build_opener(proxy_handler)
        urllib2.install_opener(opener)
        for url in urls:
            data = urllib2.urlopen(url).read()
            # do something with data
Of course this means reading the whole list of URLs before doing any work. I doubt that will matter, but if it does, you can use the tee trick to avoid it.
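For completeness, a rough sketch of that tee trick, with in-memory stand-ins for the two files: itertools.tee hands out one lazy copy of the URL stream per proxy, so lines are pulled from the source only as each copy is consumed. (The proxy-installation step is elided here.)

```python
import itertools

# Hypothetical stand-ins for open('urls.txt') and the parsed proxies:
url_lines = iter(['http://www.google.com\n', 'http://www.reddit.com\n'])
proxies = [{'https': 'https://94.142.27.4:3128'},
           {'http': 'http://118.97.95.174:8080'}]

# One independent, lazy iterator over the URLs per proxy:
streams = itertools.tee(url_lines, len(proxies))
for proxy, stream in zip(proxies, streams):
    # install the ProxyHandler for `proxy` here, then:
    for line in stream:
        url = line.rstrip()
        print(proxy, url)
```

Note that tee buffers each item until every copy has consumed it, so if one proxy fetches the whole list before the next starts, the full list ends up in memory anyway; the trick mainly avoids the up-front read.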