
How to download a file with urllib2.urlopen

I know how to download a simple file with urllib2.urlopen, but my URL is not simple; it has a special character in it:

" www.math.ualberta.ca/mss/misc/A Mathematician's Apology.pdf "

The special character ' (in Mathematician's) is in this path.

I happen to know that

http://www.math.ualberta.ca/mss/misc/A%20Mathematician%27s%20Apology.pdf

is the URL I have to use to download the file, but I won't always have this encoded form of the URL at hand.

Please give me a solution so that I can download a file whose URL contains special characters.

I know of these basic functions, but I don't know how to use them:

  1. urllib.quote(string[, safe])
  2. urllib.quote_plus(string[, safe])
  3. urllib.unquote(string)
  4. urllib.unquote_plus(string)

Please show me how to use them with an example.

Thank you.

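Why not use something like this?

import urllib

filename = url.split('/')[-1]           # last path segment becomes the local file name
cleanurl = urllib.quote(url)            # percent-encode spaces, apostrophes, etc.
urllib.urlretrieve(cleanurl, filename)  # download to the local file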

You want to quote only the path component of the URL, not the whole thing.
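To see why, here is a small sketch of what happens when the whole URL is quoted. The http:// prefix is assumed for illustration, and the import fallback is an addition so the snippet runs on both Python 2 and Python 3 (the thread itself uses Python 2's urllib).

```python
try:
    from urllib import quote          # Python 2
except ImportError:
    from urllib.parse import quote    # Python 3

# Assumed full URL: the question's host and path with an http:// scheme added.
full_url = "http://www.math.ualberta.ca/mss/misc/A Mathematician's Apology.pdf"

# quote() treats its argument as a path: by default only "/" is left alone,
# so the ":" after the scheme gets percent-encoded along with the spaces.
whole_quoted = quote(full_url)
print(whole_quoted)
```

The result starts with http%3A//, which is no longer a usable URL, so the scheme and host have to be kept out of the quoting step.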

The cleanest way to do this is to split it into pieces with urlparse, quote the path component, and rejoin the whole thing.

Note that urlparse does not quote the path for you, and it does not add a missing scheme either, so the quoting step has to be done explicitly. With a full URL it looks like this:

>>> import urllib, urlparse
>>> url = "http://www.math.ualberta.ca/mss/misc/A Mathematician's Apology.pdf"
>>> parts = urlparse.urlparse(url)
>>> url = urlparse.urlunparse(parts._replace(path=urllib.quote(parts.path)))
>>> url
'http://www.math.ualberta.ca/mss/misc/A%20Mathematician%27s%20Apology.pdf'

If you actually just have a host and path, you can in fact just use urllib.quote. With a full URL, that will quote the : character between the scheme and host, but if you don't have a scheme, as in your example, that's not a problem. (Of course it will quote the stray spaces in your example too… but those are going to be a problem no matter what you do, so your first step has to be removing them.)

>>> url = " www.math.ualberta.ca/mss/misc/A Mathematician's Apology.pdf "
>>> url = urllib.quote(url.strip())
>>> url
'www.math.ualberta.ca/mss/misc/A%20Mathematician%27s%20Apology.pdf'

You'll still need to add a scheme before this is actually useful, of course.
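Putting the steps together, a minimal sketch might look like the following. The helper name normalize_url, the default scheme of http, and the choice to treat "%" as safe (so an already-quoted URL is not quoted twice) are all assumptions for illustration, not part of urllib. The import fallback lets the same code run on Python 2 and Python 3.

```python
try:
    from urllib import quote                              # Python 2
    from urlparse import urlsplit, urlunsplit
except ImportError:
    from urllib.parse import quote, urlsplit, urlunsplit  # Python 3


def normalize_url(raw, default_scheme="http"):
    """Hypothetical helper: strip, add a scheme if missing, quote the path."""
    url = raw.strip()                      # drop stray leading/trailing spaces
    if "://" not in url:
        url = default_scheme + "://" + url  # assumed default scheme
    parts = urlsplit(url)
    # Quote only the path; scheme, host, query, and fragment pass through as-is.
    # "%" is kept safe so an already-quoted path is not quoted a second time.
    quoted_path = quote(parts.path, safe="/%")
    return urlunsplit((parts.scheme, parts.netloc, quoted_path,
                       parts.query, parts.fragment))


clean = normalize_url(" www.math.ualberta.ca/mss/misc/A Mathematician's Apology.pdf ")
print(clean)
```

The resulting string can then be passed to urllib2.urlopen (or urllib.request.urlopen on Python 3). One trade-off of keeping "%" safe is that a file name containing a literal percent sign would be left unescaped; drop it from safe if your input is never pre-quoted.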
