Python: URL-encode URLs with Latin characters

I have many entities in the database with a "url" attribute. In many records the URL is hardcoded, i.e. it contains unencoded Latin characters, which doesn't work in Firefox (the URLs point to song files stored in S3, and I play them with SoundManager 2).

Example:

url with latin character "ó": https://something.s3.amazonaws.com/music/something/thisó.mp3

If I replace "ó" with its percent-encoded UTF-8 bytes "%c3%b3", then https://something.s3.amazonaws.com/music/something/this%c3%b3.mp3 works.

I would like to replace all Latin and special characters with their URL-encoded UTF-8 codes, based on this chart.

As asked by @albert, I'm posting the solution I found.

Using the "quote" function from "urllib", you can encode Latin characters and characters like " ", "(" and all other special characters. Since "quote" also converts "http:" to "http%3A", which is not desired, it was necessary to split the URL and convert only the wanted part. Another thing to consider is whether the URLs are already partially or completely encoded. In that case the URL already contains percent-encoded characters, which contain "%"; "quote" treats "%" as a special character and converts it to "%25", which turns the URL into one that no longer resolves!
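The scheme problem can be seen in a small sketch. It assumes Python 3, where the function lives in "urllib.parse" (the original post uses Python 2's "urllib.quote"); the variable names are only illustrative.

```python
from urllib.parse import quote, urlsplit, urlunsplit

url = "http://something/cóntaining space song name.mp3"

# Quoting the whole URL also escapes the ":" after the scheme.
whole = quote(url)

# Quoting only the path leaves the scheme and host untouched.
parts = urlsplit(url)
path_only = urlunsplit(parts._replace(path=quote(parts.path)))

print(whole)
print(path_only)
```

This only helps when the path is not yet encoded at all; the partially encoded case is a separate problem.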

Example of the case:

Suppose the URL is url = "http://something/cóntaining space song name.mp3"

If the URL is already partially encoded (e.g. " " has become "%20"), then the current URL may look like this:

url = "http://something/cóntaining%20space%20song%20name.mp3"

urllib.quote(url) will then give (let's assume that "http:" is not converted to "http%3A"):

"http://something/c%C3%B3ntaining%2520space%2520song%2520name.mp3"

The result is a mess!
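A minimal sketch that reproduces this double encoding, assuming Python 3's "urllib.parse.quote". Passing safe="/:" is my way of modelling the assumption above that "http:" is left alone; it is not part of the original post.

```python
from urllib.parse import quote

partially_encoded = "http://something/cóntaining%20space%20song%20name.mp3"

# "%" is not in the safe set, so every existing "%20" becomes "%2520".
double_encoded = quote(partially_encoded, safe="/:")
print(double_encoded)
```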

With that being said, we can't just split the URL into "http:" and the rest and then apply "quote" to the second part.

So the solution is to encode these special characters one by one: replace each Latin or special character with its percent-encoded UTF-8 code. Then comes the question: how?

It is painful to check whether each URL contains any character from a list of such characters (another thing: if the URL is unicode, you can't use url.find("ó")). So here comes the trick: the problem is the solution!

How do we find the Latin and special characters? With the exception!

If the URLs (containing bad characters) are of type "unicode", converting them to "str" will raise an exception (UnicodeEncodeError).

If the URLs (containing bad characters) are of type "str", converting them to "unicode" will raise an exception (UnicodeDecodeError).

We find the wanted characters with the exception, which reports the position of the offending character ;-)
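As a rough illustration of the idea in Python 3 (an assumption on my part; the post itself is Python 2), encoding to ASCII raises UnicodeEncodeError, and its "start" attribute gives the index of the first character that could not be encoded:

```python
url = "https://something.s3.amazonaws.com/music/something/thisó.mp3"

try:
    url.encode("ascii")  # fails on the first non-ASCII character
except UnicodeEncodeError as exc:
    bad_pos = exc.start        # index of the offending character
    bad_char = url[bad_pos]    # the character itself
    print(bad_pos, repr(bad_char))
```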

Then split the URL at the position of that character, quote the character, and finally rebuild the URL.

In my case, the URLs are unicode:

import urllib

from core.models import Song


songs = Song.objects.all()

for song in songs:
    try:
        # Python 2: converting unicode to str raises for non-ASCII characters
        x = str(song.song_url)
    except UnicodeEncodeError as e:
        pos = e.start  # position of the first bad character
        c = song.song_url[pos].encode("utf8")
        q = urllib.quote(c)  # percent-encode its UTF-8 bytes
        p1 = song.song_url[:pos]  # part before the bad character
        p2 = song.song_url[pos+1:]  # part after the bad character
        res = p1 + q + p2  # rebuilt URL
        song.song_url = res
        song.save()
        print res

Note: if the URL contains several "bad" characters, the code above handles only the first one in each URL, so either apply it recursively or run it several times until you get no output. I hope this helps.
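One way to handle every bad character in a single pass is to repeat the same trick in a loop until no exception is raised. This is a sketch under the assumption of Python 3 ("urllib.parse.quote", "exc.start"); the helper name encode_bad_chars is hypothetical and not from the original post.

```python
from urllib.parse import quote


def encode_bad_chars(url):
    """Percent-encode each non-ASCII character, leaving everything else as is."""
    while True:
        try:
            url.encode("ascii")  # succeeds once no bad character is left
            return url
        except UnicodeEncodeError as exc:
            pos = exc.start  # index of the first remaining bad character
            # quote() encodes the character as UTF-8 and percent-escapes the bytes
            url = url[:pos] + quote(url[pos]) + url[pos + 1:]


print(encode_bad_chars("http://something/cóntaining%20space%20sóng.mp3"))
```

Because only the characters that trigger the exception are rewritten, existing "%20" sequences are left untouched. ASCII special characters such as a literal space are not caught this way, since they do not raise.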

Generic example where the URL is of type "str":

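# -*- coding: utf-8 -*-
import urllib

url = "https://something.s3.amazonaws.com/music/something/thisó.mp3"

try:
    # Python 2: converting a non-ASCII str to unicode raises
    x = unicode(url)
except UnicodeDecodeError as e:
    # byte position of the first bad character; everything before it is
    # ASCII, so it is also the character index in the decoded string
    pos = e.start
    url2 = url.decode('utf8')
    c = url2[pos].encode("utf8")
    q = urllib.quote(c)
    p1 = url2[:pos]
    p2 = url2[pos+1:]
    res = p1 + q + p2
    print res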

I hope this solution is helpful to anyone who comes across it.
