Mapping Unicode to ASCII in Python

I receive strings in JSON format after querying via urlopen:

def get_clean_text(text):
    return text.translate(maketrans("!?,.;():", "        ")).lower().strip()

for track in json["tracks"]:
    print track["name"].lower()
    get_clean_text(track["name"].lower())

For the string "türlich, türlich (sicher, dicker)" I then get:

  File "main.py", line 23, in get_clean_text
    return text.translate(maketrans("!?,.;():", "        ")).lower().strip()
TypeError: character mapping must return integer, None or unicode

I want to format the string to be "türlich türlich sicher dicker".

The question is not a complete, self-contained example; I can't be sure whether it's Python 2 or 3, where maketrans came from, etc. There's a good chance I will guess wrong, which is why you should be sure to tag your questions appropriately and provide a short, self-contained, correct example. (That, and the fact that various other people, some of them probably smarter than me, likely ignored your question because it was ambiguous.)

Assuming you're using 2.x, and you've done a from string import * to get maketrans, and track["name"] is unicode rather than str/bytes, here's your problem:

There are two kinds of translation tables: old-style 8-bit tables (which are just an array of 256 characters) and new-style sparse tables (which are just a dict mapping one character's ordinal to another). The str.translate function can use either, but unicode.translate can only use the second, because a 256-entry table cannot cover the full range of Unicode code points.

The string.maketrans function makes old-style 8-bit translation tables, so you can't use it with unicode.translate.
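To make the two table shapes concrete, here is a small sketch. It assumes Python 3, where the closest counterparts of the 2.x string.maketrans and unicode.translate are bytes.maketrans and str.translate; the variable names are only illustrative.

```python
# Python 3 analogue of the two table kinds (assumption: the question itself is 2.x).

# Old-style table: a flat lookup with one entry per possible byte value.
byte_table = bytes.maketrans(b"!?", b"  ")
byte_table_size = len(byte_table)            # 256

# New-style table: a sparse dict from code point to code point.
uni_table = str.maketrans("!?", "  ")        # {33: 32, 63: 32}

# Each table works with its own string type.
bytes_result = b"hi!".translate(byte_table)  # b'hi '
text_result = "hi!".translate(uni_table)     # 'hi '
```

The byte table has a slot for every byte value, which is why it cannot describe characters beyond the first 256, while the dict only lists the characters that change.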

You can always write your own "makeunitrans" function as a drop-in replacement, something like this:

def makeunitrans(frm, to):
  return {ord(f):ord(t) for (f,t) in zip(frm, to)}

But if you just want to blank out certain characters, you could do something a bit more special-purpose:

def makeunitrans(frm):
  return {ord(f):ord(' ') for f in frm}
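As a quick check, the special-purpose version can be applied to the string from the question. This sketch assumes Python 3, where every str is Unicode; on 2.x the literals would need a u prefix and the file a coding declaration.

```python
def makeunitrans(frm):
    # Map each listed character's code point to the code point of a space.
    return {ord(f): ord(' ') for f in frm}

text = "türlich, türlich (sicher, dicker)"
cleaned = text.translate(makeunitrans("!?,.;():")).lower().strip()
print(repr(cleaned))  # 'türlich  türlich  sicher  dicker'
```

The TypeError is gone, but each removed punctuation mark leaves a space behind, so the words end up separated by two spaces.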

However, from your final comment, I'm not sure translate is even what you want:

I want to format the string to be "türlich türlich sicher dicker"

If you get this right, you're going to format the string to be "türlich  türlich  sicher  dicker", with doubled spaces, because you're mapping all those punctuation characters to spaces, not to nothing.

With new-style translation tables you can map anything you want to None, which deletes that character and so solves the problem. But you might want to step back and ask why you're using the translate method in the first place instead of, e.g., calling replace multiple times (people usually say "for performance", but you wouldn't be building the translation table inline on every call if that were an issue) or using a trivial regular expression.
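The three options can be compared side by side. This is a minimal sketch assuming Python 3; PUNCTUATION and the via_* names are made up for illustration.

```python
import re

PUNCTUATION = "!?,.;():"
text = "türlich, türlich (sicher, dicker)"

# 1. Sparse table mapping each code point to None, which deletes the character.
delete_table = {ord(c): None for c in PUNCTUATION}
via_translate = text.translate(delete_table).lower().strip()

# 2. Repeated replace calls, one per punctuation character.
via_replace = text
for c in PUNCTUATION:
    via_replace = via_replace.replace(c, "")
via_replace = via_replace.lower().strip()

# 3. A character-class regular expression; re.escape guards the metacharacters.
via_regex = re.sub("[" + re.escape(PUNCTUATION) + "]", "", text).lower().strip()

print(via_translate)  # türlich türlich sicher dicker
```

All three produce the same single-spaced result here. If performance really matters, build the table or compile the pattern once, outside the function that is called per track.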
