Mapping Unicode to ASCII in Python

I receive strings after querying via urlopen in JSON format:

def get_clean_text(text):
    return text.translate(maketrans("!?,.;():", "        ")).lower().strip()

for track in json["tracks"]:
    print track["name"].lower()
    get_clean_text(track["name"].lower())

For the string "türlich, türlich (sicher, dicker)" I then get

  File "main.py", line 23, in get_clean_text
    return text.translate(maketrans("!?,.;():", "        ")).lower().strip()
TypeError: character mapping must return integer, None or unicode

I want to format the string to be "türlich türlich sicher dicker".

The question is not a complete, self-contained example; I can't be sure whether it's Python 2 or 3, where maketrans came from, etc. There's a good chance I will guess wrong, which is why you should be sure to tag your questions appropriately and provide a short, self-contained, correct example. (That, and the fact that various other people, some of them probably smarter than me, likely ignored your question because it was ambiguous.)

Assuming you're using 2.x, and you've done a from string import * to get maketrans, and track["name"] is unicode rather than str/bytes, here's your problem:

There are two kinds of translation tables: old-style 8-bit tables (which are just an array of 256 characters) and new-style sparse tables (which are just a dict mapping one character's ordinal to another). The str.translate function can use either, but unicode.translate can only use the second (for reasons that should be obvious if you think about it for a bit).

The string.maketrans function makes old-style 8-bit translation tables, so you can't use it with unicode.translate.
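To make the difference between the two table kinds concrete, here is a small sketch. It is written for Python 3, which is an assumption on my part (the answer above targets 2.x): there, bytes.maketrans plays the role of the old string.maketrans, and str.maketrans builds the sparse dict form.

```python
# Python 3 analogue of the two table kinds (assumption: the answer above is about 2.x,
# where string.maketrans builds the old-style table and you build the dict by hand).

# Old-style table: a fixed 256-entry lookup, one slot per possible byte value.
old_style = bytes.maketrans(b"!?", b"  ")

# New-style table: a sparse dict from code point to code point.
new_style = str.maketrans("!?", "  ")

print(len(old_style))   # 256 slots, so it can only describe 8-bit characters
print(new_style)        # only the characters that actually change are listed

# The sparse form works on text containing non-ASCII characters such as U+00FC.
print("t\xfcrlich!".translate(new_style))
```

A 256-slot array has no room for a code point like U+00FC and beyond once you allow the full Unicode range, which is why the text type only accepts the dict form.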

You can always write your own "makeunitrans" function as a drop-in replacement, something like this:

def makeunitrans(frm, to):
  return {ord(f):ord(t) for (f,t) in zip(frm, to)}

But if you just want to map out certain characters, you could do something a bit more special purpose:

def makeunitrans(frm):
  return {ord(f):ord(' ') for f in frm}

However, from your final comment, I'm not sure translate is even what you want:

I want to format the string to be "türlich türlich sicher dicker"

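If you get this right, you're going to format the string to be "türlich  türlich  sicher  dicker ", because you're mapping all those punctuation characters to spaces, not to nothing. Each replaced character leaves an extra space behind, so the words end up separated by doubled spaces, plus a trailing space until strip() removes it.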
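A quick way to check this is to run the dict-based table against the string from the question. This is only a sketch: it assumes the two-argument makeunitrans from above and spells "ü" as the escape \xfc so the snippet does not depend on the source file's encoding.

```python
def makeunitrans(frm, to):
    # Sparse table: code point of each source character -> code point of its replacement.
    return {ord(f): ord(t) for (f, t) in zip(frm, to)}

# Map each of the eight punctuation characters to a space.
table = makeunitrans(u"!?,.;():", u" " * 8)

text = u"t\xfcrlich, t\xfcrlich (sicher, dicker)"
raw = text.translate(table)       # punctuation replaced by spaces
cleaned = raw.lower().strip()     # strip() only trims the ends, not the doubled spaces

print(repr(raw))
print(repr(cleaned))
```

The TypeError is gone, but the repr output shows two spaces between the words, which is not the format the question asks for.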

With new-style translation tables you can map anything you want to None, which deletes the character and solves that problem. But you might want to step back and ask why you're using the translate method in the first place instead of, e.g., calling replace multiple times (people usually say "for performance", but you wouldn't be building the translation table inline every time through if that were an issue) or using a trivial regular expression.
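Both alternatives can be sketched in a few lines. The function names clean_with_translate and clean_with_regex are made up for illustration, and the table and pattern are built once at module level rather than on every call.

```python
import re

PUNCT = u"!?,.;():"

# Mapping a code point to None tells translate() to delete that character.
DELETE_TABLE = {ord(c): None for c in PUNCT}

# Inside a character class these punctuation characters are all literal.
PUNCT_RE = re.compile(u"[!?,.;():]")

def clean_with_translate(text):
    # Hypothetical helper: drop the punctuation via a sparse translation table.
    return text.translate(DELETE_TABLE).lower().strip()

def clean_with_regex(text):
    # Hypothetical helper: drop the punctuation via a regular expression.
    return PUNCT_RE.sub(u"", text).lower().strip()

text = u"t\xfcrlich, t\xfcrlich (sicher, dicker)"
print(repr(clean_with_translate(text)))
print(repr(clean_with_regex(text)))
```

Because the punctuation is removed rather than replaced, the single spaces already present in the input are the only ones left, which gives the format requested in the question.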
