Replacing Unicode character / Python / Django

I'm pretty much forced to replace some Unicode characters in a string returned by some OCR technology, and the only way I have found to do it is to replace them one by one, using the following code:

def recode(mystr):
    # Each call swaps one literal escape sequence (backslash, 'u', four hex
    # digits) for the actual character it names.
    mystr = mystr.replace(r'\u0104', '\u0104')
    mystr = mystr.replace(r'\u017c', '\u017c')
    mystr = mystr.replace(r'\u0106', '\u0106')
    ...
    ...
    mystr = mystr.replace(r'\u017a', '\u017a')
    mystr = mystr.replace(r'\u017c', '\u017c')
    return mystr

The reason `foo` is not read as raw text is that the `r` in front of a string only plays a role when the string literal is created. Afterwards it behaves like any normal string, for example when the `%` operator is applied.

As a solution to what you want to do, you can try something like this:

bar = r"\u0104"
mystr = mystr.replace(bar, chr(int(bar[2:], 16)))
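
The same idea can be applied to every escape sequence at once instead of one `bar` at a time. The following is only a sketch: the name `decode_u_escapes` and the regular expression are illustrative assumptions, not part of the original answer, and it handles only the four-digit `\uXXXX` form (surrogate pairs are not combined into a single character).

```python
import re

# Matches a literal backslash, the letter 'u', and exactly four hex digits.
_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def decode_u_escapes(text):
    # Hypothetical helper: convert each matched hex code to its character.
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

sample = r'\u0104 and \u017c'
print(decode_u_escapes(sample))
```

Text that contains no escape sequences is returned unchanged, so the helper is safe to call on strings that are already decoded.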

This is an XY problem. The API is returning literal Unicode escape sequences. Maybe it is actually JSON and the OP should be calling `json.loads()` on the returned data, but if not, you can use the `unicode_escape` codec to translate the escape codes. That codec requires a byte string, so the text may need to be encoded via `ascii` or `latin1` first:

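# The question's approach: one replace() call per escape sequence.
def recode(mystr):
    mystr = mystr.replace(r'\u0104', '\u0104')
    mystr = mystr.replace(r'\u017c', '\u017c')
    mystr = mystr.replace(r'\u0106', '\u0106')
    mystr = mystr.replace(r'\u017a', '\u017a')
    mystr = mystr.replace(r'\u017c', '\u017c')
    return mystr

# The codec approach: encode to bytes, then let unicode_escape decode
# every escape sequence in one pass.
def recode2(s):
    return s.encode('latin1').decode('unicode_escape')

s = r'\u0104\u017c\u0106\u017a\u017c'
print(s)           # the raw escape sequences, as returned by the API
print(recode(s))   # decoded one replacement at a time
print(recode2(s))  # decoded by the codec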
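
If the returned data does turn out to be JSON, the JSON parser decodes the escapes by itself and no manual replacement is needed. A minimal sketch, assuming a response body shaped like the hypothetical `payload` below (the `text` key is made up for illustration):

```python
import json

# Hypothetical response body; the real API output may look different.
payload = r'{"text": "\u0104\u017c\u0106"}'

data = json.loads(payload)  # \uXXXX escapes are decoded while parsing
print(data["text"])
```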
