
How to convert a unicode string to a literal string in Python?

Here are a few example (unicode) strings:

a = u'\u03c3\u03c4\u03b7\u03bd \u03a0\u03bb\u03b1\u03c4\u03b5\u03af\u03b1 \u03c4\u03bf\u03c5'
b = u'\u010deprav so mu doma\u010di in strici duhovniki odtegovali denarno pomo\u010d . Kljub temu mu je uspelo'
c = u'sovi\xe9ticas excepto Georgia , inclusive las 3 rep\xfablicas que hab\xedan'

My end goal is to split on the backslashes (and spaces), so that it looks like this:

split_a = ['u03c3', 'u03c4', 'u03b7', 'u03bd', '', 'u03a0', 'u03bb', 'u03b1', 'u03c4', 'u03b5', 'u03af', 'u03b1', '', 'u03c4', 'u03bf', 'u03c5']
split_b = ['', 'u010deprav', 'so', 'mu', 'doma', 'u010di', 'in', 'strici', 'duhovniki', 'odtegovali', 'denarno', 'pomo', 'u010d', '.', 'Kljub', 'temu', 'mu', 'je', 'uspelo']
split_c = ['sovi', 'xe9ticas', 'excepto', 'Georgia', ',', 'inclusive', 'las', '3', 'rep', 'xfablicas', 'que', 'hab', 'xedan']

(The empty places where there is both a space and a backslash are totally fine.)

When I try to split using this:

a.split("\\\\"), it doesn't change the string at all.

I saw this example here, which makes me think that I need to make my strings literal strings (using r). However, I don't know how to convert my large list of strings into all literal strings.

When I searched on that, I got here. However, Python raises an error when I run a.encode('latin-1').decode('utf-8'). The error is: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)

So, my question is: How can I take a list of unicode strings, programmatically iterate through them and make them string literals, and then split on a backslash?

You have a Unicode string, which already has one Unicode codepoint per string element. The '\\\\' is just the representation of the string that is printed to the console; it's not the actual contents.
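A quick check makes this concrete. The sketch below reuses the string a from the question and assumes Python 3 (where the u prefix is still accepted); the variable names length, has_backslash and unchanged are only illustrative. It shows that the string holds 16 characters and no backslash at all, so split has nothing to find.

```python
a = u'\u03c3\u03c4\u03b7\u03bd \u03a0\u03bb\u03b1\u03c4\u03b5\u03af\u03b1 \u03c4\u03bf\u03c5'

length = len(a)             # counts code points, not the characters of the escapes
has_backslash = '\\' in a   # the escapes exist only in the source text / repr
unchanged = a.split('\\')   # no separator found, so a single-item list comes back

print(length, has_backslash, unchanged == [a])
```

The backslashes only appear once the string is turned into its escaped form, for example with repr or an escaping codec.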

Making a list of numbers out of it is actually quite easy:

split_a = [ord(c) for c in a]

If you need to make a bunch of strings consisting of the letter u followed by the hex value, that's only slightly more complicated:

split_a = ', '.join('u' + ('%04x' % ord(c)) for c in a)
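The line above joins everything into one comma-separated string. If a list is wanted instead, with each space showing up as an empty entry as in the question's target output, something like the following might do. This is only a sketch: hex_tokens is a hypothetical helper name, and it converts every non-space character, including plain ASCII letters.

```python
def hex_tokens(text):
    """Return one 'uXXXX' token per character; a space becomes an empty string."""
    return ['' if ch == ' ' else 'u%04x' % ord(ch) for ch in text]

a = u'\u03c3\u03c4\u03b7\u03bd \u03a0\u03bb\u03b1\u03c4\u03b5\u03af\u03b1 \u03c4\u03bf\u03c5'
split_a = hex_tokens(a)
print(split_a)
```

Because ASCII letters are converted too (for example 'a' becomes 'u0061'), this fits strings like a better than mixed strings like b or c.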

You can use the unicode_escape codec to translate a unicode string to its escaped representation.

split_a = a.encode('unicode_escape').split('\\')

outputs:

['',
 'u03c3',
 'u03c4',
 'u03b7',
 'u03bd ',
 'u03a0',
 'u03bb',
 'u03b1',
 'u03c4',
 'u03b5',
 'u03af',
 'u03b1 ',
 'u03c4',
 'u03bf',
 'u03c5']
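The one-liner above relies on Python 2, where encode returns a str. In Python 3 it returns bytes, so splitting on a str separator fails. A possible variant that runs on both, and that also splits on spaces as the question asks, is sketched below; split_escaped is a hypothetical helper name, not part of any library.

```python
import re

def split_escaped(text):
    """Escape non-ASCII characters, then split on backslashes and spaces."""
    # encode() yields bytes in Python 3, so decode back to text before splitting
    escaped = text.encode('unicode_escape').decode('ascii')
    return re.split(r'[\\ ]', escaped)

c = u'sovi\xe9ticas excepto Georgia , inclusive las 3 rep\xfablicas que hab\xedan'
split_c = split_escaped(c)
print(split_c)
```

Note that unicode_escape also escapes literal backslashes, newlines and tabs in the input, so strings containing those would yield extra pieces.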
