
How to encode special characters in Python

I have a Python script that I invoke like this:

cat file.txt | python myscript.py

What I am having trouble with is encoding special characters in my script. I can't seem to find a simple one-line way to do this. My one thought was to do the encoding when I read the file from standard input.

    # How would I encode file.txt to read non-English strings?
    multilineList = sys.stdin.read().splitlines()

Specifically, here are the strings I am having trouble with. I'm shining the Python bat signal for help; for the life of me, I can't figure it out. Thanks again for the help. Please find the strings below:

('z\xc3\xa1mky')
('z\xc3\xa8ne')
('z\xc5\x82o\xc5\x9b\xc4\x87')
('\xc3\x81sv\xc3\xa1nyr\xc3\xa1r\xc3\xb3')
('\xc3\x84ltasj\xc3\xb6n')
('\xc3\x87etin')
('\xc3\x89')
('\xc3\x89chevanne')
('\xc3\x89mile')
('\xc3\x89milie')
('\xc3\x89phrem')
('\xc3\x89quemauville')
('\xc3\x89tat')
('\xc3\x89vrange')
('\xc3\x93g')
('\xc3\x96gmundar')
('\xc3\x96ljeit\xc3\xbc')
('\xc3\x96ster\xc3\xa5keranstalten')
('\xc3\x98gl\xc3\xa6nd')
('\xc3\x98stfold')
('\xc3\x9c\xc3\xa7')
('\xc3\xa7a')
('\xc3\xbc')
('\xc3\xbe\xc3\xa1ttr')
('\xc4\x80ka\xc5\x9ba')
('\xc4\x86ati\xc4\x87')
('\xc4\x86uk')
('\xc4\x86wik')
('\xc4\x8crmo\xc5\xa1njice')
('\xc4\x90uc')
('\xc4\x90\xe1\xba\xb7ng')
('\xc4\xa0azzah')
('\xc4\xb0lhan')
('\xc4\xb0sm\xc9\x99tli')
('\xc4\xb2')
('\xc5\x81azarz')
('\xc5\x81ojewek')
('\xc5\x81om\xc5\xbca')
('\xc5\x81uk\xc3\xb3w')
('\xc5\x8ci')
('\xc5\x8cshima')
('\xc5\x9awiatope\xc5\x82k')
('\xc5\x9awi\xc4\x99tokrzyskie')
('\xc5\xa0karoupka')
('\xc5\xa0\xc3\xbatovo')
('\xc5\xbberomin')
('\xc5\xbdeljko')
('\xc5\xbdupanja')
('\xc6\x8flib\xc9\x99yq\xc4\xb1\xc5\x9flaq')
('\xc6\x90n')
('\xc6\xba')
('\xca\xbbaqiva')
('\xce\x9b')
('\xce\x9csa')
('\xce\x9d')
('\xce\xa4')
('\xce\xb1')
('\xd0\x91\xd0\xbe\xd1\x80\xd0\xb8\xd1\x81')
('\xd0\x91\xd1\x8f\xd0\xb3\xd1\x81\xd1\x82\xd0\xb2\xd0\xbe')
('\xd0\x95\xd1\x84\xd0\xb8\xd0\xbc\xd0\xbe\xd0\xb2')
('\xd0\x95\xd1\x84\xd0\xb8\xd0\xbc\xd0\xbe\xd0\xb2\xd0\xb8\xd1\x87')
('\xd0\xa2\xd1\x83\xd0\xbb\xd1\x8c\xd1\x81\xd0\xba\xd0\xb0\xd1\x8f')
('\xd0\xb7\xd0\xb0\xd1\x82\xd0\xb2\xd0\xbe\xd1\x80\xd0\xb0')
('\xd0\xbe\xd0\xb1\xd0\xbb\xd0\xb0\xd1\x81\xd1\x82\xd1\x8c')
('\xd0\xbe\xd1\x82')
('\xd7\x99\xd7\x94\xd7\x93\xd7\x95\xd7\xaa')
('\xd7\x9b\xd7\x95\xd7\xa8\xd7\x93\xd7\x99\xd7\xa1\xd7\xaa\xd7\x90\xd7\x9f')
('\xd7\xa6')
('\xd8\xaa\xd9\x87\xd8\xb1\xd8\xa7\xd9\x86')
('\xd8\xaf\xd8\xa7\xd8\xb1\xd9\x81\xd9\x88\xd8\xb1')
('\xd8\xba\xd8\xb1\xd8\xa8')
('\xdb\xb8')
('\xe2\x80\x8bkinenk\xc5\x8den')
('\xe2\x80\x93')
('\xe2\x80\x98abd')
('\xe2\x80\x9cbaldy\xe2\x80\x9d')
('\xe2\x8a\xbf')
('\xe3\x81\x82\xe3\x81\x97\xe3\x81\x9f\xe3\x81\xae\xe3\x82\xb8\xe3\x83\xa7\xe3\x83\xbc')
('\xe3\x81\xa2\xe3\x81\x90\xe3\x82\x8c\xe3\x81\x84\xe3\x82\x93')
('\xe3\x82\x92')
('\xe3\x82\xa6\xe3\x83\xa9\xe3\x82\xb8\xe3\x82\xaa')
('\xe3\x83\xa2\xe3\x82\xb9\xe3\x83\xa9')
('\xe3\x8e\x90')
('\xe3\x8e\xa2')
('\xe5\x94\xb5')
('\xe5\x9c\x9c')
('\xe5\xbe\x90\xe5\xb7\x9e')
('\xe6\x9d\xb0\xe7\x90\x86\xe6\x98\x8e')
('\xe6\xb3\xb0\xe7\x8e\x8b')
('\xe9\xbe\x9f')
('\xea\xb9\x80\xec\x9d\xbc\xec\x84\xb1')

One option is to explicitly decode the input if you know the encoding to be UTF-8, as it appears to be in your sample:

import sys

# Python 2: sys.stdin.read() returns a byte string, so decode it to unicode.
stdin_input = sys.stdin.read()
stdin_input = stdin_input.decode("utf-8")
multilineList = stdin_input.splitlines()
for line in multilineList:
    print(line)
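
On Python 3, `str` has no `.decode()` method, so the same idea needs the raw bytes instead of the already-decoded text. The following is a minimal sketch, assuming Python 3 and UTF-8 input; the helper name `read_lines` is hypothetical, and a `BytesIO` object stands in for standard input so the example runs on its own.

```python
import io


def read_lines(byte_stream, encoding="utf-8"):
    """Read all bytes from a binary stream, decode them, and split into lines."""
    raw = byte_stream.read()       # bytes, not str
    text = raw.decode(encoding)    # raises UnicodeDecodeError on invalid input
    return text.splitlines()


# Stand-in for piped input; two of the sample strings from the question.
fake_stdin = io.BytesIO(b"z\xc3\xa1mky\n\xc3\x89mile\n")
lines = read_lines(fake_stdin)
for line in lines:
    print(line)
```

In a real script, passing `sys.stdin.buffer` (the binary stream underneath `sys.stdin`) in place of `fake_stdin` should give the same behaviour for piped input.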

You can encode strings with .encode() and decode them with .decode(), e.g.:

In [3]: c = 'šÖmè štrÍñg'
   ...: print(c)
   ...: 
   ...: enc = 'šÖmè štrÍñg'.encode(encoding='UTF-8')
   ...: print(enc)
   ...: 
   ...: print(enc.decode(encoding='UTF-8'))
šÖmè štrÍñg
b'\xc5\xa1\xc3\x96m\xc3\xa8 \xc5\xa1tr\xc3\x8d\xc3\xb1g'
šÖmè štrÍñg
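
Applied to a few of the byte strings from the question, decoding as UTF-8 recovers the readable text. This is a small sketch assuming Python 3, where byte strings are written with a `b` prefix; the variable names are illustrative only.

```python
# A few of the escaped byte strings from the question, as Python 3 bytes literals.
samples = [
    b"z\xc3\xa1mky",
    b"\xc3\x89mile",
    b"\xc5\x81om\xc5\xbca",
    b"\xd0\xbe\xd1\x82",
]

# Decoding turns the UTF-8 bytes into text (str).
decoded = [s.decode("utf-8") for s in samples]
print(decoded)

# Encoding the text again gives back the original bytes.
round_trip = [d.encode("utf-8") for d in decoded]
```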

Your data may be encoded differently from what Python assumes by default (ASCII in Python 2, usually UTF-8 in Python 3). If the input is UTF-16, you can fix it by decoding it as UTF-16. Note, though, that the byte sequences in your sample (such as \xc3\xa1 for á) are valid UTF-8, so if that is how the file is stored, pass 'utf-8' instead of 'utf-16' below.

Python 2:

multilineList = sys.stdin.read().decode('utf-16').splitlines()

Python 3:

multilineList = sys.stdin.buffer.read().decode('utf-16').splitlines()

Edit: I noticed you're using Python 2.7, so I've given code for both Python 2 and Python 3.
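
When it is unclear whether the input is UTF-16 or UTF-8, one heuristic is to look for a UTF-16 byte order mark (BOM) at the start of the data before picking a codec. The sketch below assumes Python 3; the function name `decode_input` is hypothetical, and the check is only a heuristic: UTF-16 data written without a BOM would not be detected.

```python
import codecs


def decode_input(raw):
    """Decode bytes as UTF-16 if a UTF-16 BOM is present, otherwise as UTF-8."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order and drops it.
        return raw.decode("utf-16")
    return raw.decode("utf-8")


# The same word stored both ways decodes to the same text.
from_utf8 = decode_input(b"z\xc3\xa1mky")
from_utf16 = decode_input("zámky".encode("utf-16"))  # encoder prepends a BOM
print(from_utf8, from_utf16)
```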

