How to read unicode files in Python 2.x (not UTF-8, unknown encoding)

I tried to find a way to read unicode files. I searched the Internet for a long time, but I couldn't find one. All I found were ways to read files encoded as, for example, UTF-8. I know that when I need to read UTF-8, I can use codecs:

codecs.open('unicode2.txt',encoding='utf-8')

Using this I can read UTF-8 files. But I want to know how to read unicode files. Many posts titled 'the way to read unicode files in python' actually describe how to read files encoded as UTF-8 or UTF-16.

Why has nobody explained how to read 'UNICODE' files?

This is an example of the hex values of a text file I am trying to read with Python.

This is Korean, "파이썬에서 한글 읽기":

(FF FE) 0C D3 74 C7 6C C3 D0 C5 1C C1 20 00 5C D5 00 AE 20 00 7D C7 30 AE

(FF FE) indicates the byte order, and every 2 bytes represent one character. As you can see, a space is written as '20 00', not '20'. In unicode, a space is written as '20 00', but in UTF-8 a space is written as '20'.

There is no way to call codecs with something like "codecs.open('unicode2.txt', encoding='unicode')".

Is there really no way to read "unicode" files in Python?

A disk file is a sequence of bytes that you can interpret as text if you know its character encoding, such as utf-8 or utf-16le. "unicode" is not a character encoding.

There Ain't No Such Thing As Plain Text.
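To make the difference between a character and its bytes concrete, here is a minimal sketch that encodes the same space character with three encodings. The variable names are only illustrative; the point is that '20 00' is how UTF-16 (little endian) stores a space, not how "unicode" stores it.

```python
# One character (code point U+0020), three different byte sequences.
space = u" "

as_utf8 = space.encode("utf-8")        # bytes: 20
as_utf16le = space.encode("utf-16le")  # bytes: 20 00
as_utf16be = space.encode("utf-16be")  # bytes: 00 20
```

The same code point yields different bytes depending on the encoding, which is why the encoding has to be known (or guessed) before a file can be read as text.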

Your example file might use the utf-16le encoding:

>>> text = u"파이썬"
>>> text.encode('utf-16le')
'\x0c\xd3t\xc7l\xc3'
>>> text.encode('utf-16le').encode('hex')
'0cd374c76cc3'
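The guess can also be checked in the other direction, by decoding the hex dump from the question. This is a small sketch, assuming the dump was copied correctly; `hex_dump`, `raw` and `text` are just illustrative names.

```python
import binascii
import codecs

# Hex dump copied from the question, including the leading FF FE bytes.
hex_dump = "FF FE 0C D3 74 C7 6C C3 D0 C5 1C C1 20 00 5C D5 00 AE 20 00 7D C7 30 AE"
raw = binascii.unhexlify(hex_dump.replace(" ", ""))

# The data starts with the UTF-16 little-endian byte order mark.
has_le_bom = raw.startswith(codecs.BOM_UTF16_LE)

# The 'utf-16' codec reads the BOM, picks the byte order, and drops the BOM.
text = raw.decode("utf-16")
```

If the guess is right, `text` should be the Korean sentence from the question, with no BOM character at the start.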

b'\xff\xfe' == codecs.BOM_UTF16_LE is the BOM for the UTF-16 (LE) character encoding. To read such a file, you could use the utf-16 encoding (BE or LE is chosen based on the BOM):

import codecs

with codecs.open('filename', encoding='utf-16') as file:
    text = file.read()
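If the encoding is not known in advance, one option is to look at the first bytes of the file for a BOM before opening it. The following is only a rough sketch: `guess_encoding_from_bom` and `read_text` are hypothetical helpers, not part of the standard library, and falling back to UTF-8 when there is no BOM is an assumption that will be wrong for some files.

```python
import codecs
import io

# Check the 4-byte UTF-32 BOMs first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding_from_bom(path, default="utf-8"):
    """Hypothetical helper: pick a codec name from the file's BOM, if any."""
    with open(path, "rb") as f:
        head = f.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    # No BOM found: this fallback is an assumption, not a detection.
    return default


def read_text(path):
    """Hypothetical helper: read a file as text using the guessed encoding."""
    with io.open(path, encoding=guess_encoding_from_bom(path)) as f:
        return f.read()
```

A BOM is only a hint. Files without one (for example legacy Korean encodings such as cp949) cannot be identified this way and need to be decoded with an encoding you know or have guessed by other means.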
