Reading "raw" Unicode strings in Python

Question

I am quite new to Python, so my question might be silly, but even after reading through a lot of threads I didn't find an answer.

I have a mixed source document that contains HTML, XML, LaTeX and other text formats, and I am trying to convert it into a LaTeX-only format.

Therefore, I have used Python to recognise the different commands with regular expressions and replace them with the appropriate LaTeX command. Everything has worked out fine so far.

Now I am left with some "raw-type" Unicode characters, such as Greek letters. Unfortunately, there are just too many to replace by hand, so I am looking for a smart way to handle these too. Is there a way for Python to recognise / read them? And how do I tell Python to recognise / read e.g. pi written as a Greek letter?

A minimal example of the code I use is:

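import re

# Read the whole source document into memory.
fh = open('SOURCE_DOCUMENT', 'r')
stuff = fh.read()
fh.close()

# Replace every match of the pattern 'READ' with 'REPLACE'.
new_stuff = re.sub('READ', 'REPLACE', stuff)

# Write the converted text to the LaTeX document.
fh = open('LATEX_DOCUMENT', 'w')
fh.write(new_stuff)
fh.close()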

I am not sure whether this is important, but I am using Python 2.6 on Windows.

I would be really glad if someone could give me a hint, at least about where to find the relevant information or how this might work. Or tell me whether I am completely wrong and Python can't do this job ...

Many thanks in advance.
Cheers,
Britta

Answer 1

You talk of ``raw'' Unicode strings. What does that mean? Unicode itself is not an encoding, but there are different encodings to store Unicode characters (read this post by Joel).

The open function in Python 3.0 takes an optional encoding argument that lets you specify the encoding, e.g. UTF-8 (a very common way to encode Unicode). In Python 2.x, have a look at the codecs module, which also provides an open function that allows specifying the encoding of the file.
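A minimal sketch of that idea, under a few assumptions: the source file is UTF-8, the mapping table GREEK_TO_LATEX and the helper names greek_to_latex and convert_file are made up for illustration, and io.open is swapped in for codecs.open because it takes the same encoding argument on both Python 2.6+ and Python 3. Each Greek letter is wrapped in \ensuremath{...} so the result is valid in both text and math mode; whether that is the right LaTeX output for a given document is a judgment call.

```python
import io
import re

# Hypothetical mapping from Greek letters to LaTeX commands; extend as needed.
GREEK_TO_LATEX = {
    u"\u03b1": u"\\alpha",
    u"\u03b2": u"\\beta",
    u"\u03c0": u"\\pi",
    u"\u03a9": u"\\Omega",
}

# One character class that matches any letter in the mapping.
_GREEK = re.compile(u"[" + u"".join(GREEK_TO_LATEX) + u"]")


def greek_to_latex(text):
    """Replace each mapped Greek letter in a Unicode string with its command."""
    # A function is used as the replacement so that the backslashes in the
    # LaTeX commands are not interpreted as regex escape sequences.
    return _GREEK.sub(
        lambda m: u"\\ensuremath{" + GREEK_TO_LATEX[m.group(0)] + u"}", text
    )


def convert_file(src_path, dst_path, encoding="utf-8"):
    """Decode the source file, convert it, and write it back encoded."""
    with io.open(src_path, "r", encoding=encoding) as src:
        text = src.read()  # a Unicode string, not raw bytes
    with io.open(dst_path, "w", encoding=encoding) as dst:
        dst.write(greek_to_latex(text))
```

With this in place, convert_file('SOURCE_DOCUMENT', 'LATEX_DOCUMENT') would play the role of the read / re.sub / write steps in the question, provided the encoding guess is correct.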

Edit: alternatively, why not just let those poor characters be, and specify the encoding of your LaTeX file at the top:

\usepackage[utf8]{inputenc}

(I never tried this, but I figure it should work. You may need to replace utf8 by utf8x, though.)

Answer 2

Please, first, read this:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Then, come back and ask questions.

Answer 3

You need to determine the "encoding" of the input document. Unicode has room for more than a million characters, but files can only store 8-bit values (0-255). So the Unicode text must be encoded in some way.
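To make that concrete, here is a small sketch showing how the single character pi turns into different byte sequences depending on the encoding, and what happens when the bytes are decoded with the wrong one. The variable names are only illustrative.

```python
PI = u"\u03c0"  # GREEK SMALL LETTER PI

# The same character stored with three different encodings.
utf8_bytes = PI.encode("utf-8")        # two bytes
utf16_bytes = PI.encode("utf-16-le")   # two bytes, but different ones
greek_bytes = PI.encode("iso-8859-7")  # one byte in the Greek 8-bit code page

# Decoding with the matching encoding gives the character back.
roundtrip = utf8_bytes.decode("utf-8")

# Decoding with the wrong encoding does not raise an error here;
# it silently produces two unrelated characters instead of pi.
mojibake = utf8_bytes.decode("latin-1")
```

This is why the encoding has to be known before the text is read: the bytes alone do not say which interpretation is the intended one.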

If the document is XML, the encoding should be declared in the first line (encoding="..."; "utf-8" is the default if there is no "encoding" field). For HTML, look for "charset".
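A rough sketch of looking for such a declaration, assuming the first chunk of the file has been read in binary mode (for example with open(path, 'rb').read(1024)). The function name declared_encoding and both patterns are illustrative, not a complete XML or HTML parser, and a mixed document like the one in the question may carry no declaration at all, or several conflicting ones.

```python
import re

# Matches encoding="..." inside an XML declaration such as
# <?xml version="1.0" encoding="ISO-8859-1"?>
_XML_DECL = re.compile(br'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')

# Matches charset=... as found in an HTML <meta> tag.
_HTML_CHARSET = re.compile(br'charset=["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)


def declared_encoding(head):
    """Return the encoding name declared in the given bytes, or None."""
    for pattern in (_XML_DECL, _HTML_CHARSET):
        match = pattern.search(head)
        if match:
            # Encoding names are plain ASCII; normalise them to lower case.
            return match.group(1).decode("ascii").lower()
    return None
```

A None result only means that nothing was declared in the bytes that were inspected, so a fallback is still needed.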

If all else fails, open the document in an editor where you can set the encoding (jEdit, for example). Try encodings until the text looks right. Then use this value as the encoding parameter for codecs.open() in Python.
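The trial-and-error step can be partly automated. The sketch below tries a list of candidate encodings on the raw bytes and returns the first one that decodes without an error. The function name and the candidate list are assumptions for illustration. Note that a successful decode does not prove the guess is right: latin-1 accepts every byte sequence, so it only makes sense as the last candidate, and the result should still be checked by eye.

```python
def first_working_encoding(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes raw without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        return name
    return None  # none of the candidates fit
```

UTF-8 is tried first because text in other 8-bit encodings rarely happens to be valid UTF-8, which makes a successful UTF-8 decode a fairly strong hint.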

3 answers

Answer 1
3 2009-05-26 10:09:09

Answer 2
1 2009-05-26 10:42:40

Answer 3
0 2009-05-26 10:39:30
