Reading "raw" Unicode strings in Python

I am quite new to Python, so my question might be silly, but even after reading through a lot of threads I didn't find an answer.

I have a mixed source document that contains HTML, XML, LaTeX and other text formats, and I am trying to convert it into a LaTeX-only format.

To do that, I have used Python to recognise the different commands with regular expressions and replace them with the appropriate LaTeX command. Everything has worked out fine so far.

Now I am left with some "raw-type" Unicode characters, such as the Greek letters. Unfortunately, there are far too many to replace by hand, so I am looking for a smart way to do this as well. Is there a way for Python to recognise / read them? And how do I tell Python to recognise / read, e.g., Pi written as a Greek letter?

A minimal example of the code I use is:

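import re

# Read the whole source document into memory.
fh = open('SOURCE_DOCUMENT', 'r')
stuff = fh.read()
fh.close()

# Replace every match of the pattern with its LaTeX equivalent.
new_stuff = re.sub('READ', 'REPLACE', stuff)

# Write the converted text to the output file.
fh = open('LATEX_DOCUMENT', 'w')
fh.write(new_stuff)
fh.close()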

I am not sure whether this is important, but I am using Python 2.6 on Windows.

I would be really glad if someone could give me a hint, at least about where to find the relevant information or how this might work. Or tell me whether I am completely wrong and Python can't do this job ...

Many thanks in advance.
Cheers,
Britta

You talk of "raw" Unicode strings. What does that mean? Unicode itself is not an encoding, but there are different encodings to store Unicode characters (read this post by Joel).

The open function in Python 3.0 takes an optional encoding argument that lets you specify the encoding, e.g. UTF-8 (a very common way to encode Unicode). In Python 2.x, have a look at the codecs module, which also provides an open function that allows specifying the encoding of the file.
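A minimal sketch of how that could look for the Greek letters, assuming the source file is UTF-8. The mapping table GREEK_TO_LATEX, the helper names greek_to_latex and convert_file, and the choice to wrap each command in $...$ are all made up for illustration; the table would need to be extended with whatever characters the document actually contains.

```python
import codecs
import re

# Hypothetical mapping from Unicode characters to LaTeX commands.
GREEK_TO_LATEX = {
    u'\u03b1': u'\\alpha',
    u'\u03b2': u'\\beta',
    u'\u03c0': u'\\pi',
}

# One character class that matches any key of the mapping.
GREEK_RE = re.compile(u'[' + u''.join(GREEK_TO_LATEX) + u']')


def greek_to_latex(text):
    """Replace each mapped Greek letter with its LaTeX math command."""
    # A function is used as the replacement so that the backslashes in the
    # LaTeX commands are not interpreted as regex escape sequences.
    return GREEK_RE.sub(lambda m: u'$' + GREEK_TO_LATEX[m.group(0)] + u'$', text)


def convert_file(src, dst, encoding='utf-8'):
    """Decode src with the given encoding, convert it, and write dst."""
    fh = codecs.open(src, 'r', encoding=encoding)
    try:
        text = fh.read()  # a Unicode string, not raw bytes
    finally:
        fh.close()
    out = codecs.open(dst, 'w', encoding=encoding)
    try:
        out.write(greek_to_latex(text))
    finally:
        out.close()
```

The important part is that the text is decoded before the regular expression runs, so that each Greek letter is one character rather than a sequence of bytes.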

Edit: alternatively, why not just let those poor characters be, and specify the encoding of your LaTeX file at the top:

\usepackage[utf8]{inputenc}

(I never tried this, but I figure it should work. You may need to replace utf8 by utf8x, though.)

You need to determine the "encoding" of the input document. Unicode has room for over a million characters, but files can only store 8-bit values (0-255). So the Unicode text must be encoded in some way.
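As a small illustration of that point, here is the Greek letter pi encoded in three different ways. The choice of encodings is only an example; the variable names are made up.

```python
pi = u'\u03c0'  # GREEK SMALL LETTER PI as a single Unicode character

utf8_bytes = pi.encode('utf-8')        # two bytes: 0xCF 0x80
utf16_bytes = pi.encode('utf-16-le')   # two different bytes: 0xC0 0x03
greek_bytes = pi.encode('iso-8859-7')  # one byte in the legacy Greek code page

# Decoding with the matching encoding gives the character back.
assert utf8_bytes.decode('utf-8') == pi

# Decoding with the wrong encoding raises no error here; it silently
# produces two unrelated characters instead of pi.
garbled = utf8_bytes.decode('latin-1')
```

The same character ends up as different bytes on disk, which is why the reader has to be told which encoding was used.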

If the document is XML, the encoding should be declared in the first line (encoding="..."; "utf-8" is the default if there is no "encoding" field). For HTML, look for "charset".
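A rough sketch of looking for those declarations from Python. The helper name declared_encoding and both patterns are assumptions for illustration, not a full XML or HTML parser, and the approach only works when the declaration itself is stored in an ASCII-compatible encoding (so not UTF-16, for example).

```python
import re

# Both patterns work on raw bytes, because the encoding is not known yet.
XML_DECL_RE = re.compile(
    br'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')
HTML_CHARSET_RE = re.compile(
    br'charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)


def declared_encoding(raw, default=None):
    """Return the encoding named near the start of raw, or default."""
    head = raw[:1024]  # declarations are expected close to the top
    for pattern in (XML_DECL_RE, HTML_CHARSET_RE):
        match = pattern.search(head)
        if match:
            return match.group(1).decode('ascii').lower()
    return default
```

Since the source document mixes several formats, a declaration found this way is only a hint and may not hold for the whole file.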

If all else fails, open the document in an editor where you can set the encoding (jEdit, for example). Try them until the text looks right. Then use this value as the encoding parameter for codecs.open() in Python.
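The same trial and error can be approximated in Python. This is only a sketch: the helper name preview_decodings and the default list of candidate encodings are assumptions, and a decode that succeeds does not prove the encoding is right, so the results still have to be checked by eye.

```python
def preview_decodings(raw, candidates=('utf-8', 'cp1252', 'iso-8859-7')):
    """Return (encoding, text) pairs for every candidate that decodes cleanly."""
    results = []
    for name in candidates:
        try:
            results.append((name, raw.decode(name)))
        except UnicodeDecodeError:
            pass  # these bytes are not valid in this encoding
    return results
```

A typical use would be to read the file in binary mode, pass the first few hundred bytes to preview_decodings, and print each result next to its encoding name to see which one looks right.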


 