
Search and replace characters in a file with Python

I am trying to do transliteration: I need to replace every source character in an English file with its equivalent in another language, taken from a dictionary of Unicode strings that I define in my source code. I can already read the English file character by character. How do I look up each character's mapping in that dictionary and make sure the result is written to a new, transliterated output file? Thank you :).

The translate method of Unicode objects is the simplest and fastest way to perform the transliteration you require. (I assume you're using Unicode, not plain byte strings, which would make it impossible to have characters such as 'पत्र'!)

All you have to do is lay out your transliteration dictionary in a precise way, as specified in the docs to which I pointed you:

  • each key must be an integer, the codepoint of a Unicode character; for example, 0x0904 is the codepoint for ऄ, AKA "DEVANAGARI LETTER SHORT A", so to transliterate it you would use the integer 0x0904 (equivalently, decimal 2308) as the key in the dict. (For a table with the codepoints for many South-Asian scripts, see this pdf.)

  • the corresponding value can be a Unicode ordinal, a Unicode string (which is presumably what you'll use for your transliteration task, e.g. u'a' if you want to transliterate the Devanagari letter short A into the English letter 'a'), or None (if during the "transliteration" you simply want to remove instances of that Unicode character).

Characters that aren't found as keys in the dict are passed on untouched from the input to the output.
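The dict layout described above can be sketched roughly as follows. This assumes Python 3, where every str is already Unicode (so the u'' prefix is optional). The name translit_table and its four entries are made up for illustration and are not a real transliteration scheme:

```python
# Hypothetical, deliberately tiny table. Keys are integer code points;
# values show the three allowed kinds: string, ordinal, and None.
translit_table = {
    0x0904: "a",       # DEVANAGARI LETTER SHORT A -> a string
    0x0915: "ka",      # DEVANAGARI LETTER KA -> a multi-character string
    0x0964: ord("."),  # DEVANAGARI DANDA -> an ordinal (here, full stop)
    0x094D: None,      # DEVANAGARI SIGN VIRAMA -> removed from the output
}


def transliterate(text, table=translit_table):
    """Apply the table; characters that are not keys pass through unchanged."""
    return text.translate(table)


sample = "\u0904\u0915\u094d\u0964 ok"
result = transliterate(sample)
print(result)
```

The trailing " ok" is there only to show that characters missing from the table come through untouched.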

Once your dict is laid out like that, output_text = input_text.translate(thedict) does all the transliteration for you, and pretty darn fast, too. You can apply this to blocks of Unicode text of any size that will fit comfortably in memory; basically, doing one text file at a time will be just fine on most machines (e.g., the wonderful and huge Mahabharata takes at most a few tens of megabytes in any of the freely downloadable forms, Sanskrit [[cross-linked with both Devanagari and roman-transliterated forms]] or English translation, available from this site).
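For the file-to-file case in the question, a minimal sketch might look like this. It again assumes Python 3 and UTF-8 files; the function name transliterate_file, the paths and the two-entry TABLE are hypothetical. str.maketrans is used only as a convenience: it turns single-character keys into the integer code points that translate expects.

```python
# Hypothetical English -> Devanagari table, for illustration only.
TABLE = str.maketrans({
    "a": "\u0905",  # DEVANAGARI LETTER A
    "k": "\u0915",  # DEVANAGARI LETTER KA
})


def transliterate_file(src_path, dst_path, table, encoding="utf-8"):
    """Read the whole source file, translate it, and write the result."""
    with open(src_path, encoding=encoding) as src:
        text = src.read()  # fine for any file that fits in memory
    with open(dst_path, "w", encoding=encoding) as dst:
        dst.write(text.translate(table))
```

A call would then look like transliterate_file("input.txt", "output.txt", TABLE), with both file names being placeholders.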

Note: Updated after clarifications from the questioner. Please read the comments from the OP attached to this answer.

Something like this:

# split_into_syllables is a placeholder for a splitter you write yourself
for syllable in split_into_syllables(input_text):
    output_file.write(d[syllable])

Here output_file is a file object, open for writing. d is a dictionary where the keys are your source characters and the values are the output characters. You can also try to read your file line by line instead of reading it all in at once.
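The answer does not say how the splitting step works, so here is one possible stand-in, offered as a sketch rather than as the author's method: a greedy longest-match split against the keys of d, applied line by line. It assumes Python 3; split_into_units and transliterate_lines are hypothetical names, and longest-match is not real syllabification. Unlike d[syllable], it passes unmapped text through instead of raising KeyError.

```python
def split_into_units(text, table):
    """Yield the longest key of `table` found at each position of `text`.

    Characters that start no key are yielded one at a time, unchanged.
    """
    max_len = max(map(len, table), default=1)
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                break
        else:
            chunk = text[i]  # no key matches here; keep the character
        yield chunk
        i += len(chunk)


def transliterate_lines(lines, output_file, table):
    """Process an iterable of lines (e.g. an open file) one line at a time."""
    for line in lines:
        units = split_into_units(line, table)
        output_file.write("".join(table.get(u, u) for u in units))
```

With an open input file as lines, only one line is held in memory at a time, which is the line-by-line variant mentioned above.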
