Search and replace characters in a file with Python

I am trying to do transliteration: I need to replace every English source character in a file with its equivalent in another language, taken from a Unicode dictionary defined in my source code. I can already read the English file character by character. How do I look up each character's mapping in that dictionary and write the result to a new, transliterated output file? Thank you :).

The translate method of Unicode objects is the simplest and fastest way to perform the transliteration you require. (I assume you're using Unicode, not plain byte strings, which would make it impossible to have characters such as 'पत्र'!)

All you have to do is lay out your transliteration dictionary in a precise way, as specified in the docs to which I pointed you:

  • each key must be an integer, the codepoint of a Unicode character; for example, 0x0904 is the codepoint for ऄ, AKA "DEVANAGARI LETTER SHORT A", so to transliterate it you would use the integer 0x0904 (equivalently, decimal 2308) as the key in the dict. (For a table with the codepoints for many South-Asian scripts, see this pdf.)

  • the corresponding value can be a Unicode ordinal, a Unicode string (which is presumably what you'll use for your transliteration task, e.g. u'a' if you want to transliterate the Devanagari letter short A into the English letter 'a'), or None (if during the "transliteration" you want to simply remove instances of that Unicode character).

Characters that aren't found as keys in the dict are passed on untouched from the input to the output.
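
To make the layout concrete, here is a minimal sketch, assuming Python 3 (where every str is Unicode and the u'' prefix is optional). The four-entry table is hypothetical and only there to show one example of each kind of value; it is not a real transliteration scheme.

```python
# Hypothetical four-entry table, for illustration only; a real scheme needs many more entries.
table = {
    0x0905: "a",         # DEVANAGARI LETTER A -> a Unicode string
    ord("क"): "ka",      # ord() returns the integer code point (0x0915 here)
    0x0964: ord("."),    # DEVANAGARI DANDA -> a Unicode ordinal
    0x094D: None,        # DEVANAGARI SIGN VIRAMA -> removed from the output
}

# The space and the Latin letters have no entry, so they pass through untouched.
result = "अक। ok".translate(table)
print(result)
```

Writing the keys with ord() rather than hex literals tends to keep the table readable, at the cost of having the source characters in your source file.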

Once your dict is laid out like that, output_text = input_text.translate(thedict) does all the transliteration for you -- and pretty darn fast, too. You can apply this to blocks of Unicode text of any size that will fit comfortably in memory -- basically, doing one text file at a time will be just fine on most machines (e.g., the wonderful -- and huge -- Mahabharata takes at most a few tens of megabytes in any of the freely downloadable forms -- Sanskrit [[cross-linked with both Devanagari and roman-transliterated forms]], English translation -- available from this site).
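
For the direction asked about in the question (English in, another script out), the whole-file version might look like the sketch below. It assumes Python 3 and UTF-8 files; transliterate_file, TABLE, the file names and the two-letter mapping are all hypothetical names made up for this illustration. str.maketrans is used only as a convenience so the keys can be written as characters instead of integers.

```python
import tempfile
from pathlib import Path

# Hypothetical Latin -> Devanagari mapping, for illustration only.
# str.maketrans turns single-character keys into the integer code points translate() expects.
TABLE = str.maketrans({"a": "अ", "k": "क"})

def transliterate_file(src_path, dst_path, table):
    """Read src_path as UTF-8, translate it in one pass, and write the result to dst_path."""
    text = Path(src_path).read_text(encoding="utf-8")
    Path(dst_path).write_text(text.translate(table), encoding="utf-8")

# Demo on throwaway files so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "input.txt"
    dst = Path(tmp) / "output.txt"
    src.write_text("aka!\n", encoding="utf-8")
    transliterate_file(src, dst, TABLE)
    result = dst.read_text(encoding="utf-8")
```

Passing the encoding explicitly matters here: the platform default is not always UTF-8, and the output would otherwise fail or be garbled on some machines.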

Note: Updated after clarifications from questioner. Please read the comments from the OP attached to this answer.

Something like this:

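# split_into_syllables() is a placeholder: str has no such method, so supply your own splitter.
for syllable in split_into_syllables(input_text):
    # Fall back to the original syllable when it has no mapping.
    output_file.write(d.get(syllable, syllable))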

Here output_file is a file object, open for writing. d is a dictionary where the keys are your source characters (or syllables) and the values are the output characters. You can also read your file line by line instead of reading it all in at once.
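
If the source units are longer than one character, translate() no longer applies, because its keys must be single code points. One possible way to do the syllable lookup, offered as an assumption rather than as what this answer had in mind, is a regular expression that tries the longest keys first. The dictionary contents and the transliterate name below are hypothetical.

```python
import re

# Hypothetical mapping with multi-character keys, for illustration only.
d = {"kha": "ख", "ka": "क", "a": "अ"}

# Put longer keys first in the alternation so "kha" wins over "ka" and "a".
pattern = re.compile("|".join(re.escape(k) for k in sorted(d, key=len, reverse=True)))

def transliterate(text):
    """Replace every occurrence of a key in d; leave all other text untouched."""
    return pattern.sub(lambda m: d[m.group(0)], text)

result = transliterate("kha ka a x")
print(result)
```

The result can then be written with output_file.write(result), one line or one whole file at a time.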
