
Replace unicode characters only once in Python

I'm trying to create a small script that replaces a set of characters in a file like this:

# coding=utf-8

import codecs
import os
import sys

args = sys.argv

if len(args) > 1:
    subtitleFileName = args[1]
    newSubtitleFileName = subtitleFileName + "_new"

    replacePairs = {
        u"ã": "ă",
        u"Ã": "Ă",
        u"º": "ș",
        u"ª": "Ș",
        u"þ": "ț",
        u"Þ": "Ț",
    }

    if os.path.isfile(subtitleFileName):
        oldSubtitleFile = codecs.open(subtitleFileName, "rb", "ISO-8859-1")

        subtitleContent = oldSubtitleFile.read()
        subtitleContent = codecs.encode(subtitleContent, "utf-8")

        for key, value in replacePairs.iteritems():
            subtitleContent = subtitleContent.replace(codecs.encode(key, "utf-8"), value)

        oldSubtitleFile.close()

        newSubtitleFile = open(newSubtitleFileName, "wb")
        newSubtitleFile.write(subtitleContent)
        newSubtitleFile.close()

        os.remove(subtitleFileName)
        os.rename(newSubtitleFileName, subtitleFileName)

        print "Done!"
    else:
        print "Missing subtitle file!"
else:
    print "Missing arguments!"

and it works ok for the first run.

So if I have a file containing Eºti sigur cã vrei sã ºtergi fiºierele?, after running the script on that file I get Ești sigur că vrei să ștergi fișierele?, which is what I want. But if I run it multiple times I get:

EÈti sigur cÄ vrei sÄ Ètergi fiÈierele?

EĂÂti sigur cĂÂ vrei sĂÂ ĂÂtergi fiĂÂierele?

EÄÂĂÂti sigur cÄÂĂÂ vrei sÄÂĂÂ ÄÂĂÂtergi fiÄÂĂÂierele?

EĂÂĂÂÄÂĂÂti sigur cĂÂĂÂÄÂĂÂ vrei sĂÂĂÂÄÂĂÂ ĂÂĂÂÄÂĂÂtergi fiĂÂĂÂÄÂĂÂierele?

And I don't understand why. How does it find characters that no longer exist in the file (ã, º, etc.) in order to replace them? And why does it replace them with yet other characters?

Simple: on the first run you're reading ISO-8859-1 and writing UTF-8. On the second run you do exactly the same thing, even though the input is now UTF-8, not ISO-8859-1. From then on the search and replace stops working as intended: the decode-as-ISO-8859-1/encode-as-UTF-8 round trip garbles the text a little further on every run, and the garbage it produces can even re-create your search characters (by the third run the mangled text contains Ã, which your script dutifully replaces with Ă). That is why characters that "don't exist anymore" still appear to be found and replaced.

This test mimics your 2nd iteration, interpreting UTF-8 as ISO-8859-1:

# -*- coding: utf-8 -*-
print "Ești sigur că vrei să ștergi fișierele?".decode("ISO-8859-1")
>> EÈti sigur cÄ vrei sÄ Ètergi fiÈierele?
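
You can see why at the byte level (a quick check in a Python 2 shell; Python's "ISO-8859-1" codec maps every byte 0x00-0xFF straight to the code point with the same value):

>>> u"ș".encode("utf-8")             # ș is two bytes in UTF-8
'\xc8\x99'
>>> '\xc8\x99'.decode("ISO-8859-1")  # read those bytes back as ISO-8859-1
u'\xc8\x99'                          # È (0xC8) plus an invisible C1 control (0x99)

So each two-byte UTF-8 sequence turns into one visible Latin-1 character plus, often, an invisible control character, which is why "ș" seems to become "È" and a letter appears to vanish.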

The next iteration looks like:

print "Ești sigur că vrei să ștergi fișierele?".decode("ISO-8859-1").encode("utf-8").decode("ISO-8859-1")
>> EÃÂti sigur cÃÂ vrei sÃÂ ÃÂtergi fiÃÂierele?

Heed @Daniel's advice to decode once, replace Unicode with Unicode, then encode once. I've also been informed that it's best to use io.open() rather than codecs, as it's Python 3 compatible and solves a problem with universal newlines.
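
A minimal sketch of that decode-once/encode-once approach with io.open() (the filenames and the two-entry mapping here are just placeholders for illustration):

# -*- coding: utf-8 -*-
import io

replacePairs = {u"ã": u"ă", u"º": u"ș"}  # sample of the full mapping

# decode once: io.open() with an encoding returns unicode text
with io.open("subtitle.srt", "r", encoding="ISO-8859-1") as f:  # hypothetical filename
    text = f.read()

# replace Unicode with Unicode
for key, value in replacePairs.items():
    text = text.replace(key, value)

# encode once, on write
with io.open("subtitle_new.srt", "w", encoding="utf-8") as f:
    f.write(text)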

Don't work with encoded content. Only encode when writing the new file:

import codecs
import os
import sys

args = sys.argv

if len(args) > 1:
    subtitleFileName = args[1]
    newSubtitleFileName = subtitleFileName + "_new"

    replacePairs = {
        u"ã": u"ă",
        u"Ã": u"Ă",
        u"º": u"ș",
        u"ª": u"Ș",
        u"þ": u"ț",
        u"Þ": u"Ț",
    }

    if os.path.isfile(subtitleFileName):
        # decode exactly once: codecs.open() yields unicode on read
        with codecs.open(subtitleFileName, "rb", "ISO-8859-1") as oldSubtitleFile:
            subtitleContent = oldSubtitleFile.read()

        # replace Unicode with Unicode; no bytes involved
        for key, value in replacePairs.iteritems():
            subtitleContent = subtitleContent.replace(key, value)

        # encode exactly once, when writing the new file
        with codecs.open(newSubtitleFileName, "wb", "utf-8") as newSubtitleFile:
            newSubtitleFile.write(subtitleContent)

        os.remove(subtitleFileName)
        os.rename(newSubtitleFileName, subtitleFileName)

        print "Done!"
    else:
        print "Missing subtitle file!"
else:
    print "Missing arguments!"

It is incorrect to use the "ISO-8859-1" character encoding on "utf-8" content: the very first time you run your script, it takes a text file (presumably "ISO-8859-1"-encoded) and saves it as "utf-8" while replacing certain Unicode characters.

When you run the conversion a second time, it takes "utf-8" content and tries to interpret it as "ISO-8859-1", which is wrong.

To avoid the confusion, make the replacements separately from the change of character encoding. That way the replacements are idempotent.
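
A quick way to see the idempotence (a small self-check using a subset of the mapping; since the replacement is pure Unicode-to-Unicode and no target character is also a key, a second pass finds nothing left to change):

# -*- coding: utf-8 -*-
replacements = {u"ã": u"ă", u"º": u"ș", u"þ": u"ț"}  # subset of the mapping

def fix(text):
    for key, value in replacements.items():
        text = text.replace(key, value)
    return text

sample = u"Eºti sigur cã vrei sã ºtergi fiºierele?"
assert fix(fix(sample)) == fix(sample)  # applying it twice changes nothing more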

To make the replacements, you could use the fileinput module and unicode.translate():

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Replace some characters in 'iso-8859-1'-encoded files."""
from __future__ import print_function

import fileinput  # read files given on the command-line and/or stdin

replacements = {
    u"ã": u"ă",
    u"Ã": u"Ă",
    u"º": u"ș",
    u"ª": u"Ș",
    u"þ": u"ț",
    u"Þ": u"Ț",
}
# unicode.translate() takes a mapping of ordinals: key => ord(key)
replacements = dict(zip(map(ord, replacements.keys()), replacements.values()))
for line in fileinput.input(openhook=fileinput.hook_encoded("iso-8859-1")):
    print(line.translate(replacements), end="")  # each line already ends with a newline

To control the encoding of the output file, you could set PYTHONIOENCODING, e.g. in bash:

$ PYTHONIOENCODING=utf-8 python replace-chars.py iso-8859-1.txt >replaced.utf-8

This command both replaces the characters and transcodes the input from "iso-8859-1" to "utf-8".
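
If you'd rather not depend on an environment variable, one alternative (a sketch, assuming Python 2) is to wrap sys.stdout in a UTF-8 StreamWriter inside the script itself:

# -*- coding: utf-8 -*-
import codecs
import sys

# force UTF-8 output regardless of the locale/terminal settings
sys.stdout = codecs.getwriter("utf-8")(sys.stdout)
print(u"Ești sigur că vrei să ștergi fișierele?")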

If the input filename.txt is already broken (no single character encoding decodes it correctly), you could try the ftfy module to fix common encoding errors:

$ ftfy filename.txt >filename.utf8.txt
