Replace unicode characters only once in Python

I'm trying to write a small script that replaces a set of characters in a file, as follows:

# coding=utf-8

import codecs
import os
import sys

args = sys.argv

if len(args) > 1:
    subtitleFileName = args[1]
    newSubtitleFileName = subtitleFileName + "_new"

    replacePairs = {
        u"ã": "ă",
        u"Ã": "Ă",
        u"º": "ș",
        u"ª": "Ș",
        u"þ": "ț",
        u"Þ": "Ț",
    }

    if os.path.isfile(subtitleFileName):
        oldSubtitleFile = codecs.open(subtitleFileName, "rb", "ISO-8859-1")

        subtitleContent = oldSubtitleFile.read()
        subtitleContent = codecs.encode(subtitleContent, "utf-8")

        for key, value in replacePairs.iteritems():
            subtitleContent = subtitleContent.replace(codecs.encode(key, "utf-8"), value)

        oldSubtitleFile.close()

        newSubtitleFile = open(newSubtitleFileName, "wb")
        newSubtitleFile.write(subtitleContent)
        newSubtitleFile.close()

        os.remove(subtitleFileName)
        os.rename(newSubtitleFileName, subtitleFileName)

        print "Done!"
    else:
        print "Missing subtitle file!"
else:
    print "Missing arguments!"

It works fine the first time I run it.

So if I have a file containing Eºti sigur cã vrei sã ºtergi fiºierele?, then after running the script on it I get Ești sigur că vrei să ștergi fișierele?, which is what I want. But if I run it more than once, I get:

EÈti sigur cÄ vrei sÄ Ètergi fiÈierele?

EĂÂti sigur cĂÂ vrei sĂÂ ĂÂtergi fiĂÂierele?

EÄĂtisigurcÄĂÂvreisÄĂÂÄTERÂTERGIFIÄĂERIERELE?

埃塞俄比亞sigurcĂÂĂÄĂÂvvreisĂÂĂÂĂÂĂÂĂÂĂÂTERGIfiĂÂĂÄĂÄIERELE?

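And I don't understand why. How can it find characters that no longer exist in the file (ã, º, etc.) and replace them? And why are they replaced with yet other characters?

It's simple: on the first run you read ISO-8859-1 and write UTF-8. On the second run you do exactly the same thing, even though the input is now UTF-8 rather than ISO-8859-1. On subsequent runs the search and replace is not really what is at work; the mis-decoding is.

This test mimics your second iteration, interpreting UTF-8 as ISO-8859-1: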

# -*- coding: utf-8 -*-
print "Ești sigur că vrei să ștergi fișierele?".decode("ISO-8859-1")
>> EÈti sigur cÄ vrei sÄ Ètergi fiÈierele?

The next iteration looks like:

print "Ești sigur că vrei să ștergi fișierele?".decode("ISO-8859-1").encode("utf-8").decode("ISO-8859-1")
>> EÃÂti sigur cÃÂ vrei sÃÂ ÃÂtergi fiÃÂierele?
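
The two snippets above are Python 2. As a rough Python 3 equivalent (a sketch only; the variable names are chosen here for illustration), the same mis-decoding can be reproduced by encoding to UTF-8 and decoding the resulting bytes as ISO-8859-1. In this example every non-ASCII character becomes two characters on each pass, some of them invisible control characters, which is why the text keeps growing:

```python
text = "Ești sigur că vrei să ștergi fișierele?"

# Second run of the script: UTF-8 bytes are read back as ISO-8859-1.
once = text.encode("utf-8").decode("iso-8859-1")
# Third run: the already damaged text goes through the same cycle again.
twice = once.encode("utf-8").decode("iso-8859-1")

for label, value in [("original", text), ("1 pass", once), ("2 passes", twice)]:
    # ascii() escapes non-ASCII characters, so control characters show up too.
    print(label, len(value), ascii(value))
```

The damage from one pass is reversible as long as nothing has been replaced in between: `once.encode("iso-8859-1").decode("utf-8")` gives back the original text.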

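Take note of @Daniel's advice: decode once, replace Unicode with Unicode, then encode once. I've also been told that it is better to use io.open() rather than codecs, because it is compatible with Python 3 and takes care of universal newlines.

Don't work with encoded content. Encode only when writing the new file: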
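
As a rough sketch of the io.open() variant (not code taken from any of the answers), the function below decodes once, replaces Unicode with Unicode, and encodes once when writing. The name convert_subtitles, the source_encoding parameter, and the choice to overwrite the file in place are assumptions made for illustration. It is written for Python 3 and is expected to behave the same on Python 2.7.

```python
# -*- coding: utf-8 -*-
import io

# Same mapping as in the question, with Unicode text on both sides.
REPLACE_PAIRS = {
    u"ã": u"ă",
    u"Ã": u"Ă",
    u"º": u"ș",
    u"ª": u"Ș",
    u"þ": u"ț",
    u"Þ": u"Ț",
}


def convert_subtitles(path, source_encoding="iso-8859-1"):
    """Hypothetical helper: decode once, replace, encode once."""
    # newline="" is a choice made here: it leaves the file's line endings as they are.
    with io.open(path, "r", encoding=source_encoding, newline="") as src:
        content = src.read()  # decoded text, not bytes

    for old, new in REPLACE_PAIRS.items():
        content = content.replace(old, new)

    with io.open(path, "w", encoding="utf-8", newline="") as dst:
        dst.write(content)  # encoded to UTF-8 only here
```

Like the original script, this should still be run only once per file, because a second run would again decode UTF-8 bytes as ISO-8859-1.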

import codecs
import os
import sys

args = sys.argv

if len(args) > 1:
    subtitleFileName = args[1]
    newSubtitleFileName = subtitleFileName + "_new"

    replacePairs = {
        u"ã": u"ă",
        u"Ã": u"Ă",
        u"º": u"ș",
        u"ª": u"Ș",
        u"þ": u"ț",
        u"Þ": u"Ț",
    }

    if os.path.isfile(subtitleFileName):
        with codecs.open(subtitleFileName, "rb", "ISO-8859-1") as oldSubtitleFile:
            subtitleContent = oldSubtitleFile.read()

        for key, value in replacePairs.iteritems():
            subtitleContent = subtitleContent.replace(key, value)

        with codecs.open(newSubtitleFileName, "wb", "utf-8") as newSubtitleFile:
            newSubtitleFile.write(subtitleContent)

        os.remove(subtitleFileName)
        os.rename(newSubtitleFileName, subtitleFileName)

        print "Done!"
    else:
        print "Missing subtitle file!"
else:
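    print "Missing arguments!"

It is incorrect to use the "ISO-8859-1" character encoding on "utf-8" content: the first time you run the script, it takes a text file (presumably encoded in "ISO-8859-1") and saves it as "utf-8" while replacing some Unicode characters.

Then, when you run the conversion a second time, it takes the "utf-8" content and tries to interpret it as "ISO-8859-1", which is wrong.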
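
One possible way to notice that a file has already been converted is to try a strict UTF-8 decode before treating the bytes as ISO-8859-1. This is only a heuristic sketch, and looks_like_utf8 is a hypothetical helper, not something from the answers. It relies on the assumption that ISO-8859-1 text containing accented characters is rarely valid UTF-8. A pure-ASCII file also passes the check, which is harmless here because it contains nothing to replace.

```python
# -*- coding: utf-8 -*-


def looks_like_utf8(raw_bytes):
    """Hypothetical check: True if the bytes decode cleanly as UTF-8."""
    try:
        raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


latin1_bytes = u"Eºti sigur cã vrei sã ºtergi fiºierele?".encode("iso-8859-1")
utf8_bytes = u"Ești sigur că vrei să ștergi fișierele?".encode("utf-8")

print(looks_like_utf8(latin1_bytes))  # the untouched ISO-8859-1 file: False
print(looks_like_utf8(utf8_bytes))    # the file after the first run: True
```

A script could skip the conversion when the check returns True, instead of mis-decoding the file a second time.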

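To avoid the confusion, do the replacement separately from changing the character encoding. That way, the replacement is idempotent.

To do the replacement, you can use the fileinput module and unicode.translate():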
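
To illustrate what idempotent means here, the sketch below applies the replacement to already decoded text twice and checks that the second application changes nothing. It works because none of the replacement characters is itself a key in the mapping. The names PAIRS, TABLE, and replace_chars are made up for this example, which assumes Python 3.

```python
PAIRS = {
    "ã": "ă",
    "Ã": "Ă",
    "º": "ș",
    "ª": "Ș",
    "þ": "ț",
    "Þ": "Ț",
}
# str.translate() expects code points as keys.
TABLE = {ord(old): new for old, new in PAIRS.items()}


def replace_chars(text):
    """Replace the mis-mapped characters in decoded text."""
    return text.translate(TABLE)


damaged = "Eºti sigur cã vrei sã ºtergi fiºierele?"
fixed = replace_chars(damaged)

# Applying the replacement again leaves the text unchanged.
assert replace_chars(fixed) == fixed
print(fixed)
```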

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Replace some characters in 'iso-8859-1'-encoded files."""
import fileinput # read files given on the command-line and/or stdin

replacements = {
    u"ã": u"ă",
    u"Ã": u"Ă",
    u"º": u"ș",
    u"ª": u"Ș",
    u"þ": u"ț",
    u"Þ": u"Ț",
}
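# translate() expects code points as keys: key => ord(key)
replacements = dict((ord(key), value) for key, value in replacements.items())
for line in fileinput.input(openhook=fileinput.hook_encoded("iso-8859-1")):
    # each line keeps its trailing newline; strip it because print() adds one
    print(line.translate(replacements).rstrip(u"\n"))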

To control the encoding of the output file, you can set PYTHONIOENCODING in bash:

$ PYTHONIOENCODING=utf-8 python replace-chars.py iso-8859-1.txt >replaced.utf-8

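This command both replaces the characters and converts the input from "iso-8859-1" to "utf-8".

If the input filename.txt is corrupted (no single character encoding decodes it correctly), you can try the ftfy module to fix common encoding errors: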

$ ftfy filename.txt >filename.utf8.txt
