Python finds nonexisting character in file, replaces with nonintended character (encoding issue with non-English characters)

I created a script in Python to fix the wrongly encoded Turkish characters in an .srt file, e.g. 'ý' is replaced by the correct character, 'ı'.

I open the file (read), iterate over the lines to .replace('ý', 'ı'), then write the new set of lines back to the file with 'w', encoding='utf8'. It works great the first time. The issue is that each subsequent run messes up the fixed characters by replacing each of them with 2 other characters. Can provide more info if needed!

Part of input:

yakýn deðillerdi, ama
bir þeyler yapmak istedim

Output the first time around:

yakın değillerdi, ama
bir şeyler yapmak istedim

Output the second time around:

yakın değillerdi, ama
bir ÅŸeyler yapmak istedim

Output the third time around:

yakın değillerdi, ama
bir ÅŸeyler yapmak istedim

And it gets worse every time it runs through. Thoughts? If I had to guess, the characters I'm finding ('ý') match with the ('ı') already in the file, then replace it with ('ı') which is wrongly-encoded into ('ı')? It's also not a systematic change every time (see second-->third iteration) so I'm stumped. I'm a bit of a newbie so please excuse any "obvious" knowledge I might not have!

edit: The code, as requested:

import os

directoryPath = 'D:\\tv\\b99'

fileTypes = ['.srt']

fullFilePaths = []

def get_filepaths(directory, filetype):
    """
    This function will generate the file names in a directory
    tree by walking the tree either top-down or bottom-up. For each
    directory in the tree rooted at directory top (including top itself),
    it yields a 3-tuple (dirpath, dirnames, filenames).
    """
    filePathslist = []
    for root, directories, files in os.walk(directory):
        for filename in files:
            # Join the two strings in order to form the full filepath.
            filepath = os.path.join(root, filename)
            # include only the specific file types, except their hidden/shadow files
            if filepath.endswith(filetype) and not filename.startswith('.'):
                filePathslist.append(filepath)  # Add it to the list.
    return filePathslist

n=0
def replaceChars(folderAsListOfPaths):
    """
    This function takes a list as argument, containing file paths.
    The file is read line by line, and for each of the "special" 
    characters in Turkish that get encoded incorrectly, the appropriate 
    replacement - shown below - is made, and the existing file is overwritten.
    ('ý'->'ı') / ('Ý'->'İ') / ('þ'->'ş') / ('Þ'->'Ş') / ('ð'->'ğ')
    The filenames are printed when the replacement is done, for confirmation.
    """

    # read file line by line
    file = open(folderAsListOfPaths[n], "r")
    lines = file.readlines()

    newFileContent = ''
    for line in lines:
        origLine = line
        fixedLine = origLine.replace('ý', 'ı')
        fixedLine = fixedLine.replace('Ý', 'İ')
        fixedLine = fixedLine.replace('þ', 'ş')
        fixedLine = fixedLine.replace('Þ', 'Ş')
        fixedLine = fixedLine.replace('ð', 'ğ')
        newFileContent += fixedLine
    file.close()

    newFile = open(folderAsListOfPaths[n], 'w', encoding='utf8')
    # print(newFileContent)
    newFile.write(newFileContent)
    newFile.close()

    cleaned_name = folderAsListOfPaths[n].replace(directoryPath, '')
    cleaned_name = cleaned_name.replace('\\', '')
    print(cleaned_name)


for type in fileTypes: 
    fullFilePaths.extend(get_filepaths(directoryPath, type))
# filled the fullFilePaths list with the files

print('Finished with files:')

for file in fullFilePaths:  # for every file in this folder
    replaceChars(fullFilePaths) # replace the characters
    n+=1    # move onto the next file

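Error description: the source file is encoded in iso-8859-9 (Latin-5, Turkish), but the script opens it without an encoding argument, so Python decodes it with the platform default (most likely cp1252 on a Western Windows install). That is why 'ı' shows up as 'ý'. After the first run the file is UTF-8, yet it is still read with the same default, so each later run turns every multi-byte character into two or more wrong ones.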
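To make the mechanism concrete, here is a small in-memory sketch of two consecutive runs of the script. It assumes the platform default codec is cp1252, which the question does not state; the sample text and variable names are only illustrative.

```python
# Bytes as they would sit on disk in the original iso-8859-9 file.
raw = 'yakın şey'.encode('iso-8859-9')

# Run 1: the file is read with the wrong codec (cp1252 is an assumption).
run1_read = raw.decode('cp1252')
print(run1_read)                      # 'ı' and 'ş' appear as 'ý' and 'þ'

# The script's replacements repair the text, and it is written as UTF-8.
run1_fixed = run1_read.replace('ý', 'ı').replace('þ', 'ş')
run1_bytes = run1_fixed.encode('utf-8')

# Run 2: the file is UTF-8 now, but it is still read as cp1252,
# so each two-byte character is decoded as two separate characters.
run2_read = run1_bytes.decode('cp1252')
print(run2_read)                      # 'ş' has become 'ÅŸ'
```

Because the second run writes this already-garbled text out as UTF-8 again, every further run multiplies the damage.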

Option 1: Read File in Correct Encoding

with open(file_path, 'r', encoding='iso-8859-9') as f:
    text = f.read()  # characters are decoded correctly, so no .replace() calls are needed
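A fuller sketch of Option 1 that is also safe to run more than once might look like the following. The helper name convert_to_utf8 is hypothetical, and the code assumes that any file which is not valid UTF-8 is iso-8859-9; neither comes from the question.

```python
import os
import tempfile

def convert_to_utf8(path, source_encoding='iso-8859-9'):
    """Re-encode one file to UTF-8. Returns True if the file was rewritten."""
    with open(path, 'rb') as f:
        raw = f.read()
    try:
        raw.decode('utf-8')   # already valid UTF-8 (or plain ASCII): leave it alone
        return False
    except UnicodeDecodeError:
        pass
    # Assumption: anything that is not UTF-8 is in source_encoding.
    text = raw.decode(source_encoding)
    with open(path, 'wb') as f:
        f.write(text.encode('utf-8'))
    return True

# Demo on a throwaway file that starts out as iso-8859-9.
fd, demo_path = tempfile.mkstemp(suffix='.srt')
with os.fdopen(fd, 'wb') as f:
    f.write('yakın değillerdi, ama\n'.encode('iso-8859-9'))

first_run = convert_to_utf8(demo_path)    # converts the file
second_run = convert_to_utf8(demo_path)   # detects UTF-8 and does nothing
with open(demo_path, encoding='utf-8') as f:
    result = f.read()
os.remove(demo_path)
print(first_run, second_run, result)
```

Working on bytes and checking for valid UTF-8 first is what keeps a second run from touching files that were already converted.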

Option 2: Handle Specific Subtitles for Modifications

Use this custom fix_turkish_encoding function to repair text that has already been decoded with the wrong codec:

original_subtitle = 'yakýn deðillerdi, ama\nbir þeyler yapmak istedim'

def fix_turkish_encoding(s):
    # Assumption: the text was wrongly decoded as cp1252 (a common Windows default)
    wrong_encoding = 'cp1252'
    source_encoding = 'iso-8859-9'
    # Recover the original bytes, then decode them with the real encoding
    return s.encode(wrong_encoding).decode(source_encoding)

# yakın değillerdi, ama
# bir şeyler yapmak istedim
print(fix_turkish_encoding(original_subtitle))
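Files that have already been through a second run contain a different kind of damage: the 'ÅŸ' in the question's output looks like UTF-8 bytes that were decoded as cp1252. Under that assumption, a single level of this damage can be undone as sketched below. The function name is hypothetical, and text damaged by several runs would need the step repeated.

```python
def undo_utf8_as_cp1252(s):
    # Assumption: s is UTF-8 text that was wrongly decoded as cp1252.
    # Re-encoding as cp1252 recovers the UTF-8 bytes, which then decode properly.
    return s.encode('cp1252').decode('utf-8')

damaged = 'bir ÅŸeyler yapmak istedim'
repaired = undo_utf8_as_cp1252(damaged)
print(repaired)
```

If the string contains characters that cp1252 cannot encode, or the recovered bytes are not valid UTF-8, the call raises a UnicodeError, which is a useful sign that the assumption does not hold for that text.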
