
Python finds nonexistent character in file, replaces with unintended character (encoding issue with non-English characters)

I created a script in Python to fix the wrongly encoded Turkish characters in a .srt file, e.g. 'ý' replaced by the correct character, 'ı'.

I open the file (read), iterate over the lines to .replace('ý', 'ı'), then write the new set of lines to a new file with 'w', encoding='utf8'. It works great the first time. The issue is that each subsequent run messes up the fixed character by replacing it with 2 other characters. Can provide more info if needed!

Part of input:

yakýn deðillerdi, ama
bir þeyler yapmak istedim

Output the first time around:

yakın değillerdi, ama
bir şeyler yapmak istedim

Output the second time around:

yakın değillerdi, ama
bir ÅŸeyler yapmak istedim

Output the third time around:

yakın değillerdi, ama
bir ÅŸeyler yapmak istedim

And it gets worse every time it runs through. Thoughts? If I had to guess, the characters I'm finding ('ý') match with the ('ı') already in the file, then replace it with ('ı') which is wrongly encoded into ('ı')? It's also not a systematic change every time (see second-->third iteration) so I'm stumped. I'm a bit of a newbie so please excuse any "obvious" knowledge I might not have!

Edit: The code, as requested:

import os

directoryPath = 'D:\\tv\\b99'

fileTypes = ['.srt']

fullFilePaths = []

def get_filepaths(directory, filetype):
    """
    This function will generate the file names in a directory
    tree by walking the tree either top-down or bottom-up. For each
    directory in the tree rooted at directory top (including top itself),
    it yields a 3-tuple (dirpath, dirnames, filenames).
    """
    filePathslist = []
    for root, directories, files in os.walk(directory):
        for filename in files:
            # Join the two strings in order to form the full filepath.
            filepath = os.path.join(root, filename)
            # include only the specific file types, except their hidden/shadow files
            if filepath.endswith(filetype) and not filename.startswith('.'):
                filePathslist.append(filepath)  # Add it to the list.
    return filePathslist

n=0
def replaceChars(folderAsListOfPaths):
    """
    This function takes a list as argument, containing file paths.
    The file is read line by line, and for each of the "special" 
    characters in Turkish that get encoded incorrectly, the appropriate 
    replacement - shown below - is made, and the existing file is overwritten.
    ('ý'->'ı') / ('Ý'->'İ') / ('þ'->'ş') / ('Þ'->'Ş') / ('ð'->'ğ')
    The filenames are printed when the replacement is done, for confirmation.
    """

    # read file line by line
    file = open(folderAsListOfPaths[n], "r")
    lines = file.readlines()

    newFileContent = ''
    for line in lines:
        origLine = line
        fixedLine = origLine.replace('ý', 'ı')
        fixedLine = fixedLine.replace('Ý', 'İ')
        fixedLine = fixedLine.replace('þ', 'ş')
        fixedLine = fixedLine.replace('Þ', 'Ş')
        fixedLine = fixedLine.replace('ð', 'ğ')
        newFileContent += fixedLine
    file.close()

    newFile = open(folderAsListOfPaths[n], 'w', encoding='utf8')
    # print(newFileContent)
    newFile.write(newFileContent)
    newFile.close()

    cleaned_name = folderAsListOfPaths[n].replace(directoryPath, '')
    cleaned_name = cleaned_name.replace('\\', '')
    print(cleaned_name)


for type in fileTypes: 
    fullFilePaths.extend(get_filepaths(directoryPath, type))
# filled the fullFilePaths list with the files

print('Finished with files:')

for file in fullFilePaths:  # for every file in this folder
    replaceChars(fullFilePaths) # replace the characters
    n+=1    # move onto the next file

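Error description: The source encoding is iso-8859-9. The script opens the file without an encoding argument, so Python decodes it with the platform default instead of iso-8859-9. It then writes UTF-8, and the next run misreads those UTF-8 bytes with the same default, which is why the text degrades further on every pass.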
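To make the mechanism concrete, here is a small sketch of what the same bytes look like under different decodings. It assumes the platform default was cp1252 (a common default on Western Windows installs, and consistent with the 'ÅŸ' seen in the question's second output); the variable names are only illustrative.

```python
# Bytes that stand for ı, ğ, ş in ISO-8859-9 (Latin-5, Turkish).
raw = bytes([0xFD, 0xF0, 0xFE])

# Decoded with the real source encoding, the Turkish letters appear.
as_turkish = raw.decode('iso-8859-9')

# Decoded with cp1252 (assumed platform default), the same bytes
# turn into the Western European letters seen in the input sample.
as_western = raw.decode('cp1252')

# Second run: the file now holds UTF-8, but it is read as cp1252 again,
# so every two-byte UTF-8 sequence becomes two unrelated characters.
second_run = 'ş'.encode('utf-8').decode('cp1252')

print(as_turkish)  # ığş
print(as_western)  # ýðþ
print(second_run)  # ÅŸ
```

The replace calls in the question only happen to work on the first run because 'ý', 'ð' and 'þ' are what cp1252 shows for those bytes; after the file is rewritten as UTF-8, there is nothing left for them to match.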

Option 1: Read File in Correct Encoding

with open(file_path, 'r', encoding='iso-8859-9') as f:
    # Read file; Turkish characters are decoded correctly, no replace() needed
    content = f.read()
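Building on Option 1, the following is a minimal sketch of converting a file in place so that running it repeatedly is harmless. The helper name convert_to_utf8 and the "try UTF-8 first" check are assumptions for illustration, not part of the original answer; the check is a heuristic that treats any file that already decodes as UTF-8 as converted.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(path, source_encoding='iso-8859-9'):
    """Rewrite the file at `path` as UTF-8; return True if it was changed."""
    raw = Path(path).read_bytes()
    try:
        # If the bytes are already valid UTF-8, assume a previous run
        # converted the file and leave it untouched.
        raw.decode('utf-8')
        return False
    except UnicodeDecodeError:
        text = raw.decode(source_encoding)
    Path(path).write_bytes(text.encode('utf-8'))
    return True


# Demonstration on a temporary file holding ISO-8859-9 bytes.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / 'sample.srt'
    sample.write_bytes('yakın değillerdi'.encode('iso-8859-9'))
    first_run = convert_to_utf8(sample)   # converts the file
    second_run = convert_to_utf8(sample)  # detects UTF-8, does nothing
    result = sample.read_text(encoding='utf-8')

print(first_run, second_run, result)
```

Working on bytes rather than text mode also avoids any newline translation, so the subtitle file's line endings stay as they were.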

Option 2: Handle Specific Subtitles for Modifications

Use this custom-made fix_turkis_encoding to fix the encoding of text that has already been read with the wrong codec:

original_subtitle = 'yakýn deðillerdi, ama\nbir þeyler yapmak istedim'

def fix_turkis_encoding(s):
    # Assumption: the text was wrongly decoded as cp1252 (a common Windows default)
    wrong_encoding = 'cp1252'
    source_encoding = 'iso-8859-9'
    # Recover the original bytes, then decode them with the real source encoding
    return s.encode(wrong_encoding).decode(source_encoding)

# yakın değillerdi, ama
# bir şeyler yapmak istedim
print(fix_turkis_encoding(original_subtitle))
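For files that have already been through the script more than once, the damage is of a different kind: valid UTF-8 that was re-read with the wrong codec, as in the 'ÅŸ' from the question. A rough sketch of undoing that, again assuming cp1252 was the wrong codec; undo_mojibake and max_rounds are hypothetical names, and the loop simply stops as soon as a round trip no longer works.

```python
def undo_mojibake(text, wrong_encoding='cp1252', max_rounds=5):
    """Peel off layers of 'UTF-8 read as cp1252' damage, one per round."""
    for _ in range(max_rounds):
        try:
            repaired = text.encode(wrong_encoding).decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            # The text no longer looks like mis-decoded UTF-8; stop here.
            break
        if repaired == text:
            break
        text = repaired
    return text


print(undo_mojibake('bir ÅŸeyler yapmak istedim'))
```

This is best tried on a copy of the file first, since text that legitimately contains such character pairs would be altered as well.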

