Python finds a nonexistent character in a file and replaces it with an unintended character (encoding issue with non-English characters)
I created a script in Python to fix the wrongly encoded Turkish characters in an .srt file, e.g. 'ý' is replaced by the correct character, 'ı'.
I open the file (read), iterate over the lines to .replace('ý', 'ı'), then write the new set of lines to a new file with 'w', encoding='utf8'.
It works great the first time. The issue is that each subsequent run messes up the fixed characters by replacing each one with 2 other characters. Can provide more info if needed!
Part of input:
yakýn deðillerdi, ama
bir þeyler yapmak istedim
Output the first time around:
yakın değillerdi, ama
bir şeyler yapmak istedim
Output the second time around:
yakın değillerdi, ama
bir ÅŸeyler yapmak istedim
Output the third time around:
yakın değillerdi, ama
bir ÅŸeyler yapmak istedim
And it gets worse every time it runs through. Thoughts? If I had to guess, the characters I'm finding ('ý') match with the ('ı') already in the file, then replace it with ('ı') which is wrongly-encoded into ('ı')? It's also not a systematic change every time (see second-->third iteration) so I'm stumped.
I'm a bit of a newbie so please excuse any "obvious" knowledge I might not have!
Edit: the code, as requested:
import os

directoryPath = 'D:\\tv\\b99'
fileTypes = ['.srt']
fullFilePaths = []


def get_filepaths(directory, filetype):
    """
    This function will generate the file names in a directory
    tree by walking the tree either top-down or bottom-up. For each
    directory in the tree rooted at directory top (including top itself),
    it yields a 3-tuple (dirpath, dirnames, filenames).
    """
    filePathslist = []
    for root, directories, files in os.walk(directory):
        for filename in files:
            # Join the two strings in order to form the full filepath.
            filepath = os.path.join(root, filename)
            # include only the specific file types, except their hidden/shadow files
            if filepath.endswith(filetype) and not filename.startswith('.'):
                filePathslist.append(filepath)  # Add it to the list.
    return filePathslist


n = 0


def replaceChars(folderAsListOfPaths):
    """
    This function takes a list as argument, containing file paths.
    The file is read line by line, and for each of the "special"
    characters in Turkish that get encoded incorrectly, the appropriate
    replacement - shown below - is made, and the existing file is overwritten.
    ('ý'->'ı') / ('Ý'->'İ') / ('þ'->'ş') / ('Þ'->'Ş') / ('ð'->'ğ')
    The filenames are printed when the replacement is done, for confirmation.
    """
    # read file line by line
    file = open(folderAsListOfPaths[n], "r")
    lines = file.readlines()
    newFileContent = ''
    for line in lines:
        origLine = line
        fixedLine = origLine.replace('ý', 'ı')
        fixedLine = fixedLine.replace('Ý', 'İ')
        fixedLine = fixedLine.replace('þ', 'ş')
        fixedLine = fixedLine.replace('Þ', 'Ş')
        fixedLine = fixedLine.replace('ð', 'ğ')
        newFileContent += fixedLine
    file.close()
    newFile = open(folderAsListOfPaths[n], 'w', encoding='utf8')
    # print(newFileContent)
    newFile.write(newFileContent)
    newFile.close()
    cleaned_name = folderAsListOfPaths[n].replace(directoryPath, '')
    cleaned_name = cleaned_name.replace('\\', '')
    print(cleaned_name)


for type in fileTypes:
    fullFilePaths.extend(get_filepaths(directoryPath, type))
# filled the fullFilePaths list with the files

print('Finished with files:')
for file in fullFilePaths:  # for every file in this folder
    replaceChars(fullFilePaths)  # replace the characters
    n += 1  # move onto the next file
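Error description: the source encoding is iso-8859-9. The script opens the file with open(..., "r") and no encoding argument, so Python falls back to the platform default (on a Western-locale Windows machine that is typically cp1252), which is why 'ı' shows up as 'ý'. After the first run the file is UTF-8, so reading it again with that same default decodes every multi-byte character as two separate characters, and each further run compounds the damage.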
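To make the failure mode concrete, here is a minimal sketch that reproduces the question's outputs in memory. It assumes the platform default encoding was cp1252, which is a guess based on the 'ÅŸ' seen in the second output, not something the question states.

```python
# Bytes as they would be stored in the original ISO-8859-9 subtitle file.
raw = 'bir şeyler'.encode('iso-8859-9')

# First run: the file is read with the (assumed) default cp1252,
# so the byte 0xFE is shown as 'þ' instead of 'ş'.
first_read = raw.decode('cp1252')
fixed = first_read.replace('þ', 'ş')

# The script then writes the fixed text back as UTF-8.
on_disk = fixed.encode('utf-8')

# Second run: the UTF-8 bytes are read with cp1252 again, so the
# two-byte sequence for 'ş' becomes two separate characters.
second_read = on_disk.decode('cp1252')

print(first_read)   # bir þeyler
print(fixed)        # bir şeyler
print(second_read)  # bir ÅŸeyler
```

Since the second read contains none of 'ý', 'þ' or 'ð', the replacements do nothing and the mojibake is written back as UTF-8, ready to be mangled again on the next run.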
Option 1: Read the file with the correct encoding
with open(file_path, 'r', encoding='iso-8859-9') as f:
    content = f.read()  # Read file
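Building on Option 1, the following is a rough sketch of a conversion that is safe to run more than once. The helper name convert_to_utf8 is made up for illustration, and the "already valid UTF-8 means already converted" check is a heuristic assumption: it works for typical Turkish text, but it is not a general encoding detector.

```python
from pathlib import Path


def convert_to_utf8(path, source_encoding='iso-8859-9'):
    """Rewrite `path` as UTF-8. Return True if converted, False if skipped.

    Hypothetical helper: a file that already decodes as UTF-8 is assumed
    to have been converted earlier and is left untouched.
    """
    raw = Path(path).read_bytes()
    try:
        raw.decode('utf-8')
        return False  # already valid UTF-8 (or plain ASCII): nothing to do
    except UnicodeDecodeError:
        pass
    # Decode with the real source encoding, then write the text as UTF-8.
    # Working on bytes keeps the original line endings intact.
    text = raw.decode(source_encoding)
    Path(path).write_bytes(text.encode('utf-8'))
    return True
```

With this in place, the loop over fullFilePaths could call convert_to_utf8(path) for each file, and no character-by-character replace is needed, because decoding with the right encoding already yields 'ı', 'ş' and 'ğ'.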
Option 2: Handle specific subtitles for modifications
Use this custom-made fix_turkis_encoding function to fix the encoding:
original_subtitle = 'yakýn deðillerdi, ama\nbir þeyler yapmak istedim'


def fix_turkis_encoding(s):
    # Assumption: the text was mis-decoded as cp1252 (a common Windows default).
    # Re-encoding with that codec recovers the original bytes, which are then
    # decoded with the real source encoding.
    wrong_encoding = 'cp1252'
    source_encoding = 'iso-8859-9'
    return s.encode(wrong_encoding).decode(source_encoding)


# Prints:
# yakın değillerdi, ama
# bir şeyler yapmak istedim
print(fix_turkis_encoding(original_subtitle))
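For files that have already been through the script more than once, the same re-encode trick can in principle be run in the other direction. This is only a sketch: undo_mojibake_once is a hypothetical helper, and it assumes the UTF-8 bytes were misread as cp1252, which may not match every affected file.

```python
def undo_mojibake_once(s, wrong_encoding='cp1252'):
    """Reverse one round of 'UTF-8 bytes decoded as wrong_encoding'.

    Hypothetical helper; returns the input unchanged when the text does
    not look like this kind of mojibake.
    """
    try:
        return s.encode(wrong_encoding).decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s


print(undo_mojibake_once('bir ÅŸeyler yapmak istedim'))
# bir şeyler yapmak istedim
```

A file damaged by several runs would need this applied once per extra run, so keeping a backup of the original .srt files is the safer route where one exists.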