I created a Python script to fix the wrongly-encoded Turkish characters in an .srt file, e.g. 'ý' is replaced by the correct character, 'ı'.
I open the file (read), iterate over the lines to .replace('ý', 'ı'), then write the new set of lines to a new file with 'w', encoding='utf8'. It works great the first time. The issue is that each subsequent run messes up the fixed characters by replacing each of them with 2 other characters. Can provide more info if needed!
Part of input:
yakýn deðillerdi, ama
bir þeyler yapmak istedim
Output the first time around:
yakın değillerdi, ama
bir şeyler yapmak istedim
Output the second time around:
yakın değillerdi, ama
bir ÅŸeyler yapmak istedim
Output the third time around:
yakın değillerdi, ama
bir ÅŸeyler yapmak istedim
And it gets worse every time it runs through. Thoughts? If I had to guess, the characters I'm finding ('ý') match with the ('ı') already in the file, then replace it with ('ı') which is wrongly-encoded into ('ı')? It's also not a systematic change every time (see second-->third iteration) so I'm stumped. I'm a bit of a newbie so please excuse any "obvious" knowledge I might not have!
edit: The code, as requested:
import os

directoryPath = 'D:\\tv\\b99'
fileTypes = ['.srt']
fullFilePaths = []


def get_filepaths(directory, filetype):
    """
    This function will generate the file names in a directory
    tree by walking the tree either top-down or bottom-up. For each
    directory in the tree rooted at directory top (including top itself),
    it yields a 3-tuple (dirpath, dirnames, filenames).
    """
    filePathslist = []
    for root, directories, files in os.walk(directory):
        for filename in files:
            # Join the two strings in order to form the full filepath.
            filepath = os.path.join(root, filename)
            # include only the specific file types, except their hidden/shadow files
            if filepath.endswith(filetype) and not filename.startswith('.'):
                filePathslist.append(filepath)  # Add it to the list.
    return filePathslist


n=0


def replaceChars(folderAsListOfPaths):
    """
    This function takes a list as argument, containing file paths.
    The file is read line by line, and for each of the "special"
    characters in Turkish that get encoded incorrectly, the appropriate
    replacement - shown below - is made, and the existing file is overwritten.
    ('ý'->'ı') / ('Ý'->'İ') / ('þ'->'ş') / ('Þ'->'Ş') / ('ð'->'ğ')
    The filenames are printed when the replacement is done, for confirmation.
    """
    # read file line by line
    file = open(folderAsListOfPaths[n], "r")
    lines = file.readlines()
    newFileContent = ''
    for line in lines:
        origLine = line
        fixedLine = origLine.replace('ý', 'ı')
        fixedLine = fixedLine.replace('Ý', 'İ')
        fixedLine = fixedLine.replace('þ', 'ş')
        fixedLine = fixedLine.replace('Þ', 'Ş')
        fixedLine = fixedLine.replace('ð', 'ğ')
        newFileContent += fixedLine
    file.close()
    newFile = open(folderAsListOfPaths[n], 'w', encoding='utf8')
    # print(newFileContent)
    newFile.write(newFileContent)
    newFile.close()
    cleaned_name = folderAsListOfPaths[n].replace(directoryPath, '')
    cleaned_name = cleaned_name.replace('\\', '')
    print(cleaned_name)


for type in fileTypes:
    fullFilePaths.extend(get_filepaths(directoryPath, type))
# filled the fullFilePaths list with the files
print('Finished with files:')
for file in fullFilePaths:  # for every file in this folder
    replaceChars(fullFilePaths)  # replace the characters
    n+=1  # move onto the next file
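Error Description: The source encoding is iso-8859-9, but the script opens the file without an encoding argument, so Python decodes it with the platform default (on Windows this is usually cp1252, in which the byte for 'ı' shows up as 'ý'). After the first run the file is UTF-8, yet it is still read with that default encoding, so every multi-byte character turns into two wrong characters, and each further run compounds the damage.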
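The effect described above can be reproduced without any files. This is a minimal sketch, assuming the platform default encoding is cp1252 (the question does not state the locale, so that part is a guess); the variable names are only for illustration.

```python
# Assumption: open() without an encoding argument falls back to cp1252,
# which is typical on Western-locale Windows installations.
correct = "şeyler"

# The first run wrote the file as UTF-8; the second run reads those bytes as cp1252.
after_second_read = correct.encode("utf-8").decode("cp1252")
print(after_second_read)  # ÅŸeyler

# Writing that text back as UTF-8 and misreading it again compounds the damage.
after_third_read = after_second_read.encode("utf-8").decode("cp1252")
print(len(correct), len(after_second_read), len(after_third_read))
```

Each pass turns every non-ASCII character into two characters, which matches the 'ÅŸeyler' seen in the second output of the question.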
Option 1: Read File in Correct Encoding
# file_path is the path to one .srt file
with open(file_path, 'r', encoding='iso-8859-9') as f:
    content = f.read()  # Read file
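Building on Option 1, here is a hedged sketch of a conversion that is safe to run more than once. The helper name convert_to_utf8 is hypothetical, and the check "already valid UTF-8 means already converted" is an assumption that holds for typical iso-8859-9 Turkish text but is not a general encoding detector.

```python
from pathlib import Path


def convert_to_utf8(path, source_encoding="iso-8859-9"):
    """Re-encode one file to UTF-8. Return True if the file was rewritten."""
    raw = Path(path).read_bytes()
    try:
        # Assumption: a file that already decodes as UTF-8 was converted earlier.
        raw.decode("utf-8")
        return False
    except UnicodeDecodeError:
        pass
    text = raw.decode(source_encoding)
    # Write bytes directly so the original line endings are preserved.
    Path(path).write_bytes(text.encode("utf-8"))
    return True
```

With this approach no character replacement is needed at all: decoding with the right encoding already yields 'ı', 'ş' and 'ğ', and a second run leaves the file untouched.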
Option 2: Handle Specific Subtitles for Modifications
Use this custom-made fix_turkis_encoding function to fix the encoding:
original_subtitle = 'yakýn deðillerdi, ama\nbir þeyler yapmak istedim'

def fix_turkis_encoding(s):
    # Assumption: the text was wrongly decoded as cp1252 (a common Windows default)
    wrongly_assumed_encoding = 'cp1252'
    source_encoding = 'iso-8859-9'
    # Recover the original bytes, then decode them with the real encoding
    return s.encode(wrongly_assumed_encoding).decode(source_encoding)

# yakın değillerdi, ama
# bir şeyler yapmak istedim
print(fix_turkis_encoding(original_subtitle))
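Since the question is about repeated runs, a string-level repair can also be guarded so that a second pass does nothing. This is a rough sketch, not the answer's own method: repair_once is a hypothetical helper, and cp1252 as the wrongly assumed encoding is an assumption about the asker's system.

```python
def repair_once(text, misread_as="cp1252", actual="iso-8859-9"):
    """Undo one wrong decode; leave text alone if it cannot be a cp1252 misread."""
    try:
        # Recover the original bytes from the wrongly decoded string.
        raw = text.encode(misread_as)
    except UnicodeEncodeError:
        # Characters such as 'ı' or 'ş' do not exist in cp1252, so the
        # text was evidently not produced by that misread; return it unchanged.
        return text
    return raw.decode(actual)


broken = "bir þeyler yapmak istedim"
fixed = repair_once(broken)
print(fixed)
print(repair_once(fixed) == fixed)
```

The guard works per string, so a line that mixes repaired and unrepaired characters is left unchanged rather than partially fixed.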