
Handling Unicode filenames in Python 3.4 on Windows

I'm trying to find a reliable way to scan files on Windows in Python, while allowing for the possibility that there may be various Unicode code points in the filenames. I've seen several proposed solutions to this problem, but none of them work for all of the actual issues that I've encountered in scanning filenames created by real-world software and users.

The code sample below is an attempt to isolate and demonstrate the core issue. It creates three files in a subfolder with the sorts of variations I've encountered, and then attempts to scan through that folder and display each filename followed by the file's contents. It crashes on the attempt to read the third test file, with OSError: [Errno 22] Invalid argument.

import os

# create files in .\temp that demonstrate various issues encountered in the wild
tempfolder = os.getcwd() + '\\temp'
if not os.path.exists(tempfolder):
    os.makedirs(tempfolder)
print('file contents', file=open('temp/simple.txt','w'))
print('file contents', file=open('temp/with a ® symbol.txt','w'))
print('file contents', file=open('temp/with these chars ΣΑΠΦΩ.txt','w'))

# goal is to scan the files in a manner that allows for printing
# the filename as well as opening/reading the file ...
for root,dirs,files in os.walk(tempfolder.encode('UTF-8')):
    for filename in files:
        fullname = os.path.join(tempfolder.encode('UTF-8'), filename)
        print(fullname)
        print(open(fullname,'r').read())

As it says in the code, I just want to be able to display the filenames and open/read the files. Regarding display of the filename, I don't care whether the Unicode characters are rendered correctly for the special cases. I just want to print the filename in a manner that uniquely identifies which file is being processed, and doesn't throw an error for these unusual sorts of filenames.

If you comment out the final line of code, the approach shown here will display all three filenames with no errors. But it won't open the file with miscellaneous Unicode in the name.

Is there a single approach that will reliably display/open all three of these filename variations in Python? I'm hoping there is, and my limited grasp of Unicode subtleties is preventing me from seeing it.

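Pass str paths instead of bytes: on Windows, Python 3.4 handles bytes paths through the ANSI code page, so filenames containing characters outside that code page cannot round-trip. The following works, provided you save the source file in the declared encoding and use an IDE or terminal whose encoding supports the characters being displayed. The encoding does not have to be UTF-8; the declaration at the top of the file describes the encoding of the source file only.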

#coding:utf8
import os

# create files in .\temp that demonstrate various issues encountered in the wild
tempfolder = os.path.join(os.getcwd(), 'temp')
if not os.path.exists(tempfolder):
    os.makedirs(tempfolder)
for name in ('simple.txt', 'with a ® symbol.txt', 'with these chars ΣΑΠΦΩ.txt'):
    # the with block closes each file as soon as it is written
    with open(os.path.join(tempfolder, name), 'w') as f:
        print('file contents', file=f)

# scan the files in a manner that allows for printing
# the filename as well as opening/reading the file;
# passing a str to os.walk yields str names back
for root, dirs, files in os.walk(tempfolder):
    for filename in files:
        # join with root (not tempfolder) so subfolders would also resolve
        fullname = os.path.join(root, filename)
        print(fullname)
        with open(fullname, 'r') as f:
            print(f.read())

Output:

c:\temp\simple.txt
file contents

c:\temp\with a ® symbol.txt
file contents

c:\temp\with these chars ΣΑΠΦΩ.txt
file contents
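The same idea can be sketched in a form that does not depend on the current directory. This is only an illustration, not part of the original answer: the helper name read_tree is hypothetical, the files are created in a throwaway directory from tempfile, and the explicit encoding='utf-8' for the file contents is an assumption chosen so the result does not depend on the locale. The point it shows is that every path stays a str from os.walk through to open.

```python
import os
import tempfile

def read_tree(folder):
    """Map each file path under folder (a str) to its text contents.

    read_tree is a hypothetical helper name used only for this sketch.
    """
    contents = {}
    # A str argument makes os.walk yield str names, never bytes.
    for root, dirs, files in os.walk(folder):
        for filename in files:
            fullname = os.path.join(root, filename)
            # Assumption: the files hold UTF-8 text.
            with open(fullname, 'r', encoding='utf-8') as f:
                contents[fullname] = f.read()
    return contents

names = (
    'simple.txt',
    'with a \xae symbol.txt',
    'with these chars \u03a3\u0391\u03a0\u03a6\u03a9.txt',
)

with tempfile.TemporaryDirectory() as folder:
    for name in names:
        with open(os.path.join(folder, name), 'w', encoding='utf-8') as f:
            f.write('file contents\n')
    result = read_tree(folder)

for fullname in sorted(result):
    # ascii() keeps the output printable on any terminal encoding.
    print(ascii(os.path.basename(fullname)), '->', result[fullname].strip())
```

Joining with root rather than the top-level folder means the same loop keeps working if the tree contains subfolders.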

If you use a terminal whose encoding cannot represent the characters in the filename, you will get UnicodeEncodeError. Change:

print(fullname)

to:

print(ascii(fullname))

and you will see that the filename was read correctly; the terminal encoding just couldn't represent one or more of its symbols:

'C:\\temp\\simple.txt'
file contents

'C:\\temp\\with a \xae symbol.txt'
file contents

'C:\\temp\\with these chars \u03a3\u0391\u03a0\u03a6\u03a9.txt'
file contents
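If the quotes and doubled backslashes that ascii() adds are unwanted, one possible alternative is to escape only the characters the terminal cannot encode. This is a minimal sketch under assumptions: the helper name printable is hypothetical, and it falls back to 'ascii' when sys.stdout.encoding is unavailable. It relies only on the standard backslashreplace error handler of str.encode.

```python
import sys

def printable(text, encoding=None):
    """Return text with characters the target encoding cannot represent
    replaced by backslash escapes; everything else is left unchanged.

    printable is a hypothetical helper name used only for this sketch.
    """
    # Assumption: fall back to ASCII if stdout reports no encoding.
    encoding = encoding or sys.stdout.encoding or 'ascii'
    # Encode with escapes for unencodable characters, then decode back to str.
    return text.encode(encoding, errors='backslashreplace').decode(encoding)

names = [
    'simple.txt',
    'with a \xae symbol.txt',
    'with these chars \u03a3\u0391\u03a0\u03a6\u03a9.txt',
]

for name in names:
    # Safe on any terminal: the result only contains encodable characters.
    print(printable(name))
```

On a terminal that supports the characters, the names print as they are; on one that does not, only the unsupported characters appear as escapes such as \u03a3, and each filename remains uniquely identifiable.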
