
Having trouble reading and writing unicode/non-ascii characters to file in python

I have a directory structure that contains many directories with non-ASCII characters, mostly Sanskrit. I am working on indexing these directories/files in a script, but can't figure out how best to handle these names. This is my process:

  • Hash all files recursively, writing the path, filename, and hash of each to a .tsv file.
  • Go through this file, sorting each line by whether a duplicate of the hash exists. This results in a dictionary of the form {'path': columns[0], 'filename': columns[1], 'status': True}, where status determines whether an action is later taken on the file.
  • Go through this dictionary, moving duplicates out of their original location and into an offset-root path (./duplicates rather than ./, for instance).
  • For each move, write a command to a file that will reverse the move if necessary (just mv a b); this isn't important, but I thought I'd include it.
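The first two steps above can be condensed into a short sketch (the names `file_md5` and `find_duplicates` are illustrative, not from the actual script):

```python
import hashlib
import os

def file_md5(path):
    # hash the file contents in binary mode
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

def find_duplicates(root):
    # group full paths by digest; any group larger than one is a duplicate set
    by_hash = {}
    for dirpath, dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            by_hash.setdefault(file_md5(full), []).append(full)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```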

Below is some sample data and what I've written so far:

Sample generated TSV (path/name/hash):

./Personal Research/Ramnad 9"14"10  DSC_0004.JPG    850cd9dcb0075febd4c0dcd549dd7860        
./Personal Research/Ramnad 9"14"10  DSC_0010.JPG    9db2219fc4c9423016fb9e295452f1ad        
./Personal Research/Ramnad 9"14"10  DSC_0006.JPG    ef7d13b88bbaabc029390bcef1319bb1            

The " is actually Unicode:

Block: Private Use Area
Unicode: U+F019
UTF-8: 0xEF 0x80 0x99
JavaScript: 0xF019
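Those numbers can be checked directly in Python; U+F019 round-trips through exactly the UTF-8 bytes listed above:

```python
ch = '\uf019'

# the codepoint and its UTF-8 encoding match the description above
assert ord(ch) == 0xF019 == 61465
assert ch.encode('utf-8') == b'\xef\x80\x99'
assert b'\xef\x80\x99'.decode('utf-8') == ch
```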

Code: writing the above to file (fulltsv):

import csv
import hashlib
import os
import re

for root, dirs, files in os.walk(SRC_DIR, topdown=True):
    files[:] = [f for f in files
                if any(ext in f for ext in EXT_LIST) and not f.startswith('.')]
    for file in files:
        out_tsv = SAVE_DIR + re.sub(r'\W+', '', os.path.basename(root).lower()) + '.tsv'
        with open(out_tsv, 'a', newline='') as fout:
            writer = csv.writer(fout, delimiter='\t', quotechar='"',
                                quoting=csv.QUOTE_MINIMAL)
            # open once, in binary mode, for hashing (the extra text-mode open was redundant)
            with open(os.path.join(root, file), 'rb') as _file:
                writer.writerow([root, file, hashlib.md5(_file.read()).hexdigest()])

Reading from that file:

#       generate list of all tsv
for (dir, subs, files) in os.walk(ROOT):
    #   remove the new-root from the search (mutate in place so os.walk honors it)
    subs[:] = [s for s in subs if NROOT not in s]
    for f in files:
        fpath = os.path.join(dir, f)
        if fpath.endswith('.tsv'):
            TSVLIST.append(fpath)

#       open/append all TSV content to a single new TSV
with open(FULL, 'w') as wfd:
    for f in TSVLIST:
        with open(f, 'r') as fd:
            content = fd.read()
            wfd.write(content)
            lines = content.count('\n')   # count lines from the content, not the filename string

#   add all entries to a dictionary keyed to their hash
entrydict = {}

ec = 0
with open(FULL, 'r') as fulltsv:
    for line in fulltsv:
        columns = line.strip().split('\t')
        if not columns[2].startswith('.'):
            if columns[2] not in entrydict.keys():
                entrydict[str(columns[2])] = []

            entrydict[str(columns[2])].append({'path': columns[0], 'filename': columns[1], 'status': True})
            if len(entrydict[str(columns[2])]) > 1:
                ec += 1

ed = {k:v for k,v in entrydict.items() if len(v)>=2}
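Since the TSVs were produced by csv.writer, splitting each line on '\t' by hand will not undo any quoting the writer added around fields that contain the quote character. A sketch that reads the rows back with a csv.reader configured to mirror the writer's dialect (`load_entries` is an illustrative helper name):

```python
import csv

def load_entries(full_tsv):
    # mirror the writer's dialect so quoted fields are unescaped on the way back in
    entrydict = {}
    with open(full_tsv, 'r', newline='') as fulltsv:
        reader = csv.reader(fulltsv, delimiter='\t', quotechar='"')
        for path, filename, checksum in reader:
            entrydict.setdefault(checksum, []).append(
                {'path': path, 'filename': filename, 'status': True})
    return entrydict
```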

Moving duplicates:

for e in f:
    if len(f) - mvcnt > 1:
        if e['status'] is True:
            p = e['path']       #   path
            n = e['filename']   #   name
            n0, n0ext = os.path.splitext(n)
            n1 = n

            #   directory structure for new file
            FROOT = p.replace(p.split('/')[0], NROOT, 1)

            rebk = 'mv {0}/{1} {2}/{3}'.format(FROOT, n, p, n)
            shutil.move('{0}/{1}'.format(p, n), '{0}/{1}'.format(FROOT, n))
            dupelist.write('{0} #{1}\n'.format(rebk, str(h)))
            mvcnt += 1
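The FROOT line maps the first path segment onto the new root via a substring replace; for a './'-relative path it behaves like this (the values are illustrative, matching the traceback below):

```python
NROOT = 'duplicateRoot'
p = './Personal Research/Ramnad 9\uf01914\uf01910'

# replace the first path segment ('.') with the new root
FROOT = p.replace(p.split('/')[0], NROOT, 1)
print(FROOT)  # duplicateRoot/Personal Research/Ramnad 9\uf01914\uf01910
```

Note this is a substring replace: it works here only because the first segment '.' is also the first character of the path; `os.path.join(NROOT, os.path.relpath(p))` would be a more robust way to build the offset path.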

Errors I'm getting:

Traceback (most recent call last):
  File "/usr/lib/python3.6/shutil.py", line 550, in move
    os.rename(src, real_dst)
FileNotFoundError: [Errno 2] No such file or directory: '"./Personal Research/Ramnad 9""14""10"/DSC_0003.NEF' -> './duplicateRoot/Personal Research/Ramnad 9""14""10"/DSC_0003.NEF'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "dCompare.py", line 164, in <module>
    shutil.move('{0}/{1}'.format(p,n),'{0}/{1}'.format(FROOT,n))
  File "/usr/lib/python3.6/shutil.py", line 564, in move
    copy_function(src, real_dst)
  File "/usr/lib/python3.6/shutil.py", line 263, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/usr/lib/python3.6/shutil.py", line 120, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '"./Personal Research/Ramnad 9""14""10"/DSC_0003.NEF'

Obviously this has to do with how I'm handling Unicode characters, but I've never worked with this before and am not sure at which point, or how, I should be handling the filenames. Working on Ubuntu 10 under Windows Subsystem for Linux, Python 3.

The one problem I see when I read over the stack trace is that the Unicode characters are wrong (they're not there), given OP's sample TSV:

FileNotFoundError: [Errno 2] No such file or directory: '"./Personal Research/Ramnad 9""14""10"/DSC_0003.NEF' -> './duplicateRoot/Personal Research/Ramnad 9""14""10"/DSC_0003.NEF'

There's some quote escaping in the source and destination paths that I believe shouldn't be there; the extra and doubled double quotes make it look like the path was broken up and concatenated again (or something):

'"./Personal Research/Ramnad 9""14""10"/DSC_0003.NEF'
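That pattern is exactly what csv.writer produces under QUOTE_MINIMAL when a field contains the quote character: the whole field is wrapped in quotes and embedded quotes are doubled. If the on-disk names actually contain an ASCII " (rather than, or in addition to, U+F019), and the TSV is later split on tabs without a matching csv.reader, that escaping survives into the path. A quick demonstration:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, delimiter='\t', quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerow(['./Personal Research/Ramnad 9"14"10', 'DSC_0003.NEF', 'hash'])

# the path field is quoted and its internal quotes doubled, just like the traceback
print(buf.getvalue().split('\t')[0])
# "./Personal Research/Ramnad 9""14""10"
```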

I attempted to recreate OP's error, but couldn't. When I was working through the sample below, I originally got a FileNotFoundError (because I was missing the destination folders, hence the os.makedirs() in my sample), but the path was correctly encoded:

FileNotFoundError: [Errno 2] No such file or directory: 'foo/Personal Research/Ramnad 9\uf01914\uf01910/DSC_0006.JPG'

All I can offer is speculation that the encoding is messed up either in the TSV file or in entrydict. OP, have you inspected that file or dict in the interpreter and verified you're seeing \uf019 in the paths where you expect? Maybe something like the following, to make sure those codepoints are present:

>>> print(path.encode('unicode_escape'))
b'./Personal Research/Ramnad 9\\uf01914\\uf01910'
>>> # or, look for 61465
>>> [ord(char) for char in path]
[46, 47, 80, 101, 114, 115, 111, 110, 97, 108, 32, 82, 101, 
115, 101, 97, 114, 99, 104, 47, 82, 97, 109, 110, 97, 100, 
32, 57, 61465, 49, 52, 61465, 49, 48]

Here is my attempt; it might help...

I created a sample TSV file and the corresponding directory structure:

>>> p='./Personal Research/Ramnad 9\uf01914\uf01910'
>>> os.makedirs(p)
>>> checksums=[[p, 'DSC_0006.JPG', 'hash']]
>>> with open('full.tsv', 'a') as fout:
    writer = csv.writer(fout, delimiter='\t', quotechar='\"', quoting=csv.QUOTE_MINIMAL)
    writer.writerows(checksums)

and touched the file in the shell:

$ touch Personal\ Research/Ramnad\ 91410/DSC_0006.JPG

Inspected full.tsv to make sure it was correctly written:

$ cat full.tsv
./Personal Research/Ramnad 91410  DSC_0006.JPG    hash

The empty blocks are the properly UTF-8-encoded codepoint, based on the Unicode description of " that OP included.

Ran hexdump -C full.tsv to verify the UTF-8 encoding (look for two instances of ef 80 99):

00000010  72 63 68 2f 52 61 6d 6e  61 64 20 39 ef 80 99 31  |rch/Ramnad 9...1|
00000020  34 ef 80 99 31 30 09 44  53 43 5f 30 30 30 36 2e  |4...10.DSC_0006.|
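The same byte-level check can be scripted instead of using hexdump, by counting the raw UTF-8 sequence in the file (`count_codepoint` is just a helper name for this sketch):

```python
def count_codepoint(path, cp='\uf019'):
    # count occurrences of the codepoint's UTF-8 bytes (ef 80 99) in the raw file
    with open(path, 'rb') as f:
        return f.read().count(cp.encode('utf-8'))
```

For the sample row above, `count_codepoint('full.tsv')` should return 2.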

I then ran:

>>> entrydict = {}

>>> ec = 0
>>> with open('full.tsv', 'r') as fulltsv:
    for line in fulltsv:
        columns = line.strip().split('\t')
        if not columns[2].startswith('.'):
            if columns[2] not in entrydict.keys():
                entrydict[str(columns[2])] = []

            entrydict[str(columns[2])].append({'path': columns[0], 'filename': columns[1], 'status': True})
            if len(entrydict[str(columns[2])]) > 1:
                ec += 1

>>> entrydict
{'hash': [{'path': './Personal Research/Ramnad 9\uf01914\uf01910', 'filename': 'DSC_0006.JPG', 'status': True}]}

And finally:

>>> e = entrydict['hash'][0]
>>> e
{'path': './Personal Research/Ramnad 9\uf01914\uf01910', 'filename': 'DSC_0006.JPG', 'status': True}
>>> NROOT='foo'
>>> if e['status'] is True:
    p = e['path']    #   path
    n = e['filename']   #   name
    n0,n0ext = os.path.splitext(n)
    n1 = n

    #   directory structure for new file
    FROOT = p.replace(p.split('/')[0],NROOT,1)


    rebk = 'mv {0}/{1} {2}/{3}'.format(FROOT,n,p,n)
    print(rebk)
    src='{0}/{1}'.format(p,n)
    dst='{0}/{1}'.format(FROOT,n)
    os.makedirs(FROOT)
    shutil.move(src,dst)

and it worked. Bummer.
