
Python 2.7 - Intersect Unicode Dictionary with Unicode List

I'm trying to use sets and the intersection method to find which elements in a unicode list of file paths contain specific characters. The goal is to replace these characters with other characters, so I've made a dictionary where each key is the character to be replaced and each value is its replacement. When I try to generate the intersection of the paths with the characters to be replaced, however, the result is an empty set. What am I doing wrong? I have this working with for loops, but I'd like to make it as efficient as possible. Feedback is appreciated!

Code:

# -*- coding: utf-8 -*-

import os

def GetFilepaths(directory):
    """
    This function will generate all file names in a directory tree using os.walk.
    It returns a list of file paths.
    """
    file_paths = []
    for root, directories, files in os.walk(directory):
        for filename in files:
            filepath = os.path.join(root, filename)
            file_paths.append(filepath)
    return file_paths

# dictionary of umlauts (key) and their replacements (value)
umlautDictionary = {u'Ä': 'Ae',
                    u'Ö': 'Oe',
                    u'Ü': 'Ue',
                    u'ä': 'ae',
                    u'ö': 'oe',
                    u'ü': 'ue'
                    }

# get file paths in root directory and subfolders
filePathsList = GetFilepaths(u'C:\\Scripts\\Replace Characters\\Umlauts')
print set(filePathsList).intersection(umlautDictionary)

filePathsList is a list of strings:

[u'file1Ä.txt', u'file2Ä.txt', ...]

umlautDictionary is being used as a sequence of keys:

{u'Ä':..., ...}

The intersection is empty because the string u'Ä' doesn't appear in your list of strings. You are comparing u'Ä' to u'file1Ä.txt', which are not equal. Set intersection won't check for substrings.
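
To make the difference concrete, here is a minimal sketch contrasting whole-string intersection with a character-level check. The sample paths and the names umlaut_map, paths, whole_match and paths_with_umlauts are made up for illustration; the characters are written as escapes so the snippet does not depend on the source file's encoding.

```python
# Sketch only: hypothetical sample data, runs on Python 2.7 and Python 3.
umlaut_map = {u'\xc4': u'Ae', u'\xf6': u'oe'}   # A-umlaut, o-umlaut
paths = [u'file1\xc4.txt', u'plain.txt', u'sch\xf6n.txt']

# Whole-element comparison: no path is equal to a single umlaut,
# so the intersection of the two sets is empty.
whole_match = set(paths).intersection(umlaut_map)

# Character-level comparison: iterating a string yields its characters,
# so intersecting the key set with each path finds the umlauts it contains.
umlauts = set(umlaut_map)
paths_with_umlauts = [p for p in paths if umlauts.intersection(p)]

print(len(whole_match))         # 0
print(len(paths_with_umlauts))  # 2
```

This still loops over the paths, but it narrows the list to the ones that actually need a replacement.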

Since you want to replace these characters in each filename, I would suggest the following approach:

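umlautDictionary = {u'\xc4': u'Ae'}
filePathsList = [u'file1Ä.txt', u'file2Ä.txt']

# apply every replacement to each path, so the result has one entry per path
words = []
for w in filePathsList:
    for key, value in umlautDictionary.iteritems():
        w = w.replace(key, value)
    words.append(w)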

Output:

[u'file1Ae.txt', u'file2Ae.txt']

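Writing the characters as escapes, u'\xc4' for u'Ä' and so on, is the safe option because it does not depend on how the source file is encoded. Note the single backslash: u'\\xc4' would be the four literal characters \xc4. With the coding declaration from the question and the file actually saved as UTF-8, the literal u'Ä' works as well.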
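
If efficiency is the main concern, one alternative to chained replace() calls is the built-in translate() method of unicode strings, which handles all mapped characters in a single pass. The following is a sketch under assumptions, not part of the original answer: the names umlaut_map, translation_table, replace_umlauts and the sample paths are hypothetical.

```python
# Sketch only: works on unicode strings (Python 2.7) and on str (Python 3).
umlaut_map = {
    u'\xc4': u'Ae', u'\xd6': u'Oe', u'\xdc': u'Ue',   # upper-case A, O, U umlaut
    u'\xe4': u'ae', u'\xf6': u'oe', u'\xfc': u'ue',   # lower-case a, o, u umlaut
}

# translate() expects integer code points as keys, so convert with ord().
translation_table = dict((ord(key), value) for key, value in umlaut_map.items())

def replace_umlauts(path):
    """Return path with every mapped character replaced in one pass."""
    return path.translate(translation_table)

# Hypothetical file names; characters without a mapping are left unchanged.
sample_paths = [u'file1\xc4.txt', u'Gr\xfc\xdfe \xd6l.txt']
cleaned = [replace_umlauts(p) for p in sample_paths]
```

Note that this rewrites the whole path string, directory names included; if only the file name should change, apply it to os.path.basename(path) instead.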


 