
How to efficiently flatten a dictionary of SortedDicts into numpy arrays

I was wondering whether there is a more efficient way to flatten my data. Below is an example of the data structure being flattened:

 {t: SortedDict(
    {0: {'t': 5, 'ids': [{'1': ['data']}]}, 
    1: {'t': 2, 'ids': [{'1': ['data']}]}, 
    2: {'t': 4, 'ids': [{'1': ['data']}]}, 
    3: {'t': 1, 'ids': [{'1': ['data']}]}, 
    4: {'t': 4, 'ids': [{'1': ['data']}]}, 
    5: {'t': 1, 'ids': [{'1': ['data']}]}, 
    6: {'t': 3, 'ids': [{'1': ['data']}]}, 
    7: {'t': 2, 'ids': [{'1': ['data']}]}, 
    8: {'t': 1, 'ids': [{'1': ['data']}]}, 
    9: {'t': 1, 'ids': [{'1': ['data']}]}
    }),t:SortedDict(
    {
    27: {'t': 1, 'ids': [{'5': ['data','data']}]}, 
    28: {'t': 1, 'ids': [{'5': ['data','data','data','data']}]}, 
    29: {'t': 2, 'ids': [{'5': ['data','data']}]}, 
    30: {'t': 1, 'ids': [{'5': ['data']}]}, 
    31: {'t': 2, 'ids': [{'5': ['data','data','data','data']}]}, 
    32: {'t': 1, 'ids': [{'5': ['data']}]}
    })}

Note: SortedDict comes from the Sorted Containers library, an Apache2-licensed Python sorted collections library.
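For anyone unfamiliar with it: a SortedDict keeps its keys in sorted order regardless of insertion order. A minimal illustration (my example, not from the original post):

    from sortedcontainers import SortedDict

    sd = SortedDict()
    sd[3] = {'t': 1, 'ids': [{'1': ['data']}]}
    sd[0] = {'t': 5, 'ids': [{'1': ['data']}]}
    print(list(sd.keys()))  # [0, 3] -- iteration always follows key order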

I have looked at several other Stack Overflow posts that do something similar with a list comprehension or a lambda function. Ultimately, I wrote a method that flattens the dictionary into three lists; however, I'm not sure whether this approach is optimal. The method is as follows:

import numpy

def flatten(self, d, calculation_dict):
    l_key     = [] # Stores linearized keys
    l_results = [] # Stores linearized values after calculation
    index     = [] # Stores the start of each individual sub-array
    i = 0
    for val in d.values():
        index.append(i)
        for key, t in val.items():
            # Do the calculation in here, since we are already
            # iterating over every element
            l_results.append(t["t"] * calculation_dict[key])
            l_key.append(key)
            i += 1
    index.append(i) # End marker: the total number of elements
    l_key     = numpy.array(l_key, dtype=numpy.int32)
    l_results = numpy.array(l_results, dtype=numpy.float64)
    index     = numpy.array(index, dtype=numpy.int32)
    return (l_key, l_results, index)
    
    # Need output to be numpy arrays; for the data above the contents would be:
    l_key     = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 27, 28, 29, 30, 31, 32]
    l_results = [5.0, 2.0, 4.0, 1.0, 4.0, 1.0, 3.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 2.0, 1.0]
    index     = [0, 10, 16]  # sub-array starts, plus the total length as an end marker
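For illustration, here is how the method could be called on a reduced version of the data above (the outer keys 'a' and 'b' and the all-ones calculation_dict are hypothetical placeholders, and None stands in for self):

    from sortedcontainers import SortedDict

    d = {
        'a': SortedDict({0: {'t': 5, 'ids': [{'1': ['data']}]},
                         1: {'t': 2, 'ids': [{'1': ['data']}]}}),
        'b': SortedDict({27: {'t': 1, 'ids': [{'5': ['data', 'data']}]}}),
    }
    calculation_dict = {0: 1.0, 1: 1.0, 27: 1.0}

    l_key, l_results, index = flatten(None, d, calculation_dict)
    print(l_key)      # [ 0  1 27]
    print(l_results)  # [5. 2. 1.]
    print(index)      # [0 2 3]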

In my application, speed is extremely important, so any feedback or suggestions would be greatly appreciated.

Edit: Forgot to mention that I need my final results in a numpy array. Not sure if that changes anything.

Edit: Thanks to Glauco's suggestion, I modified the flatten method as follows:

def flatten_numpy(self, d, calculation_dict):
    l_results = numpy.empty(self.size, dtype=numpy.float64)
    l_key     = numpy.empty(self.size, dtype=numpy.int32)
    index     = []
    i = 0
    for val in d.values():
        index.append(i)
        for key, t in val.items():
            l_results[i] = t["t"] * calculation_dict[key]
            l_key[i] = key
            i += 1
    index.append(i)
    index = numpy.array(index, dtype=numpy.int32)
    return (l_key, l_results, index)

It turns out that earlier in the algorithm I already have to access the size of each sub-dictionary. Taking advantage of this, I now accumulate that total into a size variable, and after testing, the new approach is slightly faster.
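The accumulation amounts to something like this (a sketch of the idea behind self.size; the real accumulation happens earlier, while the sub-dictionaries are first processed):

    # Hypothetical accumulation: total element count across all sub-dictionaries
    self.size = 0
    for sub in d.values():
        self.size += len(sub)  # len() of a SortedDict is O(1)

Test results are below: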

#Each test was executed on different data and ran 1000 times
Test#1 | Flatten        6.422301292419434   | Flatten_numpy     4.761376142501831
Test#2 | Flatten        5.212526082992554   | Flatten_numpy     4.901215553283691
Test#3 | Flatten        5.2060017585754395  | Flatten_numpy     5.266955852508545
Test#4 | Flatten        6.079436302185059   | Flatten_numpy     4.803238153457642
Test#5 | Flatten        5.059106349945068   | Flatten_numpy     4.565468788146973
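(The measurement loop itself is not shown in the post; a timeit harness along these lines, with number=1000, would produce comparable figures. obj, d, and calculation_dict are placeholders.)

    import timeit

    t_flat  = timeit.timeit(lambda: obj.flatten(d, calculation_dict), number=1000)
    t_numpy = timeit.timeit(lambda: obj.flatten_numpy(d, calculation_dict), number=1000)
    print(f"Flatten {t_flat} | Flatten_numpy {t_numpy}")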

Your approach is algorithmically right: it is O(n+m), i.e. linear, and there is no way around that. If you know in advance how many dicts will arrive from the cluster, it is more convenient to create empty numpy data structures and fill them at run time, avoiding list appends.

Finally, the t computation:

l_results.append(t["t"] * calculation_dict[key])

can be done quickly using arrays at the end of the collection phase, as in the sketch below.
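One possible reading of that suggestion, sketched as a standalone function: collect the raw t values during the loop, then apply the calculation as a single vectorized multiply. The function name, the dense factor lookup array, and the assumption that keys are small non-negative integers (as in the example data) are mine, not part of the original answer.

import numpy

def flatten_vectorized(d, calculation_dict, size):
    raw_t = numpy.empty(size, dtype=numpy.float64)
    l_key = numpy.empty(size, dtype=numpy.int32)
    index = []
    i = 0
    for val in d.values():
        index.append(i)
        for key, t in val.items():
            raw_t[i] = t["t"]  # collect the raw value only; no math yet
            l_key[i] = key
            i += 1
    index.append(i)
    # Dense lookup table so that factor[k] == calculation_dict[k] for every key k
    factor = numpy.zeros(int(l_key.max()) + 1, dtype=numpy.float64)
    for k, v in calculation_dict.items():
        factor[k] = v
    # One vectorized multiply replaces the per-element multiplication
    l_results = raw_t * factor[l_key]
    return (l_key, l_results, numpy.array(index, dtype=numpy.int32))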
