
Build Dictionary and List from Numpy Ndarray

I want to create a dictionary from a 2D ndarray with a few million rows of data.

I am looking for a Pythonic and performant way to achieve this.

My ndarray:

format: [id, origin_lat, origin_lon, dest_lat, dest_lon, distance]

my_array = np.array([[245, 32.45, 63.89, 72.1, 63.57, 123.45],
                     [246, 61.73, 42.71, 75.54, -81.69, 16.32]])

Expected Output:

my_dict = {
        245: {
            'origin_lat_lon': {
                'lat': 32.45,
                'lon': 63.89
            },
            'dest_lat_lon': {
                'lat': 72.1,
                'lon': 63.57
            },
            'distance': 123.45
        },
        246: {
            'origin_lat_lon': {
                'lat': 61.73,
                'lon': 42.71
            },
            'dest_lat_lon': {
                'lat': 75.54,
                'lon': -81.69
            },
            'distance': 16.32
        }
    }

my_list = [{'lat': 32.45, 'lon': 63.89},
           {'lat': 72.1, 'lon': 63.57},
           {'lat': 61.73, 'lon': 42.71},
           {'lat': 75.54, 'lon': -81.69}]

My code:

my_dict = dict()
my_list = list()

for arr in my_array:
    origin_lat_lon = {'lat': arr[1],
                      'lon': arr[2]}
    dest_lat_lon = {'lat': arr[3],
                    'lon': arr[4]}
    value = {'origin_lat_lon': origin_lat_lon,
             'dest_lat_lon': dest_lat_lon,
             'distance': arr[5]}
    my_dict[int(arr[0])] = value
    my_list.append(origin_lat_lon)
    my_list.append(dest_lat_lon)

This is one approach, using dict with zip and slicing.

Ex:

import numpy as np

my_array = np.array([[245, 32.45,63.89,72.1,63.57,123.45],[246, 61.73,42.71,75.54,-81.69,16.32]])
keys = ['origin_lat', 'origin_lon', 'dest_lat','dest_lon', 'distance']
keys_2 = ['lat', 'lon']

my_dict = {}
my_list = []

for arr in my_array:
    key, vals = arr[0], arr[1:]
    my_dict[int(key)] = dict(zip(keys, vals))
    my_list.extend([[dict(zip(keys_2, vals[0:2]))],[dict(zip(keys_2, vals[2:4]))]])

print(my_dict)
print(my_list)

Output:

{245: {'dest_lat': 72.1,
       'dest_lon': 63.57,
       'distance': 123.45,
       'origin_lat': 32.45,
       'origin_lon': 63.89},
 246: {'dest_lat': 75.54,
       'dest_lon': -81.69,
       'distance': 16.32,
       'origin_lat': 61.73,
       'origin_lon': 42.71}}
[[{'lat': 32.45, 'lon': 63.89}],
 [{'lat': 72.1, 'lon': 63.57}],
 [{'lat': 61.73, 'lon': 42.71}],
 [{'lat': 75.54, 'lon': -81.69}]]
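Note that this output is a bit flatter than the structure asked for in the question (no nested 'origin_lat_lon'/'dest_lat_lon' dicts, and my_list is a list of single-element lists). A sketch of how the same zip/slicing idea could be adapted to produce exactly the expected structure (this variant is an addition, not part of the original answer):

import numpy as np

my_array = np.array([[245, 32.45, 63.89, 72.1, 63.57, 123.45],
                     [246, 61.73, 42.71, 75.54, -81.69, 16.32]])
keys_2 = ['lat', 'lon']

my_dict = {}
my_list = []

for arr in my_array.tolist():              # tolist() gives plain Python floats
    key, vals = int(arr[0]), arr[1:]
    origin = dict(zip(keys_2, vals[0:2]))  # {'lat': ..., 'lon': ...}
    dest = dict(zip(keys_2, vals[2:4]))
    my_dict[key] = {'origin_lat_lon': origin,
                    'dest_lat_lon': dest,
                    'distance': vals[4]}
    my_list.extend([origin, dest])         # flat list of dicts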

Your code wrapped in a function, times:

In [220]: timeit foo(my_array)                                                  
7.14 µs ± 17.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
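(foo is not shown in the post; it is assumed here to be simply the question's loop wrapped in a function, roughly:)

def foo(my_array):
    # the question's code, wrapped so it can be timed and reused
    my_dict = {}
    my_list = []
    for arr in my_array:
        origin_lat_lon = {'lat': arr[1], 'lon': arr[2]}
        dest_lat_lon = {'lat': arr[3], 'lon': arr[4]}
        my_dict[int(arr[0])] = {'origin_lat_lon': origin_lat_lon,
                                'dest_lat_lon': dest_lat_lon,
                                'distance': arr[5]}
        my_list.append(origin_lat_lon)
        my_list.append(dest_lat_lon)
    return my_dict, my_list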

Converting the array to a list cuts the time in half. tolist() is a (relatively) fast method for converting an array to a nested list, and iterating on a list is faster than iterating on an array:

In [221]: timeit foo(my_array.tolist())                                         
2.68 µs ± 14.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
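For reference, tolist() returns nested Python lists of Python floats (which is also why the id needs the int(arr[0]) cast):

>>> my_array.tolist()
[[245.0, 32.45, 63.89, 72.1, 63.57, 123.45],
 [246.0, 61.73, 42.71, 75.54, -81.69, 16.32]]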

Rakesh's version is somewhat slower (I haven't identified why):

In [222]: timeit rakesh(my_array)                                               
18.5 µs ± 63.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [223]: timeit rakesh(my_array.tolist())                                      
9.49 µs ± 26.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Chris's pandas version is quite a bit slower. pandas does have a nice interface to/from dictionaries, but apparently it isn't fast. It is probably pure Python, and loses speed by being general purpose.

In [224]: timeit foo_pd(my_array)                                               
3.35 ms ± 5.69 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
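(foo_pd is also not shown; a pandas version might look something like the sketch below, built on DataFrame.to_dict, though the original answer may differ:)

import pandas as pd

def foo_pd(my_array):
    # a guess at the pandas approach: DataFrame indexed by id, then to_dict
    cols = ['id', 'origin_lat', 'origin_lon', 'dest_lat', 'dest_lon', 'distance']
    df = pd.DataFrame(my_array, columns=cols)
    df['id'] = df['id'].astype(int)
    # returns {id: {column: value, ...}, ...}
    return df.set_index('id').to_dict('index')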

Python dictionaries are efficient for what they do, but they still have to be accessed key by key. numpy does not have its own compiled code for working with dictionaries.

===

Your array could be cast as a structured array. With that, columns are replaced by fields, which are accessed by name. So it's more dictionary-like, though probably not any better for creating a json output. (And it's not a speed tool.)

In [225]: dt = np.dtype([('id', int), ('origin_lat', float), ('origin_lon', float),
     ...:                ('dest_lat', float), ('dest_lon', float), ('distance', float)])
In [226]: import numpy.lib.recfunctions as rf

In [228]: sarr = rf.unstructured_to_structured(my_array, dt)
In [229]: sarr                                                                  
Out[229]: 
array([(245, 32.45, 63.89, 72.1 ,  63.57, 123.45),
       (246, 61.73, 42.71, 75.54, -81.69,  16.32)],
      dtype=[('id', '<i8'), ('origin_lat', '<f8'), ('origin_lon', '<f8'), ('dest_lat', '<f8'), ('dest_lon', '<f8'), ('distance', '<f8')])

In [230]: sarr['dest_lon']                                                      
Out[230]: array([ 63.57, -81.69])

In [236]: timeit sarr = rf.unstructured_to_structured(my_array, dt)
46.3 µs ± 1.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
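If you did want the question's nested dictionary from the structured array, you could still build it by iterating rows and accessing fields by name; a sketch (untimed, with int()/float() casts just to get plain Python scalars suitable for json):

my_dict = {}
for row in sarr:
    my_dict[int(row['id'])] = {
        'origin_lat_lon': {'lat': float(row['origin_lat']),
                           'lon': float(row['origin_lon'])},
        'dest_lat_lon': {'lat': float(row['dest_lat']),
                         'lon': float(row['dest_lon'])},
        'distance': float(row['distance']),
    }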
