Trying to create a 2D array from python dictionary

I am trying to create a 2D array from a dictionary in Python.

mydictionary={
'a':['review','read','study'],
'b':['read'],
'c':['review','dictate']}

I want a 2D array that shows the number of matching items (i.e. compare the keys and their values, and store the match counts in a 2D array).

Output Format:

       a       b       c
  ___|___________________
  a  |  3       1       1
     |
  b  |  1       1       0
     |
  c  |  1       0       2

My dictionary has around 7000 items. What is the best way to achieve this? Thanks

A sweet way to obtain the result is to use pandas, numpy's big brother:

In [5]: import pandas as pd
In [6]: md = mydictionary
In [7]: df = pd.DataFrame([[len(set(md[i]) & set(md[j])) for j in md] for i in md], md, md)
In [8]: df
Out[8]: 
   c  a  b
c  2  1  0
a  1  3  1
b  0  1  1

If order matters:

In [9]: df.sort_index(axis=0).sort_index(axis=1)
Out[9]: 
   a  b  c
a  3  1  1
b  1  1  0
c  1  0  2
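Once the frame is built, individual counts can be read off by label, and the underlying 2D array is available via `.values`. A small usage sketch (assuming pandas is installed, using the question's dictionary):

```python
import pandas as pd

md = {'a': ['review', 'read', 'study'],
      'b': ['read'],
      'c': ['review', 'dictate']}

# build the co-occurrence frame as above, then sort rows and columns
df = pd.DataFrame([[len(set(md[i]) & set(md[j])) for j in md] for i in md],
                  index=md, columns=md).sort_index(axis=0).sort_index(axis=1)

print(df.loc['a', 'c'])   # count of items shared between 'a' and 'c' -> 1
print(df.values)          # the plain 2D numpy array behind the frame
```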

For starters, you can use the fact that the diagonal entries are just the lengths of the individual lists.

Then, since the matrix is perfectly symmetric, you only need to compute the value for (a, b), not both (a, b) and (b, a).

Past that, you can compute the size of the intersection for each pair:

len([x for x in arr2 if x in arr1])
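Putting those observations together, a minimal pure-Python sketch might look like the following (building each set once, filling only the upper triangle, then mirroring):

```python
md = {'a': ['review', 'read', 'study'],
      'b': ['read'],
      'c': ['review', 'dictate']}

keys = sorted(md)                      # fix an order for rows/columns
sets = {k: set(md[k]) for k in keys}   # build each set once
n = len(keys)
matrix = [[0] * n for _ in range(n)]

for i, ki in enumerate(keys):
    matrix[i][i] = len(sets[ki])                 # diagonal: list length
    for j in range(i + 1, n):
        common = len(sets[ki] & sets[keys[j]])   # intersection size
        matrix[i][j] = matrix[j][i] = common     # symmetric fill

print(matrix)  # [[3, 1, 1], [1, 1, 0], [1, 0, 2]]
```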

You can form the list however you like, but building the sets up front will be faster than repeatedly creating them:

from collections import OrderedDict

new = {k: set(v) for k, v in mydictionary.items()}
out = OrderedDict()
for k, v in new.items():
    out[k] = [k, len(v)]
    for k2, v2 in new.items():
        if k2 == k:
            continue
        out[k].append(sum(val in v for val in v2))


print(list(out.values()))

Output:

[['a', 3, 1, 1], ['c', 2, 1, 0], ['b', 1, 1, 0]]
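If an actual 2D numeric array is wanted rather than label-prefixed rows, the label column can be peeled off into numpy. A sketch reusing the same loop (note that in this layout each row starts with its own diagonal count, followed by the counts against the *other* keys in iteration order, so the columns are not aligned across rows):

```python
import numpy as np
from collections import OrderedDict

mydictionary = {'a': ['review', 'read', 'study'],
                'b': ['read'],
                'c': ['review', 'dictate']}

new = {k: set(v) for k, v in mydictionary.items()}
out = OrderedDict()
for k, v in new.items():
    out[k] = [k, len(v)]
    for k2, v2 in new.items():
        if k2 == k:
            continue
        out[k].append(sum(val in v for val in v2))

labels = list(out)                                 # row order
arr = np.array([row[1:] for row in out.values()])  # drop the label column
print(labels)      # ['a', 'b', 'c']
print(arr.shape)   # (3, 3)
```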

The other solutions offered here are suitable for smaller inputs, but as the list grows they will scale as O[N^2] (at best), which may be relatively slow in your case. Here's an approach using scikit-learn's DictVectorizer that should be faster for large inputs with small amounts of overlap.

The idea is to construct a one-hot encoding of the input, and then use a matrix product to compute the final result:

from itertools import repeat
from sklearn.feature_extraction import DictVectorizer

keys, vals = zip(*mydictionary.items())
valsdict = [dict(zip(val, repeat(1))) for val in vals]

V = DictVectorizer().fit_transform(valsdict)
result = V.dot(V.T)

The result will be a scipy.sparse matrix, which only explicitly stores nonzero elements. You can convert it to a dense array with result.toarray(); using pandas, you can also apply labels to the rows and columns:

import pandas as pd
pd.DataFrame(result.toarray(), keys, keys)
#    a  c  b
# a  3  1  1
# c  1  2  0
# b  1  0  1

I expect this to be significantly faster than the other solutions posted here as the size of the input grows.


Edit: here's a benchmark on a 1000-item input where about half of the pairs have some overlap:

import numpy as np
import pandas as pd
from itertools import repeat
from sklearn.feature_extraction import DictVectorizer

def dense_method(md):
    return pd.DataFrame([[len(set(md[i]) & set(md[j]))
                          for j in md]
                         for i in md], md, md)

def sparse_method(mydictionary):
    keys, vals = zip(*mydictionary.items())
    valsdict = [dict(zip(val, repeat(1))) for val in vals]
    V = DictVectorizer().fit_transform(valsdict)
    return pd.DataFrame(V.dot(V.T).toarray(), keys, keys)


mydictionary = {i: np.random.randint(0, 20, 3)
                for i in range(1000)}

print(np.allclose(dense_method(mydictionary),
                  sparse_method(mydictionary)))
# True

%timeit sparse_method(mydictionary)
# 100 loops, best of 3: 19.5 ms per loop

%timeit dense_method(mydictionary)
# 1 loops, best of 3: 3.41 s per loop

The sparse method is two orders of magnitude faster here.
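For reference, the same one-hot-plus-product trick can be done with scipy.sparse alone, without the sklearn dependency. A sketch with the vocabulary indexing done by hand (column order of the one-hot matrix does not matter, since it cancels in V·Vᵀ):

```python
import numpy as np
from scipy.sparse import csr_matrix

mydictionary = {'a': ['review', 'read', 'study'],
                'b': ['read'],
                'c': ['review', 'dictate']}

keys = sorted(mydictionary)
# assign each distinct word a column index
vocab = {w: i for i, w in enumerate({w for v in mydictionary.values() for w in v})}

# one-hot matrix: one row per key, one column per vocabulary word
rows, cols = [], []
for r, k in enumerate(keys):
    for w in set(mydictionary[k]):   # set() guards against duplicate entries
        rows.append(r)
        cols.append(vocab[w])
data = np.ones(len(rows), dtype=int)
V = csr_matrix((data, (rows, cols)), shape=(len(keys), len(vocab)))

result = V.dot(V.T).toarray()        # pairwise intersection sizes
print(result)
# [[3 1 1]
#  [1 1 0]
#  [1 0 2]]
```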

Surely not the most elegant way to perform this task, but it works:

import numpy as np

N = len(mydictionary)
freqs = np.zeros(shape=(N, N), dtype=int)   # np.int is deprecated; plain int works
mykeys = sorted(mydictionary.keys())
for i, x in enumerate(mykeys):
    freqs[i, i] = len(mydictionary[x])      # diagonal: length of each list
    for j in range(i + 1, N):
        for elem in mydictionary[x]:
            if elem in mydictionary[mykeys[j]]:
                freqs[i, j] += 1
                freqs[j, i] += 1            # symmetric fill
print(freqs)
# [[3 1 1]
#  [1 1 0]
#  [1 0 2]]
