
Python: Efficient concatenation of numpy arrays in dictionaries

(Using Python 2.7)

Note: This is very similar to Concatenating dictionaries of numpy arrays of different lengths (avoiding manual loops if possible). However, I have a slightly different use case:

To summarize the problem:

I have a numpy array of dictionaries that are all structurally the same (meaning they all have the same keys), each containing numpy arrays of variable length (including empty). Nested data structures FTW!

What I want is one "merged" dictionary where, for every key, all the numpy arrays are concatenated.

For example:

source = [{"a":numpy.array([1,2,3]),"b":numpy.array(['a','b','c'])},{"a":numpy.array([4,5]),"b":numpy.array(['d','e','f','g','h'])}]
# Perform magic here into result
result = {"a":numpy.array([1,2,3,4,5]),"b":numpy.array(['a','b','c','d','e','f','g','h'])}

I could just iterate through every dictionary and use numpy.append(), but I figured that since this is Python and NumPy, there should be a more elegant solution, perhaps using some kind of slicing?
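For reference, the loop-and-append approach described above can be sketched as follows (a minimal sketch, not the poster's actual code; note that numpy.append copies the whole array on every call, which is why it scales poorly):

```python
import numpy

source = [{"a": numpy.array([1, 2, 3]), "b": numpy.array(['a', 'b', 'c'])},
          {"a": numpy.array([4, 5]), "b": numpy.array(['d', 'e', 'f', 'g', 'h'])}]

# Grow each key's array one dictionary at a time.
result = {}
for d in source:
    for key, value in d.items():
        if key in result:
            result[key] = numpy.append(result[key], value)
        else:
            result[key] = value.copy()
```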

Differences from the similar question linked above:

It seems that in the linked question there are only a few dictionaries, and the keys are semantically linked: for example, in dataset 0 the key "a" is 1, the key "b" is 'a', the key "c" is NaN, and so on. In my case, however, there is no connection between "a", "b", and the rest. In fact, most of the resulting Pandas table would consist of NaN entries: the concatenated "a" might have ten thousand entries, while the concatenated "b" could be empty in an extreme case. Also, I might have hundreds of dictionaries that I want to "concatenate". Finally, the linked question has keys that are present in one dictionary but not in another; this is impossible in my case.

Given these circumstances, I'm wondering if the Pandas DataFrame approach is still the best way to go, considering I'd need to create a DataFrame for every dictionary, and the end result would be a DataFrame with lots of NaNs.

Thanks!

If it is possible to convert the NumPy arrays to Python lists, then you can use collections.Counter:

In [15]: from collections import Counter                                                         

In [16]: source_ = [{"a":[1,2,3],"b":['a','b','c']}, 
                    {"a": [4,5], "b":['d','e','f','g','h']}]

In [17]: sum((Counter(x) for x in source_), Counter())                                           
Out[17]: Counter({'b': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
                  'a': [1, 2, 3, 4, 5]})
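One caveat worth noting: summing Counters relies on comparing the merged values against 0, which works for lists on Python 2.7 (as used in the question) but raises a TypeError on Python 3. A version-independent sketch of the same list-merging idea uses collections.defaultdict instead, converting back to numpy arrays at the end:

```python
import numpy
from collections import defaultdict

source_ = [{"a": [1, 2, 3], "b": ['a', 'b', 'c']},
           {"a": [4, 5], "b": ['d', 'e', 'f', 'g', 'h']}]

# Merge the per-key lists, then turn each merged list into a numpy array.
merged = defaultdict(list)
for d in source_:
    for key, values in d.items():
        merged[key].extend(values)

result = {key: numpy.array(values) for key, values in merged.items()}
```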

I would iterate over the keys instead of iterating over the dictionaries. This lets you use numpy.concatenate, which is more appropriate for this case than numpy.append, and I think it's easier to read. I would be surprised if there were a NumPy built-in for this, and even if there were, I don't think it would do much for readability or performance.

import numpy

source = [{"a": numpy.array([1,2,3]), "b": numpy.array(['a','b','c'])},
          {"a": numpy.array([4,5]), "b": numpy.array(['d','e','f','g','h'])}]
result = {}
for key in source[0]:
    result[key] = numpy.concatenate([d[key] for d in source])
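The same loop can be written as a single dict comprehension, assuming (as the question states) that every dictionary has the same keys:

```python
import numpy

source = [{"a": numpy.array([1, 2, 3]), "b": numpy.array(['a', 'b', 'c'])},
          {"a": numpy.array([4, 5]), "b": numpy.array(['d', 'e', 'f', 'g', 'h'])}]

# One concatenate call per key, driven by the keys of the first dictionary.
result = {key: numpy.concatenate([d[key] for d in source]) for key in source[0]}
```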
