Pythonic way to convert a dictionary to a numpy array

This is more of a question about programming style. I scrape webpages for fields such as "Temperature: 51 - 62", "Height: 1000-1500", etc. The results are saved in a dictionary:

{"temperature": "51-62", "height":"1000-1500" ...... }

All keys and values are strings. Every key can map to one of many possible values. Now I want to convert this dictionary to a numpy array/vector. I have the following concerns:

  • Each key corresponds to one index position in the array.
  • Each possible string value is mapped to one integer.
  • For some dictionaries, some keys are not available. For example, I also have a dictionary that has no "temperature" key, because that webpage doesn't contain such a field.

I am wondering what the clearest and most efficient way of writing such a conversion in Python is. I am thinking of building another dictionary that maps each key to its index in the vector, plus one dictionary per key that maps its string values to integers.

Another problem is that I am not sure about the range of values for some keys. I want to dynamically keep track of the mapping between string values and integers. For example, I may find in the future that key1 can map to a new value, val1_8.

Thanks
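
The mapping scheme described in the question (one dictionary from keys to vector positions, plus one dictionary per key from string values to integers, both growing as new data arrives) could be sketched roughly as follows. The class name `DictVectorEncoder`, its attributes, and the choice of -1 as the code for a missing key are all assumptions made for illustration; nothing here comes from the original post.

```python
import numpy as np


class DictVectorEncoder:
    """Hypothetical helper: keys -> vector positions, string values -> integer codes."""

    def __init__(self, missing=-1):
        self.key_index = {}     # key -> position in the vector
        self.value_codes = {}   # key -> {string value -> integer code}
        self.missing = missing  # assumed placeholder for keys absent from a record

    def encode(self, record):
        # Register unseen keys and values first, so both mappings grow dynamically.
        for key, value in record.items():
            self.key_index.setdefault(key, len(self.key_index))
            codes = self.value_codes.setdefault(key, {})
            codes.setdefault(value, len(codes))

        # Start from "missing" everywhere, then fill in the keys this record has.
        vec = np.full(len(self.key_index), self.missing, dtype=int)
        for key, value in record.items():
            vec[self.key_index[key]] = self.value_codes[key][value]
        return vec


encoder = DictVectorEncoder()
v1 = encoder.encode({"temperature": "51-62", "height": "1000-1500"})
v2 = encoder.encode({"height": "1500-2000"})  # no "temperature" field
```

One caveat with this sketch: the vector gets longer whenever a new key shows up, so vectors encoded earlier would need padding (or a second pass over the data) before they can be stacked into one 2-D array.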

Try a pandas Series; it was built for this.

import pandas as pd
s = pd.Series({'a':1, 'b':2, 'c':3})
s.values # a numpy array
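
For several scraped records at once, the same idea might extend to a DataFrame. The following is only a sketch under assumptions: the `records` list is made-up sample data in the shape the question describes, and categorical codes are one possible way to get the string-to-integer mapping (pandas assigns -1 to missing entries).

```python
import pandas as pd

# Hypothetical sample data shaped like the scraped dictionaries in the question.
records = [
    {"temperature": "51-62", "height": "1000-1500"},
    {"height": "1500-2000"},                      # no "temperature" field
    {"temperature": "51-62", "height": "1000-1500"},
]

df = pd.DataFrame(records)  # one column per key; missing keys become NaN

# Convert each column's strings to integer category codes; NaN becomes -1.
codes = df.apply(lambda col: col.astype("category").cat.codes)

arr = codes.to_numpy()  # 2-D numpy array, one row per record
```

The value-to-integer mapping for a column stays recoverable through `df[col].astype("category").cat.categories`, though the codes are recomputed from whatever data is present, so they are not stable if new values appear later.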
>>> # a sequence of dictionaries in an iterable called 'data'
>>> # assuming that not all dicts have the same keys
>>> import numpy as NP
>>> from pprint import pprint
>>> pprint(data)
  [{'x': 7.0, 'y1': 2.773, 'y2': 4.5, 'y3': 2.0},
   {'x': 0.081, 'y1': 1.171, 'y2': 4.44, 'y3': 2.576},
   {'y1': 0.671, 'y3': 3.173},
   {'x': 0.242, 'y2': 3.978, 'y3': 3.791},
   {'x': 0.323, 'y1': 2.088, 'y2': 3.602, 'y3': 4.43}]

>>> # get the unique keys across entire dataset
>>> keys = [list(dx.keys()) for dx in data]

>>> # flatten and coerce to 'set'
>>> keys = {itm for inner_list in keys for itm in inner_list}

>>> # create a map (look-up table) from each column index
>>> # of a NumPy array to a key (set order is arbitrary)

>>> LuT = dict(enumerate(keys))
>>> LuT
  {0: 'y2', 1: 'y3', 2: 'y1', 3: 'x'}

>>> idx = list(LuT.values())

>>> # pre-allocate NumPy array (100 rows is arbitrary)
>>> # number of columns is len(LuT.keys())

>>> D = NP.empty((100, len(LuT.keys())))

>>> keys = list(LuT.keys())
>>> keys
  [0, 1, 2, 3]

>>> # now populate the array from the original data using LuT
>>> for i, row in enumerate(data):
...     D[i,:] = [ row.get(LuT[k], 0) for k in keys ]

>>> D[:5,:]
  array([[ 4.5  ,  2.   ,  2.773,  7.   ],
         [ 4.44 ,  2.576,  1.171,  0.081],
         [ 0.   ,  3.173,  0.671,  0.   ],
         [ 3.978,  3.791,  0.   ,  0.242],
         [ 3.602,  4.43 ,  2.088,  0.323]])

Compare the last result (the first 5 rows of D) with data, above.

Note that the column ordering is preserved for every row (a single dictionary), even one with a less-than-complete set of keys. In other words, the first column of D always corresponds to the values keyed to y2, the second to y3, and so on, even if the given row in data has no value stored for that key. For example, look at the third row in data, which has only two key/value pairs: in the third row of D, the first and last columns are both 0. These columns correspond to keys y2 and x, which are in fact the two missing keys.
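
The same look-up-table idea can be written more compactly. This is a sketch rather than the answer's own code: the function name `dicts_to_array` and the `fill` parameter are hypothetical, and the keys are sorted so that the column order is reproducible instead of depending on set iteration order.

```python
import numpy as np


def dicts_to_array(data, fill=0.0):
    """Hypothetical helper: stack dicts into a 2-D array, one column per key."""
    # Union of all keys, sorted so the column order is fixed and reproducible.
    columns = sorted({key for row in data for key in row})
    # Missing keys are replaced by `fill` (an assumed default of 0.0).
    arr = np.array(
        [[row.get(key, fill) for key in columns] for row in data],
        dtype=float,
    )
    return columns, arr


data = [
    {"x": 7.0, "y1": 2.773, "y2": 4.5, "y3": 2.0},
    {"y1": 0.671, "y3": 3.173},
]
columns, D = dicts_to_array(data)
```

Passing `fill=np.nan` instead of 0 may be preferable when 0 is also a legitimate measured value, since it keeps missing entries distinguishable.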
