Convert a data structure with dtype=object to a numpy array with dtype=float64

Question

I am trying to convert the 'feature1' array from the following data structure into a numpy array so I can pass it to sklearn. However, I am going in circles: it always tells me that dtype=object is unsuitable, and I cannot convert it to the desired float64 format.

I want to extract all the 'feature1' entries as a list of numpy arrays of dtype=float64, instead of dtype=object, from the following structure.

vec is an object returned from an earlier computation.

>>> vec
[{'is_Primary': 1, 'feature1': [2, 2, 2, 0, 0.03333333333333333, 0], 'object_id': ObjectId('557beda51d41c8e4d1aeac25'), 'vectorized': 1},
{'is_Primary': 0, 'feature1': [2, 2, 1, 0, 0.5, 0], 'object_id': ObjectId('557beda51d41c8e4d1aeac25'), 'vectorized': 1}]

I tried the following:

>>> t = np.array(list(vec))
>>> t
array([ {'is_Primary': 0, 'feature1': [], 'object_id': ObjectId('557bcd881d41c8d9c5f5822f'), 'vectorized': 1},
   {'is_Primary': 0, 'feature1': [], 'object_id': ObjectId('557bcd881d41c8d9c5f58233'), 'vectorized': 1},
   {'is_Primary': 0, 'feature1': [], 'object_id': ObjectId('557bcd881d41c8d9c5f58237'), 'vectorized': 1},
   ...,
   {'is_Primary': 0, 'feature1': [], 'object_id': ObjectId('557beda61d41c8e4d1aead1f'), 'vectorized': 1},
   {'is_Primary': 1, 'feature1': [2, 2, 0, 0], 'object_id': ObjectId('557beda61d41c8e4d1aead1d'), 'vectorized': 1},
   {'is_Primary': 1, 'feature1': [], 'object_id': ObjectId('557beda61d41c8e4d1aead27'), 'vectorized': 1}], dtype=object)

Also,

>>> array = np.array([x['feature1'] for x in vec])

as suggested by another user, gives a similar output:

>>> array
array([[], [], [], ..., [], [2, 2, 0, 0], []], dtype=object)

I know I can access the contents of 'feature1' using array[i], but what I want is to convert the dtype=object to dtype=float64, producing a list/dict in which each row holds the 'feature1' of the corresponding entry in vec.

I also tried using a pandas dataframe, but to no avail.

>>> pandaseries = pd.Series(df['feature1']).convert_objects(convert_numeric=True)
>>> pandaseries
0     []
1     []
2     []
3     []
4     []
5     []
6     []
7     []
8     []
9     []
10    []
11    []
12    []
13    []
14    []
...
7021                                                   []
7022    [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 12, 2, 24...
7023                                                   []
7024                                                   []
7025                                                   []
7026                                                   []
7027                                                   []
7028    [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 12, 2, 24...
7029                                                   []
7030    [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 12, 2, 24...
7031                                                   []
7032                                       [2, 2, 0.1, 0]
7033                                                   []
7034                                         [2, 2, 0, 0]
7035                                                   []
Name: feature1, Length: 7036, dtype: object
    >>>

Again, dtype: object is returned. My guess would be to loop over each row and print a list out, but I am unable to do that. Maybe it is a newbie question. What am I doing wrong?

Thanks.

Answer 1

This:

array = numpy.array([x['feature1'] for x in vec])

Or you need to be clearer in your example...
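A minimal sketch of what the question literally asks for (a list of float64 arrays), assuming vec is a list of dicts shaped like the one in the question. The ObjectId values are replaced by plain strings here so the snippet runs without pymongo, and the three entries are illustrative. Converting each 'feature1' list on its own avoids the ragged-length problem, because every sublist becomes its own float64 array:

```python
import numpy as np

# Stand-in for the question's `vec`; object_id is a plain string here (assumption).
vec = [
    {'is_Primary': 1, 'feature1': [2, 2, 2, 0, 0.03333333333333333, 0],
     'object_id': '557beda51d41c8e4d1aeac25', 'vectorized': 1},
    {'is_Primary': 0, 'feature1': [],
     'object_id': '557bcd881d41c8d9c5f5822f', 'vectorized': 1},
    {'is_Primary': 1, 'feature1': [2, 2, 0, 0],
     'object_id': '557beda61d41c8e4d1aead1d', 'vectorized': 1},
]

# One float64 array per entry; an empty list becomes an empty float64 array.
features = [np.asarray(d['feature1'], dtype=np.float64) for d in vec]

for f in features:
    print(f.dtype, f.shape)
```

The result is a plain Python list, not a 2d array, so it still has to be padded or filtered before sklearn will accept it as a feature matrix.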

Answer 2

You can access the value of a dictionary item by using its key:

d ={'a':1}
d['a'] --> 1

To access items in a list, you can iterate over it or use its index:

a = [1,  2]

for thing in a:
    # do something with thing

a[0]  --> 1

map conveniently applies a function to all the items of an iterable and returns a list of the results (in Python 3 it returns a lazy iterator instead, so wrap it in list()). operator.itemgetter returns a function that will retrieve an item from an object.

import operator
import numpy as np
feature1 = operator.itemgetter('feature1')  # callable returning obj['feature1']
a = np.asarray(list(map(feature1, vec)))    # list() keeps this working on Python 3

vec = [{'is_Primary': 1, 'feature1': [2, 2, 2, 0, 0.03333333333333333, 0], 'object_id': ObjectId('557beda51d41c8e4d1aeac25'), 'vectorized': 1},
       {'is_Primary': 0, 'feature1': [2, 2, 1, 0, 0.5, 0], 'object_id': ObjectId('557beda51d41c8e4d1aeac25'), 'vectorized': 1}]

>>> a = np.asanyarray(list(map(feature1, vec)))
>>> a.shape
(2, 6)
>>> print a
[[ 2.          2.          2.          0.          0.03333333  0.        ]
 [ 2.          2.          1.          0.          0.5         0.        ]]
>>> 
>>> for thing in a[1,:]:
    print type(thing)

<type 'numpy.float64'>
<type 'numpy.float64'>
<type 'numpy.float64'>
<type 'numpy.float64'>
<type 'numpy.float64'>
<type 'numpy.float64'>
>>>
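The session above is Python 2. A hedged Python 3 sketch of the same idea follows, assuming every 'feature1' list has the same length (as in the two-entry example); the object_id field is omitted so the snippet is self-contained:

```python
from operator import itemgetter

import numpy as np

# Two equal-length entries, mirroring the example above (object_id omitted).
vec = [
    {'is_Primary': 1, 'feature1': [2, 2, 2, 0, 0.03333333333333333, 0], 'vectorized': 1},
    {'is_Primary': 0, 'feature1': [2, 2, 1, 0, 0.5, 0], 'vectorized': 1},
]

get_feature1 = itemgetter('feature1')

# map() is lazy in Python 3, so materialize it before handing it to numpy.
a = np.asarray(list(map(get_feature1, vec)), dtype=np.float64)

print(a.shape)
print(a.dtype)
```

This only yields a 2d float64 array because the sublists share a length; with ragged sublists the same call fails or falls back to dtype=object, depending on the arguments and NumPy version.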

Answer 3

Let's take as the starting point a list of lists or, equivalently, an object array of lists:

A = [[], [], [], [1,2,1], [], [2, 2, 0, 0], []]
A = np.array([[], [], [], [1,2,1], [], [2, 2, 0, 0], []], dtype=object)

If the sublists were all the same length, np.array([...]) would give you a 2d array, one row for each sublist, with columns matching their common length. But since they are unequal in length, it can only make a 1d array, where each element is a pointer to one of these sublists, i.e. dtype=object.
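The difference can be checked directly. This is a small illustration with made-up rows; dtype=object is passed explicitly for the ragged case because recent NumPy releases raise an error rather than infer it:

```python
import numpy as np

# Equal-length sublists: numpy builds a regular 2d numeric array.
equal = np.array([[1, 2, 1], [2, 2, 0]])
print(equal.shape, equal.dtype)

# Ragged sublists: only a 1d array of Python list objects is possible.
ragged = np.array([[], [1, 2, 1], [2, 2, 0, 0]], dtype=object)
print(ragged.shape, ragged.dtype)
print(type(ragged[1]))
```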

I can imagine 2 ways of constructing a 2d array:

pad each sublist to a common length
insert each sublist into an empty array of the appropriate size.

Basically it requires ordinary Python iteration; it's not a common enough task to have a whiz-bang numpy function.

For example:

In [346]: n=len(A)
In [348]: m=max([len(x) for x in A])
In [349]: AA=np.zeros((n,m),int)
In [350]: for i,x in enumerate(A):
   .....:     AA[i,:len(x)] = x
In [351]: AA
Out[351]: 
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [1, 2, 1, 0],
       [0, 0, 0, 0],
       [2, 2, 0, 0],
       [0, 0, 0, 0]])
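The example above allocates an int array, which would truncate fractional values such as the 0.5 in the question's data. A float64 variant can be sketched as follows; pad_rows and its fill parameter are hypothetical names introduced here, not part of numpy:

```python
import numpy as np

def pad_rows(rows, fill=0.0):
    """Hypothetical helper: copy ragged rows into an (n, m) float64 array."""
    n = len(rows)
    m = max((len(r) for r in rows), default=0)
    out = np.full((n, m), fill, dtype=np.float64)
    for i, r in enumerate(rows):
        out[i, :len(r)] = r  # left-align each row; the remainder keeps `fill`
    return out

A = [[], [], [], [1, 2, 1], [], [2, 2, 0.5, 0], []]
AA = pad_rows(A)
print(AA.dtype, AA.shape)
print(AA)
```

Passing fill=np.nan instead of 0.0 keeps padded cells distinguishable from genuine zeros, at the cost of needing an imputation step before most sklearn estimators.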

To get a sparse matrix:

In [352]: from scipy import sparse
In [353]: MA=sparse.coo_matrix(AA)
In [354]: MA
Out[354]: 
<7x4 sparse matrix of type '<class 'numpy.int32'>'
    with 5 stored elements in COOrdinate format>

Nothing magical, just straightforward sparse matrix construction. I suppose you could bypass the dense matrix.

There is a list-of-lists sparse format that looks a bit like your data.

In [356]: Ml=MA.tolil()

In [357]: Ml.rows
Out[357]: array([[], [], [], [0, 1, 2], [], [0, 1], []], dtype=object)

In [358]: Ml.data
Out[358]: array([[], [], [], [1, 2, 1], [], [2, 2], []], dtype=object)

Conceivably you could construct an empty sparse.lil_matrix((n,m)) matrix and set its .data attribute directly. But you'd also have to calculate the rows attribute.

You could also look at the data, row, and col attributes of the coo format matrix, and decide it would be easy to construct the equivalent from your A list of lists.

One way or another, you have to decide how the non-zero rows get padded to the full length.
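A sketch of bypassing the dense matrix, assuming left-aligned rows as in the padding example: the data, row and col arrays are built straight from the list of lists and passed to sparse.coo_matrix. The variable names are illustrative.

```python
import numpy as np
from scipy import sparse

A = [[], [], [], [1, 2, 1], [], [2, 2, 0, 0], []]

n = len(A)
m = max(len(x) for x in A)

# Flatten the values, recording the row and column index of each one.
data = np.array([v for x in A for v in x], dtype=np.float64)
row = np.array([i for i, x in enumerate(A) for _ in x])
col = np.array([j for x in A for j in range(len(x))])

MA = sparse.coo_matrix((data, (row, col)), shape=(n, m))
print(MA.toarray())
```

Note that the zeros inside [2, 2, 0, 0] are stored explicitly by this construction; filtering them out while building data, row and col would avoid that.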

3 answers

Answer 1
1 2015-06-20 15:27:18

Answer 2
0 2015-06-20 16:18:08

Answer 3
0 2015-06-21 03:03:33
