Converting a data structure of dtype=object to a numpy array of dtype=float64
I am trying to convert the 'feature1' array from the following data structure into a numpy array so I can pass it to sklearn. However, I am going in circles: it always tells me that dtype=object is unsuitable, and I am not able to convert it to the desired float64 format.
I want to extract all the 'feature1' values from the following structure as a list of numpy arrays of dtype=float64, instead of dtype=object.
vec is an object returned from an earlier computation.
>>> vec
[{'is_Primary': 1, 'feature1': [2, 2, 2, 0, 0.03333333333333333, 0], 'object_id': ObjectId('557beda51d41c8e4d1aeac25'), 'vectorized': 1},
{'is_Primary': 0, 'feature1': [2, 2, 1, 0, 0.5, 0], 'object_id': ObjectId('557beda51d41c8e4d1aeac25'), 'vectorized': 1}]
I tried the following:
>>> t = np.array(list(vec))
>>> t
array([ {'is_Primary': 0, 'feature1': [], 'object_id': ObjectId('557bcd881d41c8d9c5f5822f'), 'vectorized': 1},
{'is_Primary': 0, 'feature1': [], 'object_id': ObjectId('557bcd881d41c8d9c5f58233'), 'vectorized': 1},
{'is_Primary': 0, 'feature1': [], 'object_id': ObjectId('557bcd881d41c8d9c5f58237'), 'vectorized': 1},
...,
{'is_Primary': 0, 'feature1': [], 'object_id': ObjectId('557beda61d41c8e4d1aead1f'), 'vectorized': 1},
{'is_Primary': 1, 'feature1': [2, 2, 0, 0], 'object_id': ObjectId('557beda61d41c8e4d1aead1d'), 'vectorized': 1},
{'is_Primary': 1, 'feature1': [], 'object_id': ObjectId('557beda61d41c8e4d1aead27'), 'vectorized': 1}], dtype=object)
Also,
>>> array = np.array([x['feature1'] for x in vec])
as suggested by another user, gives a similar output:
>>> array
array([[], [], [], ..., [], [2, 2, 0, 0], []], dtype=object)
I know I can access the contents of 'feature1' using array[i], but what I want is to convert the dtype=object array to dtype=float64 and turn it into a list/dict in which each row holds the 'feature1' of the corresponding entry from vec.
I also tried using a pandas dataframe, but to no avail.
>>> pandaseries = pd.Series(df['feature1']).convert_objects(convert_numeric=True)
>>> pandaseries
0 []
1 []
2 []
3 []
4 []
5 []
6 []
7 []
8 []
9 []
10 []
11 []
12 []
13 []
14 []
...
7021 []
7022 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 12, 2, 24...
7023 []
7024 []
7025 []
7026 []
7027 []
7028 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 12, 2, 24...
7029 []
7030 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 12, 2, 24...
7031 []
7032 [2, 2, 0.1, 0]
7033 []
7034 [2, 2, 0, 0]
7035 []
Name: feature1, Length: 7036, dtype: object
>>>
Again, dtype: object is returned. My guess would be to loop over each row and print a list out, but I am unable to do that. Maybe it is a newbie question. What am I doing wrong?
Thanks.
This:
array = numpy.array([x['feature1'] for x in vec])
Or you need to be more clear in your example...
You can access the value of a dictionary item by using its key:
d = {'a': 1}
d['a'] --> 1
To access items in a list, you can iterate over it or use its index:
a = [1, 2]
for thing in a:
    pass  # do something with thing
a[0] --> 1
map conveniently applies a function to all the items of an iterable and returns the results (a list in Python 2, an iterator in Python 3, hence the list() call below). operator.itemgetter returns a function that will retrieve an item from an object.
import operator
import numpy as np
feature1 = operator.itemgetter('feature1')
a = np.asarray(list(map(feature1, vec)))
vec = [{'is_Primary': 1, 'feature1': [2, 2, 2, 0, 0.03333333333333333, 0], 'object_id': ObjectId('557beda51d41c8e4d1aeac25'), 'vectorized': 1},
{'is_Primary': 0, 'feature1': [2, 2, 1, 0, 0.5, 0], 'object_id': ObjectId('557beda51d41c8e4d1aeac25'), 'vectorized': 1}]
>>> a = np.asanyarray(list(map(feature1, vec)))
>>> a.shape
(2, 6)
>>> print a
[[ 2. 2. 2. 0. 0.03333333 0. ]
[ 2. 2. 1. 0. 0.5 0. ]]
>>>
>>> for thing in a[1,:]:
...     print type(thing)
...
<type 'numpy.float64'>
<type 'numpy.float64'>
<type 'numpy.float64'>
<type 'numpy.float64'>
<type 'numpy.float64'>
<type 'numpy.float64'>
>>>
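For readers on Python 3, where map returns an iterator and print is a function, the same idea can be sketched roughly as follows. The vec literal is a trimmed stand-in for the data in the question (object_id is left out because ObjectId comes from an external library), and the names get_feature1 and rows are only illustrative.

```python
import operator

import numpy as np

# Trimmed stand-in for the question's data; object_id is omitted on purpose.
vec = [
    {'is_Primary': 1, 'feature1': [2, 2, 2, 0, 0.03333333333333333, 0], 'vectorized': 1},
    {'is_Primary': 0, 'feature1': [2, 2, 1, 0, 0.5, 0], 'vectorized': 1},
]

# itemgetter builds a callable equivalent to: lambda d: d['feature1']
get_feature1 = operator.itemgetter('feature1')

# One float64 array per entry; this step works even if the lengths differ.
rows = [np.asarray(get_feature1(d), dtype=np.float64) for d in vec]

# Only when every row has the same length can the rows be stacked into 2d.
a = np.vstack(rows)

print(a.shape, a.dtype)
```

The list comprehension already gives the "list of float64 arrays" the question asks for; the np.vstack step is an extra that only applies to equal-length rows.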
Let's take as the starting point a list of lists, or equivalently an object array of lists:
A = [[], [], [], [1,2,1], [], [2, 2, 0, 0], []]
A = array([[], [], [], [1,2,1], [], [2, 2, 0, 0], []], dtype=object)
If the sublists were all the same length, np.array([...]) would give you a 2d array, one row for each sublist, and columns matching their common length. But since they are unequal in length, it can only make a 1d array, where each element is a pointer to one of these sublists, i.e. dtype=object.
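A small sketch of that difference, using made-up sublists rather than the asker's data. Note that recent NumPy releases raise an error for ragged input unless dtype=object is passed explicitly, so the sketch passes it.

```python
import numpy as np

# Equal-length sublists: numpy infers a regular 2d numeric array.
equal = np.array([[2, 2, 2, 0, 0.5, 0], [2, 2, 1, 0, 0.5, 0]])
print(equal.shape, equal.dtype)    # (2, 6) float64

# Unequal-length sublists: only a 1d array of references to lists is possible.
ragged = np.array([[], [1, 2, 1], [2, 2, 0, 0]], dtype=object)
print(ragged.shape, ragged.dtype)  # (3,) object
print(type(ragged[1]))             # each element is still a plain Python list
```

This is why calling astype(float) on the object array does not help: there is no rectangular shape for the numbers to go into until the rows are padded or otherwise made equal in length.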
I can imagine 2 ways of constructing a 2d array:
Basically it requires common Python iteration; it's not a common enough task to have a whiz-bang numpy function.
For example:
In [346]: n=len(A)
In [348]: m=max([len(x) for x in A])
In [349]: AA=np.zeros((n,m),int)
In [350]: for i,x in enumerate(A):
   .....:     AA[i,:len(x)] = x
In [351]: AA
Out[351]:
array([[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[1, 2, 1, 0],
[0, 0, 0, 0],
[2, 2, 0, 0],
[0, 0, 0, 0]])
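Since the question asks for float64 rather than int, the same padding loop can be wrapped in a small function. This is only a sketch: pad_rows and its fill parameter are hypothetical names, and padding with zeros is an assumption about what the downstream model should see for missing values.

```python
import numpy as np


def pad_rows(rows, fill=0.0):
    """Copy ragged rows into a padded 2d float64 array (hypothetical helper)."""
    n = len(rows)
    m = max((len(r) for r in rows), default=0)
    out = np.full((n, m), fill, dtype=np.float64)
    for i, r in enumerate(rows):
        out[i, :len(r)] = r  # empty rows simply stay at the fill value
    return out


A = [[], [], [], [1, 2, 1], [], [2, 2, 0, 0], []]
AA = pad_rows(A)
print(AA.shape, AA.dtype)
```

With the question's data this would be called as pad_rows([x['feature1'] for x in vec]).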
To get a sparse matrix:
In [352]: from scipy import sparse
In [353]: MA=sparse.coo_matrix(AA)
In [354]: MA
Out[354]:
<7x4 sparse matrix of type '<class 'numpy.int32'>'
with 5 stored elements in COOrdinate format>
Nothing magical, just straightforward sparse matrix construction. I suppose you could bypass the dense matrix.
There is a list-of-lists sparse format that looks a bit like your data.
In [356]: Ml=MA.tolil()
In [357]: Ml.rows
Out[357]: array([[], [], [], [0, 1, 2], [], [0, 1], []], dtype=object)
In [358]: Ml.data
Out[358]: array([[], [], [], [1, 2, 1], [], [2, 2], []], dtype=object)
Conceivably you could construct an empty sparse.lil_matrix((n,m)) matrix and set its .data attribute directly. But you'd also have to calculate the rows attribute.
You could also look at the data, row, col attributes of the coo format matrix, and decide it would be easy to construct the equivalent from your A list of lists.
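One possible way to do that, skipping the dense intermediate, is sketched here. It assumes each value's column is simply its position within its sublist, which matches the zero-padded layout shown earlier; the variable names are illustrative.

```python
import numpy as np
from scipy import sparse

A = [[], [], [], [1, 2, 1], [], [2, 2, 0, 0], []]

# Collect (value, row index, column index) triples for the non-zero entries.
data, row, col = [], [], []
for i, x in enumerate(A):
    for j, value in enumerate(x):
        if value != 0:
            data.append(value)
            row.append(i)
            col.append(j)

n = len(A)                  # number of rows
m = max(len(x) for x in A)  # width of the longest sublist

MA = sparse.coo_matrix(
    (np.asarray(data, dtype=np.float64), (row, col)), shape=(n, m)
)
print(MA.shape, MA.nnz)
```

Calling MA.tocsr() afterwards gives a format that is usually more convenient for row slicing and arithmetic.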
One way or other you have to decide how the non-zero rows get padded to the full length. 您必须决定采用哪种方式将非零行填充为完整长度。