Converting a data structure of dtype=object to numpy array of dtype=float64

I am trying to convert the 'feature1' array from the following data structure into a numpy array so I can feed it to sklearn. However, I am going in circles: it always tells me that dtype=object is unsuitable, and I am not able to convert it to the desired float64 format.

I want to extract all the 'feature1' values from the following structure as a list of numpy arrays of dtype=float64 instead of dtype=object.

vec is an object returned from an earlier computation.

>>> vec
[{'is_Primary': 1, 'feature1': [2, 2, 2, 0, 0.03333333333333333, 0], 'object_id': ObjectId('557beda51d41c8e4d1aeac25'), 'vectorized': 1},
{'is_Primary': 0, 'feature1': [2, 2, 1, 0, 0.5, 0], 'object_id': ObjectId('557beda51d41c8e4d1aeac25'), 'vectorized': 1}]

I tried the following:

>>> t = np.array(list(vec))
>>> t
array([{'is_Primary': 0, 'feature1': [], 'object_id': ObjectId('557bcd881d41c8d9c5f5822f'), 'vectorized': 1},
   {'is_Primary': 0, 'feature1': [], 'object_id': ObjectId('557bcd881d41c8d9c5f58233'), 'vectorized': 1},
   {'is_Primary': 0, 'feature1': [], 'object_id': ObjectId('557bcd881d41c8d9c5f58237'), 'vectorized': 1},
   ...,
   {'is_Primary': 0, 'feature1': [], 'object_id': ObjectId('557beda61d41c8e4d1aead1f'), 'vectorized': 1},
   {'is_Primary': 1, 'feature1': [2, 2, 0, 0], 'object_id': ObjectId('557beda61d41c8e4d1aead1d'), 'vectorized': 1},
   {'is_Primary': 1, 'feature1': [], 'object_id': ObjectId('557beda61d41c8e4d1aead27'), 'vectorized': 1}], dtype=object)

Also,

>>> array = np.array([x['feature1'] for x in vec])

as suggested by another user, gives a similar output:

>>> array
array([[], [], [], ..., [], [2, 2, 0, 0], []], dtype=object)

I know I can access the contents of 'feature1' using array[i], but what I want is to convert the dtype=object to dtype=float64, and make it into a list/dict in which each row holds the 'feature1' of the corresponding entry from vec.

I also tried using a pandas dataframe, but to no avail.

>>> pandaseries = pd.Series(df['feature1']).convert_objects(convert_numeric=True)
>>> pandaseries
0     []
1     []
2     []
3     []
4     []
5     []
6     []
7     []
8     []
9     []
10    []
11    []
12    []
13    []
14    []
...
7021                                                   []
7022    [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 12, 2, 24...
7023                                                   []
7024                                                   []
7025                                                   []
7026                                                   []
7027                                                   []
7028    [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 12, 2, 24...
7029                                                   []
7030    [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 12, 2, 24...
7031                                                   []
7032                                       [2, 2, 0.1, 0]
7033                                                   []
7034                                         [2, 2, 0, 0]
7035                                                   []
Name: feature1, Length: 7036, dtype: object
    >>> 

Again, dtype: object is returned. My guess would be to loop over each row and print a list out, but I am unable to do that. Maybe it is a newbie question. What am I doing wrong?

Thanks.

This:

array = numpy.array([x['feature1'] for x in vec])

Or you need to be more clear in your example...
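
If the goal is literally "a list of numpy arrays of dtype=float64", one possible approach is to convert each 'feature1' list on its own rather than stacking them. The sketch below assumes vec is a list of dicts as shown in the question; the sample data here is a shortened, hypothetical stand-in without the ObjectId fields.

```python
import numpy as np

# Hypothetical stand-in for vec; the real entries also carry an ObjectId.
vec = [
    {'is_Primary': 1, 'feature1': [2, 2, 2, 0, 0.03333333333333333, 0], 'vectorized': 1},
    {'is_Primary': 0, 'feature1': [], 'vectorized': 1},
    {'is_Primary': 1, 'feature1': [2, 2, 0, 0], 'vectorized': 1},
]

# One float64 array per entry. The lengths differ, so the arrays are kept
# in a plain Python list instead of being stacked into a single 2d array.
features = [np.asarray(d['feature1'], dtype=np.float64) for d in vec]

for f in features:
    print(f.dtype, f.shape)
```

Each element is a proper float64 array, but sklearn estimators generally expect one rectangular 2d array, so rows of unequal length still have to be padded or filtered before fitting.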

You can access the value of a dictionary item by using its key:

d = {'a': 1}
d['a']  # --> 1

To access items in a list, you can iterate over it or use its index:

a = [1, 2]

for thing in a:
    pass  # do something with thing

a[0]  # --> 1

map conveniently applies a function to all the items of an iterable and returns a list of the results (in Python 3 it returns an iterator instead, so wrap it in list()). operator.itemgetter returns a function that will retrieve an item from an object.

import operator
import numpy as np
feature1 = operator.itemgetter('feature1')
a = np.asarray(list(map(feature1, vec)))

vec = [{'is_Primary': 1, 'feature1': [2, 2, 2, 0, 0.03333333333333333, 0], 'object_id': ObjectId('557beda51d41c8e4d1aeac25'), 'vectorized': 1},
       {'is_Primary': 0, 'feature1': [2, 2, 1, 0, 0.5, 0], 'object_id': ObjectId('557beda51d41c8e4d1aeac25'), 'vectorized': 1}]

>>> a = np.asanyarray(list(map(feature1, vec)))
>>> a.shape
(2, 6)
>>> print a
[[ 2.          2.          2.          0.          0.03333333  0.        ]
 [ 2.          2.          1.          0.          0.5         0.        ]]
>>> 
>>> for thing in a[1,:]:
    print type(thing)

<type 'numpy.float64'>
<type 'numpy.float64'>
<type 'numpy.float64'>
<type 'numpy.float64'>
<type 'numpy.float64'>
<type 'numpy.float64'>
>>> 

Let's take as the starting point a list of lists, or equivalently an object array of lists:

A = [[], [], [], [1,2,1], [], [2, 2, 0, 0], []]
A = np.array([[], [], [], [1,2,1], [], [2, 2, 0, 0], []], dtype=object)

If the sublists were all the same length, np.array([...]) would give you a 2d array, one row for each sublist, and columns matching their common length. But since they are unequal in length, it can only make a 1d array in which each element is a pointer to one of these sublists, i.e. dtype=object.
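
The difference can be seen in a small sketch. The sample lists are made up for illustration; the last step assumes that forcing dtype=np.float64 on ragged input raises ValueError, which is the behaviour I would expect from NumPy.

```python
import numpy as np

# Equal-length sublists: NumPy builds a regular 2d float64 array.
same = np.array([[2, 2, 1], [0, 0.5, 0]])
print(same.shape, same.dtype)

# Unequal lengths: only a 1d array of Python list objects is possible.
ragged = np.array([[1, 2, 1], [2, 2, 0, 0]], dtype=object)
print(ragged.shape, ragged.dtype)

# Forcing float64 on ragged input fails instead of silently giving objects.
try:
    np.array([[1, 2, 1], [2, 2, 0, 0]], dtype=np.float64)
    forced_ok = True
except ValueError:
    forced_ok = False
print(forced_ok)
```

Passing an explicit dtype is therefore a cheap way to find out early that the rows are not all the same length.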

I can imagine 2 ways of constructing a 2d array:

  • pad each sublist to a common length
  • insert each sublist into an empty array of the appropriate size.

Basically it requires ordinary Python iteration; it's not a common enough task to have a whiz-bang numpy function.

For example:

In [346]: n=len(A)
In [348]: m=max([len(x) for x in A])
In [349]: AA=np.zeros((n,m),int)
In [350]: for i,x in enumerate(A):
   .....:     AA[i,:len(x)] = x
In [351]: AA
Out[351]: 
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [1, 2, 1, 0],
       [0, 0, 0, 0],
       [2, 2, 0, 0],
       [0, 0, 0, 0]])
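
The session above uses the second option (filling a preallocated array). The first option, padding each sublist to a common length, could be sketched with itertools.zip_longest. The choice of 0.0 as the fill value and float64 as the dtype are assumptions made to match what the question asks for.

```python
from itertools import zip_longest

import numpy as np

A = [[], [], [], [1, 2, 1], [], [2, 2, 0, 0], []]

# zip_longest pads the short sublists with fillvalue, but it yields
# columns rather than rows, so the result is transposed back.
padded = np.array(list(zip_longest(*A, fillvalue=0.0)), dtype=np.float64).T

print(padded.shape, padded.dtype)
print(padded[3])
```

Both options give the same zero-padded layout; the preallocated version is easier to adapt if you want a fill value other than a constant.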

To get a sparse matrix:

In [352]: from scipy import sparse
In [353]: MA=sparse.coo_matrix(AA)
In [354]: MA
Out[354]: 
<7x4 sparse matrix of type '<class 'numpy.int32'>'
    with 5 stored elements in COOrdinate format>

Nothing magical, just straightforward sparse matrix construction. I suppose you could bypass the dense matrix.

There is a list-of-lists sparse format that looks a bit like your data.

In [356]: Ml=MA.tolil()

In [357]: Ml.rows
Out[357]: array([[], [], [], [0, 1, 2], [], [0, 1], []], dtype=object)

In [358]: Ml.data
Out[358]: array([[], [], [], [1, 2, 1], [], [2, 2], []], dtype=object)

Conceivably you could construct an empty sparse.lil_matrix((n,m)) matrix and set its .data attribute directly. But you'd also have to calculate the rows attribute.
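
A rough sketch of that idea follows. It assumes that the .rows and .data attributes of a lil_matrix are per-row lists of column indices and values, as the output above suggests, and that writing to them directly is acceptable; this relies on the internal layout rather than on a dedicated constructor. Zeros are skipped so that the stored entries match the Ml.rows and Ml.data output shown earlier.

```python
import numpy as np
from scipy import sparse

A = [[], [], [], [1, 2, 1], [], [2, 2, 0, 0], []]
n = len(A)
m = max(len(x) for x in A)

Ml = sparse.lil_matrix((n, m), dtype=np.float64)
for i, x in enumerate(A):
    # Column indices of the non-zero entries in this sublist.
    cols = [j for j, v in enumerate(x) if v != 0]
    Ml.rows[i] = cols
    Ml.data[i] = [float(x[j]) for j in cols]

print(Ml.toarray())
```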

You could also look at the data, row, and col attributes of the coo format matrix, and decide it would be easy to construct the equivalent from your A list of lists.
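
As a sketch of that route, the three lists can be collected in one pass over A and handed to sparse.coo_matrix, which skips the dense intermediate array. Dropping explicit zeros and using float64 are choices made here for illustration, not something the format requires.

```python
import numpy as np
from scipy import sparse

A = [[], [], [], [1, 2, 1], [], [2, 2, 0, 0], []]
n = len(A)
m = max(len(x) for x in A)

data, row, col = [], [], []
for i, x in enumerate(A):
    for j, v in enumerate(x):
        if v != 0:  # store only the non-zero entries
            data.append(v)
            row.append(i)
            col.append(j)

MA = sparse.coo_matrix((data, (row, col)), shape=(n, m), dtype=np.float64)
print(MA.shape, MA.nnz)
```

MA.toarray() gives the same zero-padded layout as the dense AA above, and MA.tocsr() converts it to a row-oriented format.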

One way or another you have to decide how the non-zero rows get padded to the full length.
