
Vectorize Pandas Dataframe into Numpy Array

I need to convert a pandas DataFrame into a list of arrays, where each array is a list of lists.

Sample:

import pandas as pd
df = pd.DataFrame([[1,2,3],[2,2,4],[3,2,4]])

I know there is the as_matrix() function (removed in pandas 1.0, where df.to_numpy() or df.values does the same job), which returns the following:

df.as_matrix()
# result: array([[1, 2, 3],
#                [2, 2, 4],
#                [3, 2, 4]])

However, I require something in this format:

  [array([[1], [2], [3]]),
   array([[2], [2], [4]]),
   array([[3], [2], [4]])]

That is, I need a list of arrays, where each array is a list of lists: the innermost lists contain a single element each, and each array represents one row of the dataframe. The effect is that each row of the dataframe becomes a column vector of dimension 3.

This is especially useful when I need to do matrix / vector operations in numpy. My data source is currently in .csv format, and I am struggling to find a way to convert a dataframe into vectors.

Extract the underlying array data, add a new axis as the last dimension, and then split along the first axis with np.vsplit -

np.vsplit(df.values[...,None],df.shape[0])

Sample run -

In [327]: df
Out[327]: 
   0  1  2
0  1  2  3
1  2  2  4
2  3  2  4

In [328]: expected_output = [np.array([[1], [2], [3]]),
     ...: np.array([[2], [2], [4]]),
     ...: np.array([[3], [2], [4]])]

In [329]: expected_output
Out[329]: 
[array([[1],
        [2],
        [3]]), array([[2],
        [2],
        [4]]), array([[3],
        [2],
        [4]])]

In [330]: np.vsplit(df.values[...,None],df.shape[0])
Out[330]: 
[array([[[1],
         [2],
         [3]]]), array([[[2],
         [2],
         [4]]]), array([[[3],
         [2],
         [4]]])]
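Printing the shapes makes the difference between the expected output and the np.vsplit result easier to see. This is a minimal sketch on the sample frame from the question; the names `extended` and `pieces` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 2, 4], [3, 2, 4]])

# df.values has shape (3, 3); indexing with [..., None] appends a trailing axis.
extended = df.values[..., None]
print(extended.shape)  # (3, 3, 1)

# Splitting the first axis into df.shape[0] pieces leaves that axis with length 1,
# so every piece has shape (1, 3, 1) rather than the expected (3, 1).
pieces = np.vsplit(extended, df.shape[0])
print(len(pieces), pieces[0].shape)  # 3 (1, 3, 1)
```

Each piece carries a leading axis of length 1, which is the extra dimension visible in Out[330].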

If you are working with NumPy functions, then in most scenarios you should be able to skip the splitting and use the extended array directly.
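As one illustration of working on the extended array directly, np.matmul treats a (3, 3, 1) array as a stack of three column vectors, so a single call can apply a matrix to every row vector. The matrix `A` below is a hypothetical example and is not part of the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 2, 4], [3, 2, 4]])
extended = df.values[..., None]      # shape (3, 3, 1): three column vectors

A = 2 * np.eye(3, dtype=int)         # hypothetical 3x3 matrix, used only for illustration

# One batched call: A is applied to each (3, 1) column vector in the stack.
batched = A @ extended               # shape (3, 3, 1)

# The same computation done vector by vector on a split list.
looped = [A @ v for v in extended]

print(batched.shape)                 # (3, 3, 1)
print(batched[0].ravel())            # [2 4 6]
```

Both routes give the same numbers; the batched form just avoids building a Python list first.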

Now, under the hood np.vsplit makes use of np.array_split, and that is basically a loop. So a slightly more performant way is to avoid the extra function overhead by calling np.array_split directly, like so -

np.array_split(df.values[...,None],df.shape[0])
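A quick check that the two calls produce the same pieces, sketched on the sample frame. The variable names are illustrative, and np.array_split is left at its default axis of 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 2, 4], [3, 2, 4]])
extended = df.values[..., None]

via_vsplit = np.vsplit(extended, df.shape[0])
via_array_split = np.array_split(extended, df.shape[0])  # default axis=0

# Same number of pieces, same shapes, same values.
same = len(via_vsplit) == len(via_array_split) and all(
    np.array_equal(a, b) for a, b in zip(via_vsplit, via_array_split)
)
print(same)  # True
```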

Note that each piece has one more dimension than the expected output. If you want a squeezed version, use a list comprehension on the new-axis extended array, like so -

In [357]: [i for i in df.values[...,None]]
Out[357]: 
[array([[1],
        [2],
        [3]]), array([[2],
        [2],
        [4]]), array([[3],
        [2],
        [4]])]

Thus, another way would be to add the new axis within the loop -

[i[...,None] for i in df.values]
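The two list comprehensions can be compared side by side. This sketch assumes the sample frame from the question; `rows_a` and `rows_b` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 2, 4], [3, 2, 4]])

# Add the axis once, then iterate over the first axis.
rows_a = [i for i in df.values[..., None]]

# Iterate over the rows, adding the axis to each row in turn.
rows_b = [i[..., None] for i in df.values]

print(rows_a[0].shape, rows_b[0].shape)  # (3, 1) (3, 1)
print(all(np.array_equal(a, b) for a, b in zip(rows_a, rows_b)))  # True
```

Both give a list of three (3, 1) arrays, matching the format requested in the question.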

First convert your DataFrame to a matrix. Then add a dimension and convert it to a list.

Try:

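import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,3],[2,2,4],[3,2,4]])
my_matrix = df.to_numpy()  # as_matrix() was removed in pandas 1.0
# add a trailing axis, then iterate over the first axis to get a list of (3, 1) arrays
my_list_of_arrays_of_list_lists = list(np.expand_dims(my_matrix, axis=2))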

my_list_of_arrays_of_list_lists represents what you are looking for and gives you:

Out[42]: [array([[1],[2],[3]]),
          array([[2],[2],[4]]),
          array([[3],[2],[4]])]
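Since the question mentions that the data arrives as a .csv file, the same conversion can be sketched starting from pd.read_csv. The CSV text below is made up for illustration and is read from an in-memory buffer instead of a real file; header=None assumes the file has no header row, and to_numpy() assumes pandas 0.24 or later.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content standing in for the real data file.
csv_text = "1,2,3\n2,2,4\n3,2,4\n"

# header=None: treat every line as data (an assumption about the file layout).
df = pd.read_csv(io.StringIO(csv_text), header=None)

# Add a trailing axis, then turn the first axis into a Python list.
vectors = list(np.expand_dims(df.to_numpy(), axis=2))

print(len(vectors), vectors[0].shape)  # 3 (3, 1)
```

For a file on disk, the path would replace the io.StringIO buffer in the pd.read_csv call.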
