How to convert a Pandas dataframe to a numpy array with column names

Question

This must use vectorized methods, nothing iterative.

I would like to create a numpy array from a pandas dataframe.

My code:

import pandas as pd
_df = pd.DataFrame({'item': ['book', 'book' , 'car', 'car', 'bike', 'bike'], 'color': ['green', 'blue' , 'red', 'green' , 'blue', 'red'], 'val' : [-22.7, -109.6, -57.19, -11.2, -25.6, -33.61]})
 
item     color    val
book    green   -22.70
book    blue    -109.60
car     red     -57.19
car     green   -11.20
bike    blue    -25.60
bike    red     -33.61

There are about 12k million rows.

I need to create a numpy array like:

item    green    blue     red
book    -22.70  -109.60   null
car     -11.20   null     -57.19
bike    null    -25.60    -33.61

Each row is an item name and each column is a color name. The order of the items and colors is not important. However, a numpy array has no row or column names, and I need to keep the item and color name for each value so that I know what each value in the array represents.

For example:

 How do I know that -57.19 belongs to "car" and "red" in the numpy array?

So, I need to create a dictionary to keep the mapping between:

  item <--> row index in the numpy array
  color <--> col index in the numpy array

I do not want to use iteritems and itertuples because they are not efficient for large dataframes, as discussed in "How to iterate over rows in a DataFrame in Pandas", "Python Pandas iterate over rows and access column names", and "Does pandas iterrows have performance issues?".

I prefer a numpy vectorization solution for this.

How can I efficiently convert the pandas dataframe to a numpy array? The array will also be transformed to a torch.tensor.

Thanks.

Answer 1

Do a quick search for a val by its "item" and "color" with one of the following options:
1. Use pandas Boolean indexing
2. Convert the dataframe into a numpy.recarray using pandas.DataFrame.to_records, and also use Boolean indexing
.item is a method for both pandas and numpy, so don't use 'item' as a column name. It has been changed to '_item'.
As an FYI, numpy is a pandas dependency, and much of pandas vectorized functionality directly corresponds to numpy.

import pandas as pd
import numpy as np

# test data
df = pd.DataFrame({'_item': ['book', 'book' , 'car', 'car', 'bike', 'bike'], 'color': ['green', 'blue' , 'red', 'green' , 'blue', 'red'], 'val' : [-22.7, -109.6, -57.19, -11.2, -25.6, -33.61]})

# Use pandas Boolean indexing to select the matching row
selected = df[(df._item == 'book') & (df.color == 'blue')]

# print(selected)
_item color    val
 book  blue -109.6

# Alternatively, create a recarray
v = df.to_records(index=False)

# display(v)
rec.array([('book', 'green',  -22.7 ), ('book', 'blue', -109.6 ),
           ('car', 'red',  -57.19), ('car', 'green',  -11.2 ),
           ('bike', 'blue',  -25.6 ), ('bike', 'red',  -33.61)],
          dtype=[('_item', 'O'), ('color', 'O'), ('val', '<f8')])

# search the recarray
selected = v[(v._item == 'book') & (v.color == 'blue')]

# print(selected)
[('book', 'blue', -109.6)]
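
If the same (item, color) lookup is repeated many times, one possible alternative to rebuilding a Boolean mask each time is to index the values by both labels. This is only a sketch and not part of the original answer; it assumes every (_item, color) pair occurs at most once, and the names s and val are illustrative.

```python
import pandas as pd

# same test data as above
df = pd.DataFrame({'_item': ['book', 'book', 'car', 'car', 'bike', 'bike'],
                   'color': ['green', 'blue', 'red', 'green', 'blue', 'red'],
                   'val': [-22.7, -109.6, -57.19, -11.2, -25.6, -33.61]})

# Series keyed by a sorted (_item, color) MultiIndex
# assumption: each (_item, color) pair is unique
s = df.set_index(['_item', 'color'])['val'].sort_index()

# label-based lookup of a single value
val = s.loc[('book', 'blue')]
print(val)
```

The index is built once, so later lookups do not scan the whole column the way a fresh Boolean mask does.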

Update in response to OP edit

You must first reshape the dataframe using pandas.DataFrame.pivot, and then use the previously mentioned methods.

dfp = df.pivot(index='_item', columns='color', values='val')

# display(dfp)
color   blue  green    red
_item                     
bike   -25.6    NaN -33.61
book  -109.6  -22.7    NaN
car      NaN  -11.2 -57.19

# create a numpy recarray
v = dfp.to_records(index=True)

# display(v)
rec.array([('bike',  -25.6,   nan, -33.61),
           ('book', -109.6, -22.7,    nan),
           ('car',    nan, -11.2, -57.19)],
          dtype=[('_item', 'O'), ('blue', '<f8'), ('green', '<f8'), ('red', '<f8')])

# select data
selected = v.blue[(v._item == 'book')]

# print(selected)
array([-109.6])
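
The question also asks for a plain 2-D array plus dictionaries that map item and color names to row and column positions. A minimal sketch of one way to get there from the pivoted frame follows; the names arr, item_to_row and color_to_col are illustrative, and pivot assumes each (_item, color) pair appears only once.

```python
import numpy as np
import pandas as pd

# same test data as above
df = pd.DataFrame({'_item': ['book', 'book', 'car', 'car', 'bike', 'bike'],
                   'color': ['green', 'blue', 'red', 'green', 'blue', 'red'],
                   'val': [-22.7, -109.6, -57.19, -11.2, -25.6, -33.61]})

# assumption: no duplicate (_item, color) pairs, otherwise pivot raises
dfp = df.pivot(index='_item', columns='color', values='val')

# plain 2-D float array; missing combinations are NaN
arr = dfp.to_numpy()

# label -> position mappings; these loop over the unique labels only,
# not over the rows of the original dataframe
item_to_row = {name: i for i, name in enumerate(dfp.index)}
color_to_col = {name: j for j, name in enumerate(dfp.columns)}

# which value belongs to "car" and "red"?
value = arr[item_to_row['car'], color_to_col['red']]
print(value)  # -57.19
```

If PyTorch is installed, torch.from_numpy(arr) should accept this float array directly, and the two dictionaries stay valid for the resulting tensor because the row and column order does not change.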

Solution 1
2, accepted, 2020-11-14 23:52:40
