How to convert a pandas dataframe into a numpy array with the column names
I would like to create a numpy array from a pandas dataframe.
My code:
import pandas as pd
_df = pd.DataFrame({'item': ['book', 'book', 'car', 'car', 'bike', 'bike'], 'color': ['green', 'blue', 'red', 'green', 'blue', 'red'], 'val': [-22.7, -109.6, -57.19, -11.2, -25.6, -33.61]})
item color val
book green -22.70
book blue -109.60
car red -57.19
car green -11.20
bike blue -25.60
bike red -33.61
There are about 12k million rows.
I need to create a numpy array like:
item green blue red
book -22.70 -109.60 null
car -11.20 null -57.19
bike null -25.60 -33.61
Each row is an item name and each column is a color name. The order of the items and colors is not important.
But a numpy array has no row or column names, and I need to keep the item and color name for each value so that I know what each value in the array represents.
For example, how do I know that -57.19 is for "car" and "red" in the numpy array?
So, I need to create a dictionary to keep the mapping between:
item <--> row index in the numpy array
color <--> col index in the numpy array
I do not want to use iteritems and itertuples because they are not efficient for a large dataframe; see "How to iterate over rows in a DataFrame in Pandas", "Python Pandas iterate over rows and access column names", and "Does pandas iterrows have performance issues?".
I would prefer a vectorized numpy solution.
How can I efficiently convert the pandas dataframe to a numpy array? The array will also be converted to a torch.tensor.
Thanks.
Convert the dataframe to a numpy.recarray using pandas.DataFrame.to_records, and also use Boolean indexing.
.item is a method for both pandas and numpy, so don't use 'item' as a column name. It has been changed to '_item'.
numpy is a pandas dependency, and much of pandas' vectorized functionality directly corresponds to numpy.
import pandas as pd
import numpy as np
# test data
df = pd.DataFrame({'_item': ['book', 'book' , 'car', 'car', 'bike', 'bike'], 'color': ['green', 'blue' , 'red', 'green' , 'blue', 'red'], 'val' : [-22.7, -109.6, -57.19, -11.2, -25.6, -33.61]})
# Use pandas Boolean indexing to select data
selected = df[(df._item == 'book') & (df.color == 'blue')]
# print(selected)
_item color val
book blue -109.6
# Alternatively, create a recarray
v = df.to_records(index=False)
# display(v)
rec.array([('book', 'green', -22.7 ), ('book', 'blue', -109.6 ),
('car', 'red', -57.19), ('car', 'green', -11.2 ),
('bike', 'blue', -25.6 ), ('bike', 'red', -33.61)],
dtype=[('_item', 'O'), ('color', 'O'), ('val', '<f8')])
# search the recarray
selected = v[(v._item == 'book') & (v.color == 'blue')]
# print(selected)
[('book', 'blue', -109.6)]
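The note about the 'item' column name can be checked with a small sketch. It assumes a hypothetical two-row frame (df_clash) that keeps the clashing name, and shows that attribute access on the recarray picks up the ndarray method, while bracket access still reaches the field.

```python
import pandas as pd

# Hypothetical frame that keeps the clashing column name 'item'.
df_clash = pd.DataFrame({'item': ['book', 'car'], 'val': [-22.7, -57.19]})
rec = df_clash.to_records(index=False)

# Attribute access resolves to the ndarray method, not the field.
is_method = callable(rec.item)

# Bracket access returns the field values regardless of the name.
items = rec['item']
selected = rec[rec['item'] == 'car']
```

So renaming the column is a convenience for attribute-style access rather than a hard requirement; bracket-style field access works with either name.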
Reshape the data with pandas.DataFrame.pivot, and then use the previously mentioned methods.
dfp = df.pivot(index='_item', columns='color', values='val')
# display(dfp)
color blue green red
_item
bike -25.6 NaN -33.61
book -109.6 -22.7 NaN
car NaN -11.2 -57.19
# create a numpy recarray
v = dfp.to_records(index=True)
# display(v)
rec.array([('bike', -25.6, nan, -33.61),
('book', -109.6, -22.7, nan),
('car', nan, -11.2, -57.19)],
dtype=[('_item', 'O'), ('blue', '<f8'), ('green', '<f8'), ('red', '<f8')])
# select data
selected = v.blue[(v._item == 'book')]
# print(selected)
array([-109.6])
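If a plain 2-D numpy array plus lookup dictionaries is wanted, as the question describes, one possible sketch is to take the values of the pivoted frame and build the mappings from its index and columns. The names arr, row_of and col_of are illustrative and not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    '_item': ['book', 'book', 'car', 'car', 'bike', 'bike'],
    'color': ['green', 'blue', 'red', 'green', 'blue', 'red'],
    'val': [-22.7, -109.6, -57.19, -11.2, -25.6, -33.61],
})
dfp = df.pivot(index='_item', columns='color', values='val')

# 2-D float array; NaN marks item/color pairs with no value.
arr = dfp.to_numpy()

# Map each label to its position in the array.
row_of = {name: i for i, name in enumerate(dfp.index)}
col_of = {name: j for j, name in enumerate(dfp.columns)}

# Look up the value for ("car", "red") without any row iteration.
value = arr[row_of['car'], col_of['red']]
```

Because the dictionaries are built from the pivoted labels and not from the rows, their size depends on the number of distinct items and colors, not on the row count. For the torch.tensor step mentioned in the question, torch.from_numpy(arr) should accept this array, although that step is not exercised here.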