Creating a mixed-type Pandas Dataframe from a numpy array of type "object"

Question

I have a pandas Dataframe with mixed datatypes (float64 and strings). To use it in a sklearn Pipeline, I need to convert it to a numpy array. At the end of the Pipeline I want to turn the result back into a Dataframe.

The problem is that when a numpy array is created from mixed types, all data is converted to dtype "object". As a result, when I create a new dataframe at the end, every column has dtype "object" and is treated as categorical.

Example:

Dataframe with mixed data

>>> import pandas as pd
>>> dataframe = pd.DataFrame({"num": [1, 2, 3], "cat": ["a", "b", "c"]})

>>> dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   num     3 non-null      int64 
 1   cat     3 non-null      object
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes

To numpy array

>>> array = dataframe.to_numpy()
>>> array

array([[1, 'a'],
       [2, 'b'],
       [3, 'c']], dtype=object)

Back to dataframe

>>> new_df = pd.DataFrame(array, columns = ["num", "cat"])

>>> new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   num     3 non-null      object
 1   cat     3 non-null      object
dtypes: object(2)
memory usage: 176.0+ bytes

Now both columns have dtype object, so they are treated as categorical.

Is there a way to make pandas recognize the true data types inside the numpy array?

Answer 1

If you are using pandas >= 1.0, there's convert_dtypes:

>>> new_df = pd.DataFrame(array, columns = ["num", "cat"]).convert_dtypes()
>>> new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   num     3 non-null      Int64 
 1   cat     3 non-null      string
dtypes: Int64(1), string(1)
memory usage: 179.0 bytes
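
Note that the result above shows Int64 (capital "I") and string, which are pandas' nullable extension dtypes rather than the original numpy int64 and object. The following is a minimal sketch, not part of the original answer, of how one might cast a column back to a numpy dtype if downstream code expects it. It assumes the column contains no missing values, since the cast to int64 fails otherwise.

```python
import numpy as np
import pandas as pd

# Rebuild the object array from the question.
array = np.array([[1, "a"], [2, "b"], [3, "c"]], dtype=object)

converted = pd.DataFrame(array, columns=["num", "cat"]).convert_dtypes()
dtype_after_convert = str(converted["num"].dtype)  # nullable extension dtype
print(converted.dtypes)

# Cast back to the numpy-backed dtype.
# Assumption: "num" has no missing values; pd.NA cannot be stored in int64.
converted["num"] = converted["num"].astype("int64")
print(converted.dtypes)
```

Whether the cast is needed depends on what consumes the dataframe; many pandas operations work with the nullable dtypes directly.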

Answer 2

You can use infer_objects() as well:

new_df = pd.DataFrame(array, columns = ["num", "cat"]).infer_objects()
print(new_df,'\n\n',new_df.dtypes)

  num cat
0    1   a
1    2   b
2    3   c 

num     int64
cat    object
dtype: object
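
If the original dataframe is still available when the new one is built, another option is to save its dtypes and reapply them with astype instead of relying on inference. This is a sketch under that assumption and is not taken from either answer; the float values are illustrative and were chosen to match the float64 case described in the question.

```python
import pandas as pd

# Illustrative data: a float64 column and a string column.
dataframe = pd.DataFrame({"num": [1.5, 2.0, 3.25], "cat": ["a", "b", "c"]})

# Remember each column's dtype before leaving pandas.
original_dtypes = dataframe.dtypes.to_dict()

# Mixed columns collapse to a single object-dtype array.
array = dataframe.to_numpy()

# Rebuild the dataframe and cast every column back to its saved dtype.
restored = pd.DataFrame(array, columns=dataframe.columns).astype(original_dtypes)
print(restored.dtypes)
```

This only works when the columns coming out of the pipeline still correspond one-to-one to the original columns; if a step adds, drops, or reorders columns, the saved mapping no longer applies.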


Answer 1
2, accepted, 2020-04-21 14:36:46

Answer 2
2, 2020-04-21 14:37:10
