Create a mixed-type pandas DataFrame from a numpy array of type "object"

I have a pandas DataFrame with mixed data types (float64 and strings). To use it in a sklearn Pipeline, I need to convert it to a numpy array. At the end of the Pipeline I want to turn the result back into a DataFrame.

The problem is that a numpy array built from mixed types stores all of its data with dtype "object". As a result, when I create a new DataFrame at the end, every column ends up with dtype object, so all of the data looks categorical.

Example:

DataFrame with mixed data

>>> import pandas as pd
>>> dataframe = pd.DataFrame({"num": [1, 2, 3], "cat": ["a", "b", "c"]})

>>> dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   num     3 non-null      int64 
 1   cat     3 non-null      object
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes

To numpy array

>>> array = dataframe.to_numpy()
>>> array

array([[1, 'a'],
       [2, 'b'],
       [3, 'c']], dtype=object)

Back to DataFrame

>>> new_df = pd.DataFrame(array, columns = ["num", "cat"])

>>> new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   num     3 non-null      object
 1   cat     3 non-null      object
dtypes: object(2)
memory usage: 176.0+ bytes

Now both columns have dtype object, so they are treated as categorical.
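As an aside to the question, the short check below suggests that the values themselves survive the round trip and only the column dtype is lost. It is a sketch that reuses the variable names from the example; building the frame from a dict is an assumption about what the example intended.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the example frame: one integer column, one string column.
dataframe = pd.DataFrame({"num": [1, 2, 3], "cat": ["a", "b", "c"]})
array = dataframe.to_numpy()
new_df = pd.DataFrame(array, columns=["num", "cat"])

# One array needs a single dtype for ints and strings, so it falls back to object.
print(array.dtype)

# Each cell is still an integer or a string, not a stringified copy.
print([type(v).__name__ for v in new_df["num"]])
print([type(v).__name__ for v in new_df["cat"]])
```

Because the cells keep their original types, pandas has enough information to infer a better dtype per column when asked to.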

Is there a way to make pandas recognize the true data types inside the numpy array?

If you are using pandas >= 1.0, there's convert_dtypes:

>>> new_df = pd.DataFrame(array, columns = ["num", "cat"]).convert_dtypes()
>>> new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   num     3 non-null      Int64 
 1   cat     3 non-null      string
dtypes: Int64(1), string(1)
memory usage: 179.0 bytes
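Note that the capitalized `Int64` and `string` in the output above are pandas extension dtypes, not plain NumPy dtypes. The sketch below shows one way to check this and to cast back if a later step expects a NumPy-backed column. The object array is built by hand here to keep the snippet self-contained, and the name `num_numpy` is purely illustrative.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the object array that comes out of the pipeline.
array = np.array([[1, "a"], [2, "b"], [3, "c"]], dtype=object)
new_df = pd.DataFrame(array, columns=["num", "cat"]).convert_dtypes()

# convert_dtypes picks pandas extension dtypes rather than plain NumPy ones.
print(pd.api.types.is_extension_array_dtype(new_df["num"].dtype))

# If a later step expects a NumPy-backed integer column, cast it explicitly.
# This cast assumes the column contains no missing values.
num_numpy = new_df["num"].astype("int64")
print(num_numpy.dtype)
```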

You can use infer_objects() as well:

new_df = pd.DataFrame(array, columns = ["num", "cat"]).infer_objects()
print(new_df,'\n\n',new_df.dtypes)

  num cat
0    1   a
1    2   b
2    3   c 

num     int64
cat    object
dtype: object
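Both answers rely on inference. A different approach, not taken from the answers above, is to record the original dtypes before converting to numpy and cast back afterwards. The sketch below assumes the pipeline keeps the same columns in the same order and that the values stay castable; `original_dtypes` and `restored` are hypothetical names.

```python
import pandas as pd

# Assumed reconstruction of the example frame from the question.
dataframe = pd.DataFrame({"num": [1, 2, 3], "cat": ["a", "b", "c"]})

# Remember the schema before handing the data to NumPy.
original_dtypes = dataframe.dtypes.to_dict()

array = dataframe.to_numpy()

# Rebuild the frame and cast each column back to its recorded dtype.
restored = pd.DataFrame(array, columns=dataframe.columns).astype(original_dtypes)
print(restored.dtypes)
```

This restores exactly the dtypes the frame started with, at the cost of failing with an error if a pipeline step has changed a column so that the cast is no longer possible.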
