Create a mixed-type pandas DataFrame from a NumPy array of dtype "object"
I have a pandas DataFrame with mixed data types (float64 and strings). To use it in a sklearn Pipeline I need to convert it to a NumPy array. At the end of the Pipeline I want to turn the result back into a DataFrame.
The problem is that when a NumPy array is created from mixed types, all of the data is converted to dtype "object". As a result, when I create a new DataFrame at the end, every column has dtype object and is treated as categorical.
Example:
DataFrame with mixed data:
>>> import pandas as pd
>>> dataframe = pd.DataFrame({"num": [1, 2, 3], "cat": ["a", "b", "c"]})
>>> dataframe.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 num 3 non-null int64
1 cat 3 non-null object
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes
To NumPy array:
>>> array = dataframe.to_numpy()
>>> array
array([[1, 'a'],
[2, 'b'],
[3, 'c']], dtype=object)
Back to DataFrame:
>>> new_df = pd.DataFrame(array, columns = ["num", "cat"])
>>> new_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 num 3 non-null object
1 cat 3 non-null object
dtypes: object(2)
memory usage: 176.0+ bytes
Now both columns have dtype object, so both are treated as categorical.
Is there a way to make pandas recognize the true data types inside the NumPy array?
If you are using pandas >= 1.0, there's convert_dtypes():
>>> new_df = pd.DataFrame(array, columns = ["num", "cat"]).convert_dtypes()
>>> new_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 num 3 non-null Int64
1 cat 3 non-null string
dtypes: Int64(1), string(1)
memory usage: 179.0 bytes
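Note that the output above reports the extension dtypes Int64 and string rather than the original int64 and object. If the exact original dtypes are needed downstream, one option is to record them before the conversion and reapply them with astype(). The following is only a sketch: the variable names original_dtypes and restored are illustrative, and it assumes the pipeline keeps the same columns in the same order.

```python
import pandas as pd

# Mixed-type frame, as in the question (float64 and strings).
dataframe = pd.DataFrame({"num": [1.5, 2.0, 3.25], "cat": ["a", "b", "c"]})

# Remember each column's dtype before it is lost in the object array.
original_dtypes = dataframe.dtypes.to_dict()

# Round trip through NumPy: the whole array has dtype object.
array = dataframe.to_numpy()

# Rebuild the frame, then cast every column back to its recorded dtype.
# Assumption: the columns and their order are unchanged by the pipeline.
restored = pd.DataFrame(array, columns=dataframe.columns).astype(original_dtypes)

print(restored.dtypes)
```

This avoids any inference step, at the cost of having to carry the dtype mapping alongside the array.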
You can use infer_objects() as well:
new_df = pd.DataFrame(array, columns = ["num", "cat"]).infer_objects()
print(new_df,'\n\n',new_df.dtypes)
num cat
0 1 a
1 2 b
2 3 c
num int64
cat object
dtype: object
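Since the question mentions float64 data, here is a small sketch that applies both suggestions to an object array holding floats and strings and prints the resulting dtypes side by side. The sample values and the names raw, inferred and converted are made up for illustration. infer_objects() keeps plain NumPy dtypes, while convert_dtypes() generally prefers pandas' nullable extension dtypes, so the exact dtype names may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Object array similar to what to_numpy() returns for a mixed frame.
array = np.array([[1.5, "a"], [2.0, "b"], [3.25, "c"]], dtype=object)

raw = pd.DataFrame(array, columns=["num", "cat"])

# Soft conversion of object columns to plain NumPy dtypes where possible.
inferred = raw.infer_objects()

# Conversion to the "best possible" dtypes, usually nullable extension types.
converted = raw.convert_dtypes()

# Show the dtype of each column under each approach.
comparison = pd.DataFrame({
    "infer_objects": inferred.dtypes.astype(str),
    "convert_dtypes": converted.dtypes.astype(str),
})
print(comparison)
```

Which one fits better depends on whether the code that consumes the DataFrame expects NumPy dtypes or can handle the nullable ones.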