
Interpreting numpy.int64 datatype as native int datatype in Python on Windows x64

Background:

I ran into a problem executing code from a machine learning case. I've already solved the issue with an ugly workaround, so I am able to execute the notebook, but I still do not fully understand the cause of the issue.

The issue arises when I try to execute the following code, which is used to create dummy variables using OneHotEncoder from sklearn.

categorical_columns = ~np.in1d(train_X.dtypes, [int, float])

Although the code executes without any error, it fails to recognize numpy.int64 as an int datatype, so it classifies all int64 columns as categorical and passes them to the OneHotEncoder.

train_X is a pandas DataFrame with the following columns and datatypes. As you can see, the integers are stored as numpy.int64.

[Image: dataframe]

The code was originally written in Jupyter Notebook on a Mac, where it worked fine, and it also ran fine in Colaboratory on the Google cloud. Everyone else who tried running the code from Jupyter on their almost identical Windows machines had the same issue as I did.

The Problem:

It seems that on Windows machines, numpy.int64 is not linked to the native int datatype.

Things I've tried and verified

  1. Although dated and based on Python 2.7.x, this post made me believe it was a version issue, so I verified:
    • My machine is running a 64-bit version of Windows 10
    • Python is installed as 64-bit
    • Anaconda is also installed as 64-bit
    • I used a clean environment with just pandas, numpy, sklearn and their dependencies, all updated to their latest versions
    • When I run python I get the following:

[Image: terminal]

I noticed the strange "on win32" here, but it seems to be merely a product of the "infinite wisdom of Microsoft", according to post 1 and post 2.

  2. I tried to understand the issue by reading 1 , 2 and 3 . I've managed to come up with several workarounds based on these, but I still do not understand why the code works on one system but not on another.

Question:

Why does numpy.int64 not translate into a native int datatype on Windows, even though everything is running 64-bit, when it does on Mac and other systems?

I don't have an answer as to why the default int on 64-bit Windows is int32, but it is a very confusing fact:

np.dtype('int') returns dtype('int32') on 64 bit Windows and dtype('int64') on 64 bit Linux.

See also the second warning here and this NumPy GitHub issue.
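To see which case applies on a given machine, a small sketch like the following may help. It only inspects what the builtin int maps to and then shows a width-independent check with np.issubdtype; the variable names are illustrative and not from the original code. As far as I know, NumPy 2.0 and later use a 64-bit default integer on 64-bit Windows as well, so the result also depends on the installed NumPy version.

```python
import numpy as np

# What the Python builtin `int` maps to on this platform / NumPy version.
default_int = np.dtype(int)
print(default_int)  # int32 or int64, depending on the setup

# Comparing a dtype with `int` is effectively a comparison with np.dtype(int),
# so this is False wherever the default integer is 32-bit.
int64_matches_builtin = np.dtype("int64") == default_int
print(int64_matches_builtin)

# A check that does not depend on the default integer width.
int64_is_integer = np.issubdtype(np.dtype("int64"), np.integer)
int32_is_integer = np.issubdtype(np.dtype("int32"), np.integer)
print(int64_is_integer, int32_is_integer)
```

If the first line prints int32, the original np.in1d check cannot match int64 columns, which is consistent with the behaviour described in the question.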

In your concrete case I'd use pandas' is_numeric_dtype function to check numeric-ness in a platform-independent and straightforward way:

from pandas.api.types import is_numeric_dtype
categorical_columns = ~train_X.dtypes.apply(is_numeric_dtype).to_numpy()
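A self-contained sketch of the same approach, for anyone who wants to try it without the original data. The small train_X below is a hypothetical stand-in (its column names and values are made up); the select_dtypes line is an alternative I'd expect to give the same column names, not something from the original notebook.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in for the real train_X from the question.
train_X = pd.DataFrame({
    "age": np.array([25, 32, 47], dtype=np.int64),
    "income": np.array([1.5, 2.0, 3.25], dtype=np.float64),
    "city": ["Oslo", "Lima", "Pune"],
})

# Boolean mask: True for every column that is not numeric.
categorical_columns = ~train_X.dtypes.apply(is_numeric_dtype).to_numpy()
print(categorical_columns)  # [False False  True]

# Column names selected by the mask.
categorical_names = train_X.columns[categorical_columns].tolist()
print(categorical_names)  # ['city']

# Alternative: let pandas select the non-numeric columns directly.
alt_names = train_X.select_dtypes(exclude="number").columns.tolist()
print(alt_names)  # ['city']
```

Both variants look at whether a dtype is numeric rather than comparing it with the builtin int, so the int64 column is treated the same regardless of the platform's default integer width.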
