Pandas: convert dtype 'object' to int

I've read an SQL query into Pandas and the values are coming in as dtype 'object', although they are strings, dates and integers. I am able to convert the date 'object' to a Pandas datetime dtype, but I get an error when trying to convert the strings and integers.

Here is an example:

>>> import pandas as pd
>>> df = pd.read_sql_query('select * from my_table', conn)
>>> df
    id    date          purchase
 1  abc1  2016-05-22    1
 2  abc2  2016-05-29    0
 3  abc3  2016-05-22    2
 4  abc4  2016-05-22    0

>>> df.dtypes
 id          object
 date        object
 purchase    object
 dtype: object

Converting df['date'] to a datetime works:

>>> pd.to_datetime(df['date'])
 1  2016-05-22
 2  2016-05-29
 3  2016-05-22
 4  2016-05-22
 Name: date, dtype: datetime64[ns] 

But I get an error when trying to convert df['purchase'] to an integer:

>>> df['purchase'].astype(int)
 ....
 pandas/lib.pyx in pandas.lib.astype_intsafe (pandas/lib.c:16667)()
 pandas/src/util.pxd in util.set_value_at (pandas/lib.c:67540)()

 TypeError: long() argument must be a string or a number, not 'java.lang.Long'

NOTE: I get a similar error when I try .astype('float').

And when trying to convert to a string, nothing seems to happen:

>>> df['id'].apply(str)
 1 abc1
 2 abc2
 3 abc3
 4 abc4
 Name: id, dtype: object

Documenting the answer that worked for me, based on the comment by @piRSquared.

I needed to convert to a string first, then to an integer.

>>> df['purchase'].astype(str).astype(int)
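A minimal sketch of why the string round trip helps, assuming the column holds wrapper objects that pandas cannot cast directly. `FakeLong` is a hypothetical stand-in for the `java.lang.Long` values in the question; it is not part of pandas or of any database driver.

```python
import pandas as pd


class FakeLong:
    """Hypothetical stand-in for a driver-specific integer wrapper."""

    def __init__(self, value):
        self._value = value

    def __str__(self):
        return str(self._value)


# An object column whose elements are wrappers, not Python ints.
purchase = pd.Series([FakeLong(1), FakeLong(0), FakeLong(2)], dtype=object)

# str() works on every element, and the digit strings can then be parsed as ints.
as_int = purchase.astype(str).astype(int)
print(as_int.tolist())
```

The intermediate strings only need to look like integers; the approach does not depend on what type the driver returned.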

pandas >= 1.0

convert_dtypes

The (self-)accepted answer doesn't take into consideration the possibility of NaNs in object columns.

import numpy as np
import pandas as pd

df = pd.DataFrame({
     'a': [1, 2, np.nan],
     'b': [True, False, np.nan]}, dtype=object)
df

     a      b
0    1   True
1    2  False
2  NaN    NaN

df['a'].astype(str).astype(int) # raises ValueError

This chokes because the NaN is converted to the string "nan", and any further attempt to coerce it to an integer fails. To avoid this issue, we can soft-convert columns to their corresponding nullable type using convert_dtypes:

df.convert_dtypes()                                                        

      a      b
0     1   True
1     2  False
2  <NA>   <NA>

df.convert_dtypes().dtypes                                                 

a      Int64
b    boolean
dtype: object

If your data has junk text mixed in with your ints, you can use pd.to_numeric as an initial step:

s = pd.Series(['1', '2', '...'])
s.convert_dtypes()  # converts to string, which is not what we want

0      1
1      2
2    ...
dtype: string 

# coerces non-numeric junk to NaNs
pd.to_numeric(s, errors='coerce')

0    1.0
1    2.0
2    NaN
dtype: float64

# one final `convert_dtypes` call to convert to nullable int
pd.to_numeric(s, errors='coerce').convert_dtypes() 

0       1
1       2
2    <NA>
dtype: Int64
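The two steps above can be wrapped up for several columns at once. This is only a sketch: the helper name `to_nullable_numeric` and the sample frame are made up for illustration.

```python
import pandas as pd


def to_nullable_numeric(frame, columns):
    """Coerce the given columns to numbers, then to nullable dtypes."""
    out = frame.copy()
    for col in columns:
        # Junk text becomes NaN first; convert_dtypes then picks Int64 where it can.
        out[col] = pd.to_numeric(out[col], errors="coerce").convert_dtypes()
    return out


raw = pd.DataFrame({"qty": ["1", "2", "..."], "label": ["x", "y", "z"]}, dtype=object)
clean = to_nullable_numeric(raw, ["qty"])
print(clean.dtypes)
```

Columns not listed are left untouched, so text columns such as `label` are not turned into NaNs by accident.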

It's simple:

pd.factorize(df.purchase)[0]

Example:

labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
labels
# array([0, 0, 1, 2, 0])
uniques
# array(['b', 'a', 'c'], dtype=object)
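One caveat worth checking before using this on a quantity column: factorize returns the position of each value among the unique values, not the number the string spells out. A short sketch with made-up data:

```python
import pandas as pd

purchase = pd.Series(["1", "0", "2", "0"], dtype=object)

# Codes index into `uniques` in order of first appearance.
codes, uniques = pd.factorize(purchase)
print(codes.tolist())
print(list(uniques))

# Parsing the strings keeps the actual quantities.
values = purchase.astype(int)
print(values.tolist())
```

So factorize suits label encoding of categories, while astype or pd.to_numeric is the better fit when the strings are real numbers.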

My training data contains three features of dtype object. Applying astype converts an object column into a numeric one, but before that you need to perform some preprocessing steps:

train.dtypes

C12       object
C13       object
C14       object

train['C14'] = train.C14.astype(int)

train.dtypes

C12       object
C13       object
C14       int32

Follow these steps:

1. Clean your file: open your data file in CSV format, check whether there are "?" characters in place of empty values, and delete all of them.

2. Drop the rows containing missing values, e.g.:

df.dropna(subset=["normalized-losses"], axis = 0 , inplace= True)

3. Now use astype for the conversion:

df["normalized-losses"]=df["normalized-losses"].astype(int)

Note: If you still find errors in your program, inspect your CSV file again. Open it in Excel to check whether there is a "?" in the required column, then delete it, save the file, and go back and run your program.
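As an alternative to editing the file by hand, the same three steps can be sketched in pandas itself. The CSV content below is made up; `na_values` is the `read_csv` parameter that marks extra strings as missing.

```python
import io

import pandas as pd

# Hypothetical CSV content; "?" marks a missing value.
csv_text = "make,normalized-losses\naudi,164\nbmw,?\nvolvo,95\n"

# Step 1: read "?" as NaN instead of deleting it manually.
df = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Step 2: drop the rows where the column is missing.
df = df.dropna(subset=["normalized-losses"])

# Step 3: the remaining values can now be cast to int.
df["normalized-losses"] = df["normalized-losses"].astype(int)
print(df)
```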

Comment "success!" if it works. :)

I cannot comment, so I am posting this as an answer. It sits somewhere between @piRSquared's / @cyril's solution and @cs95's:

As noted by @cs95, if your data contains NaNs or Nones, converting to string type first will cause an error when you then try to convert to int.

However, if your data consists of (numerical) strings, convert_dtypes will convert it to string type unless you use pd.to_numeric as suggested by @cs95 (potentially combined with df.apply()).

If your data consists only of numerical strings (including NaNs or Nones, but without any non-numeric "junk"), a possibly simpler alternative is to convert first to float and then to one of the nullable-integer extension dtypes provided by pandas (already present in version 0.24) (see also this answer):

df['purchase'].astype(float).astype('Int64')

Note that there has been recent discussion of this on GitHub (currently a closed, though unresolved, issue), and that for very long 64-bit integers you may have to convert explicitly to float128 to avoid approximations during the conversion.
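A small sketch of both points, using made-up data: the float step lets missing values through, and it rounds integers that do not fit in float64's 53-bit significand.

```python
import numpy as np
import pandas as pd

# Numeric strings plus a missing value, stored as object.
purchase = pd.Series(["1", "0", np.nan, "2"], dtype=object)
nullable = purchase.astype(float).astype("Int64")
print(nullable)

# Caveat: 2**53 + 1 cannot be represented exactly as a float64,
# so the round trip through float changes the value.
big = pd.Series([str(2**53 + 1)], dtype=object)
rounded = big.astype(float).astype("Int64")
print(int(rounded[0]) == 2**53 + 1)
```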

df['col_name'] = pd.to_numeric(df['col_name'])

This is a better option.

In my case, I had a df with mixed data:

df:
                     0   1   2    ...                  242                  243                  244
0   2020-04-22T04:00:00Z   0   0  ...          3,094,409.5         13,220,425.7          5,449,201.1
1   2020-04-22T06:00:00Z   0   0  ...          3,716,941.5          8,452,012.9          6,541,599.9
....

The floats are actually objects, but I need them to be real floats.

To fix it, referencing @AMC's comment above:

def coerce_to_float(val):
    try:
       return float(val)
    except ValueError:
       return val

df = df.applymap(coerce_to_float)
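Note that values printed like 3,094,409.5 contain thousands separators, and float() raises ValueError on such strings, so they would come back unchanged. A hedged sketch for that case, assuming the cells are strings and using a made-up two-column frame:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "0": ["2020-04-22T04:00:00Z", "2020-04-22T06:00:00Z"],
        "242": ["3,094,409.5", "3,716,941.5"],
    },
    dtype=object,
)

# Strip the separators, then parse; cells that still cannot be parsed become NaN.
df["242"] = pd.to_numeric(df["242"].str.replace(",", "", regex=False), errors="coerce")
print(df.dtypes)
```

When the data comes from a CSV file, passing thousands="," to pd.read_csv is another way to get the same result at load time.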

To change the data type and keep the change, assign the converted column back to the data frame:

ds["cat"] = pd.to_numeric(ds["cat"])

or

ds["cat"] = ds["cat"].astype(int)

If all of these methods fail, you can try a list comprehension like this:

df["int_column"] = [int(x) if x.isnumeric() else x for x in df["str_column"] ]

Use the astype function to convert the dtype of that column.

I am very new to programming and have started an AIML course. I was given a project to complete as part of the course, and this is where I am stuck. Could anyone suggest some tips so I can continue my project?

My doubt:

There are some categorical columns in my dataset: Sex, Region, Smoker. To convert them to integer form I used cat_df[].value_counts(). When I printed isnull(), the output showed the presence of null values. When I called cat_df[].head(), it gave the first five rows of that category. But when I call barplot or distplot, I get an error.

My doubt: has the category column been changed to integer or not? If not, why did print(df[].value_counts()) show dtype: int64?

Can anyone please suggest a solution?

Thanks in advance. Ramya.
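A possible explanation, sketched with made-up rows that reuse the column names above: value_counts() returns a new Series of counts (hence the int64) and leaves the original column as it was. Going through the category dtype is one way to get integer codes.

```python
import pandas as pd

# Hypothetical rows using the column names from the question.
cat_df = pd.DataFrame({"Sex": ["male", "female", "female"],
                       "Smoker": ["yes", "no", "no"]})

counts = cat_df["Smoker"].value_counts()
print(counts.dtype)            # dtype of the counts
print(cat_df["Smoker"].dtype)  # the column itself is unchanged

# One way to get integer codes: convert to category and take the codes.
cat_df["Smoker_code"] = cat_df["Smoker"].astype("category").cat.codes
print(cat_df)
```

The codes follow the sorted category order, so here "no" maps to 0 and "yes" to 1.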

This was my data:

## list of columns 
l1 = ['PM2.5', 'PM10', 'TEMP', 'BP', ' RH', 'WS','CO', 'O3', 'Nox', 'SO2'] 

for i in l1:
    for j in range(0, 8431):  # rows = 8431
        # .at avoids chained assignment (df[i][j] = ...), which may write to a copy
        df.at[j, i] = int(df.at[j, i])

I recommend using this only with small data. The code converts one cell at a time in a Python loop, so its cost grows with the number of rows times the number of columns.
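For larger data, a vectorized cast is usually the more practical route. A sketch with a made-up frame that reuses two of the column names above, assuming every cell is a clean integer string:

```python
import pandas as pd

# Hypothetical small frame; the real data has more columns and 8431 rows.
df = pd.DataFrame({"PM2.5": ["12", "30"], "TEMP": ["25", "27"]}, dtype=object)
l1 = ["PM2.5", "TEMP"]

# One cast per column instead of a Python-level loop over every cell.
df = df.astype({col: int for col in l1})
print(df.dtypes)
```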

Converting object to numerical int or float.

The code is:

df["total_sqft"] = pd.to_numeric(df["total_sqft"], errors='coerce').fillna(0, downcast='infer')

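Notice: The technical posts on this site are licensed under CC BY-SA 4.0. If you repost them, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.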

 