Pandas says every column is an object, even though I think it's an integer
I have a data frame in which every column is somehow dtype object, which I think should be okay. Notice that the first column has values like "10180.".
Problem solved: There was some kind of weird Unicode thing going on. My task lead solved the problem. We just read the file in directly as Excel instead of converting it to CSV first (I was using LibreOffice to do that). A big hint was all these things that "should" work that were not working.
Those should all be "10180" - no decimal. (Note that in Jupyter it displays correctly. It only shows up as a decimal when I output as CSV. However, Jupyter does know that it's an object.)
Another potential problem is the data values that look like "2,361.9". Those should be floats. I thought I could do a similar thing with those: get rid of the commas and then convert.
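For the comma-formatted values, here is a minimal sketch of two ways this is commonly handled, assuming the data loads cleanly as ordinary text. The inline `csv_text` is a trimmed copy of the sample rows used only for illustration, and the variable names are made up for this example.

```python
import io
import pandas as pd

# Trimmed copy of the sample data, kept inline so the example is self-contained.
csv_text = '''CBSA Code,CBSA Title,violent,property
10180.0,"Abilene, TX",393.2,"2,361.9"
10420.0,"Akron, OH",361.6,"2,226.0"
'''

# Option 1: ask read_csv to treat "," as a thousands separator while parsing.
df = pd.read_csv(io.StringIO(csv_text), thousands=",")

# Option 2: the column is already loaded as strings, so strip the commas
# explicitly and then convert to float.
raw = pd.read_csv(io.StringIO(csv_text), dtype=str)
property_as_float = (
    raw["property"].str.replace(",", "", regex=False).astype(float)
)
```

Option 1 is usually the least effort when the file can be re-read; option 2 is closer to the "get rid of the commas and then convert" idea described above.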
Sample data:
CBSA Code,CBSA Title,violent,murder,rape,robbery,assault,property,burglary,larceny,vehicle theft
10180.0,"Abilene, TX",393.2,5.3,64.0,65.7,258.2,"2,361.9",534.0,"1,670.0",157.8
10420.0,"Akron, OH",361.6,6.4,48.7,73.0,233.6,"2,226.0",415.6,"1,659.4",150.9
10500.0,"Albany, GA",728.5,11.6,30.6,95.1,591.3,"3,734.5",773.4,"2,715.1",246.0
10580.0,"Albany-Schenectady-Troy, NY",283.7,2.2,38.3,62.4,180.8,"1,892.3",226.9,"1,584.8",80.6
That first column should be integer. I've tried
df['CBSA Code'].apply(np.int64) AND
df['CBSA Code'].astype(int) AND
df['CBSA Code'].astype(str).astype(int) AND
df['CBSA Code'] = df['CBSA Code'].astype(str)
df['CBSA Code'] = df['CBSA Code'].replace(".0", '')
df['CBSA Code'] = df['CBSA Code'].astype('int')
I've seen some of these posted as answers in other Stack Overflow questions, but they're not working for me. This must be a common dilemma. Is there a canonical way of doing this?
The error message from df['CBSA Code'].apply(np.int64) follows:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-189-6c1c6381a02c> in <module>
----> 1 df['CBSA Code'].apply(np.int64)
~\AppData\Roaming\Python\Python37\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
3589 else:
3590 values = self.astype(object).values
-> 3591 mapped = lib.map_infer(values, f, convert=convert_dtype)
3592
3593 if len(mapped) and isinstance(mapped[0], Series):
pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()
ValueError: invalid literal for int() with base 10: '10180.0'
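If the problem is that the CBSA Code column is a float formatted as a string (which is what the error message suggests: ValueError: invalid literal for int() with base 10: '10180.0'), then you can't convert it directly to int, but you can convert it to float first and then to int: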
df["CBSA Code"] = df["CBSA Code"].astype(float).astype(int)
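As a small illustration of why the two-step conversion helps, the sketch below uses a hand-built Series of strings (an assumption standing in for the real column) and shows that `int()` rejects `'10180.0'` while going through float first succeeds.

```python
import pandas as pd

# Hypothetical stand-in for the real column: floats stored as strings.
codes = pd.Series(["10180.0", "10420.0", "10500.0"], name="CBSA Code")

# int("10180.0") raises ValueError, the same failure shown in the traceback.
try:
    int(codes.iloc[0])
    direct_conversion_failed = False
except ValueError:
    direct_conversion_failed = True

# Parsing as float first accepts the trailing ".0"; the second cast drops it.
as_int = codes.astype(float).astype(int)
```

Note that this only works if every value parses as a float and none is missing; a NaN cannot be cast to a plain integer dtype.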
I suspect CBSA Code has some non-numeric values, so read_csv defaults it to dtype object. You may try using the nullable integer dtype Int64 (note: it is an uppercase 'I'):
df['CBSA Code'] = pd.to_numeric(df['CBSA Code'], errors='coerce').astype('Int64')
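If there really are non-numeric values, it can help to see which ones were coerced. The following is a rough sketch on a made-up Series (the `"n/a"` entry is invented for illustration) that converts to `Int64` and then lists the original values that failed to parse.

```python
import pandas as pd

# Hypothetical column with one value that is not a number.
codes = pd.Series(["10180.0", "10420.0", "n/a"], name="CBSA Code")

# errors="coerce" turns anything unparseable into NaN instead of raising.
numeric = pd.to_numeric(codes, errors="coerce")

# The nullable Int64 dtype can hold the missing value as <NA>.
as_nullable_int = numeric.astype("Int64")

# Values that were present in the source but could not be parsed.
bad_rows = codes[numeric.isna() & codes.notna()]
```

Printing `bad_rows.map(repr)` is one way to expose stray or invisible characters of the kind mentioned in the question's update.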