Pandas says every column is an object, even though I think it's an integer

Question

I have a data frame that is somehow all objects - which I think should be okay. Notice that the first column has values like "10180.".

Update (problem solved): There was some kind of odd Unicode issue going on. My task lead solved it: we read the file in directly as Excel instead of converting it to CSV first (I had been using LibreOffice for the conversion). A big hint was that all these things that "should" work were not working.

Those should all be "10180" - no decimal. (Note that in Jupyter it displays correctly. It only shows up as a decimal when I output as csv. However, Jupyter does know that it's an object.)

Another problem is potentially the data values that look like "2,361.9". Those should be floats. I thought I could do a similar thing with those to get rid of the commas and then convert.

Sample data:

CBSA Code,CBSA Title,violent,murder,rape,robbery,assault,property,burglary,larceny,vehicle theft
10180.0,"Abilene, TX",393.2,5.3,64.0,65.7,258.2,"2,361.9",534.0,"1,670.0",157.8
10420.0,"Akron, OH",361.6,6.4,48.7,73.0,233.6,"2,226.0",415.6,"1,659.4",150.9
10500.0,"Albany, GA",728.5,11.6,30.6,95.1,591.3,"3,734.5",773.4,"2,715.1",246.0
10580.0,"Albany-Schenectady-Troy, NY",283.7,2.2,38.3,62.4,180.8,"1,892.3",226.9,"1,584.8",80.6

That first column should be integer. I've tried

df['CBSA Code'].apply(np.int64)  AND

df['CBSA Code'].astype(int) AND

df['CBSA Code'].astype(str).astype(int) AND

df['CBSA Code'] = df['CBSA Code'].astype(str)
df['CBSA Code'] = df['CBSA Code'].replace(".0", '')
df['CBSA Code'] = df['CBSA Code'].astype('int')

I've seen some of these posted as answers to other Stack Overflow questions, but they're not working for me. This must be a common dilemma. Is there a canonical way of doing this?

The error message from df['CBSA Code'].apply(np.int64) follows:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-189-6c1c6381a02c> in <module>
----> 1 df['CBSA Code'].apply(np.int64)

~\AppData\Roaming\Python\Python37\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   3589             else:
   3590                 values = self.astype(object).values
-> 3591                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   3592 
   3593         if len(mapped) and isinstance(mapped[0], Series):

pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()

ValueError: invalid literal for int() with base 10: '10180.0'

Answer 1

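If the problem is that the CBSA Code column is a float formatted as a string (which the error message suggests: ValueError: invalid literal for int() with base 10: '10180.0'), then you can't convert it directly to int, but you can convert it to float first and then to int: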

df["CBSA Code"] = df["CBSA Code"].astype(float).astype(int)
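
As a minimal sketch of this approach, assuming the column really does hold strings such as '10180.0': the small DataFrame below is built by hand from the question's sample values purely for illustration, and it is not the asker's actual file.

```python
import pandas as pd

# Hand-built frame mimicking the question's column: floats stored as strings.
df = pd.DataFrame({"CBSA Code": ["10180.0", "10420.0", "10500.0"]})

# int("10180.0") raises ValueError, so parse as float first, then cast to int.
df["CBSA Code"] = df["CBSA Code"].astype(float).astype(int)

codes = df["CBSA Code"].tolist()
print(codes)
```

Note that this only works if every value parses as a number and none are missing; a single blank or stray non-numeric string would make astype(float) or astype(int) raise.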

Answer 2

I suspect CBSA Code has some non-numeric values, so read_csv defaults it to dtype object. You may try using the nullable integer dtype Int64 (note: it is an uppercase 'I'):

df['CBSA Code'] = pd.to_numeric(df['CBSA Code'], errors='coerce').astype('Int64')
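
Here is a rough sketch of how that might behave, assuming a hand-built frame based on the question's sample values. The 'n/a' entry is a hypothetical stand-in for whatever non-numeric value might be in the real data. The last step is not part of this answer: it is one possible way to handle the comma-formatted columns the question mentions, by stripping the thousands separators before converting.

```python
import pandas as pd

# Hand-built sample; "n/a" is a hypothetical bad value added for illustration.
df = pd.DataFrame({
    "CBSA Code": ["10180.0", "10420.0", "n/a"],
    "property": ["2,361.9", "2,226.0", "3,734.5"],
})

# Unparseable strings become NaN, and the nullable Int64 dtype keeps them as <NA>
# instead of failing the way a plain int cast would.
df["CBSA Code"] = pd.to_numeric(df["CBSA Code"], errors="coerce").astype("Int64")

# Remove thousands separators as literal text, then convert to float.
df["property"] = pd.to_numeric(df["property"].str.replace(",", "", regex=False))

print(df.dtypes)
print(df)
```

If the data comes straight from read_csv, passing thousands=',' when reading may avoid the string clean-up for the comma-formatted columns altogether.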
