
What is the purpose of floating point index in Pandas?

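import pandas as pd

# Series values reconstructed from the output printed below
s = pd.Series([141.125, 142.250, 143.375, 143.375, 144.500, 145.125])
s.index = [0.0, 1.1, 2.2, 3.3, 4.4, 5.5]
s.index
# Float64Index([0.0, 1.1, 2.2, 3.3, 4.4, 5.5], dtype='float64')
s
# 0.0    141.125
# 1.1    142.250
# 2.2    143.375
# 3.3    143.375
# 4.4    144.500
# 5.5    145.125
# Cast the index labels to 32-bit floats, then inspect the index again
s.index = s.index.astype('float32')
s.index
# Float64Index([              0.0, 1.100000023841858, 2.200000047683716,
#               3.299999952316284, 4.400000095367432,               5.5],
#              dtype='float64')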

What's the intuition behind floating point indices? I'm struggling to understand when we would use them instead of int indices (it seems like you can have three types of indices: int64, float64, or object, e.g. s.index=['a','b','c','d','e','f']).

From the code above, it also looks like Pandas really wants float indices to be 64-bit: the 64-bit floats are cast to 32-bit floats and then back to 64-bit floats, picking up rounding error along the way, while the dtype of the index remains 'float64'.
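The round trip described above can be reproduced with NumPy alone. This is only a minimal sketch of the casting behaviour, not of what Pandas does internally; the variable names are illustrative.

```python
import numpy as np

# The same labels as in the question, stored as 64-bit floats
labels = np.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5])

# Cast down to 32-bit and back up to 64-bit
round_trip = labels.astype(np.float32).astype(np.float64)

# 0.0 and 5.5 are exactly representable in 32 bits, so they survive unchanged;
# the other labels pick up a small rounding error
survived = (labels == round_trip).tolist()
print(survived)                      # [True, False, False, False, False, True]
print(float(round_trip[1]) - 1.1)    # a small non-zero difference
```

The labels that change are no longer equal to the values originally typed in, so a lookup with the literal 1.1 would no longer match.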

How do people use float indices?

Is the idea that you might have some statistical calculation over data and want to rank on the result of it, but those results may be floats? And we want to force float64 to avoid losing resolution?

Float indices are generally useless for label-based indexing because of the general limitations of floating point arithmetic. Of course, pd.Float64Index is there in the API for completeness, but that doesn't always mean you should use it. Jeff (a core library contributor) has this to say on GitHub:

[...] It is rarely necessary to actually use a float index; you are often better off served by using a column. The point of the index is to make individual elements faster, eg df[1.0], but this is quite tricky; this is the reason for having an issue about this.

The tricky part is that a value which looks like 1.0 is not always equal to 1.0: it depends on how that value was computed and how it is represented in bits.
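A small sketch of how this bites label-based lookup. The series and its labels are made up for illustration; the tolerance-based match at the end is one possible workaround, not something the quoted comment recommends.

```python
import numpy as np
import pandas as pd

# Hypothetical series with float labels
s = pd.Series([10, 20, 30], index=[0.1, 0.2, 0.3])

computed = 0.1 + 0.2              # prints as 0.30000000000000004
print(computed == 0.3)            # False
print(0.3 in s.index)             # True: the literal label is present
print(computed in s.index)        # False: the computed label is not

try:
    s.loc[computed]
    lookup_failed = False
except KeyError:
    lookup_failed = True          # exact label lookup misses by one rounding error

# Workaround sketch: match labels within a tolerance instead of exactly
matches = s[np.isclose(s.index.to_numpy(), computed)]
print(matches.tolist())           # [30]
```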

Float indices are useful in a few situations (as cited in the GitHub issue), mainly for recording a temporal axis (time) or extremely fine-grained measurements, for example in astronomical data. For most other cases there's pd.cut or pd.qcut for binning your data, because working with categorical data is usually easier than working with continuous data.
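As a rough sketch of the binning alternative, the float labels can be kept in an ordinary column and grouped with pd.cut. The column names, bin edges and bin labels here are assumptions chosen for illustration; the numbers reuse the series from the question.

```python
import pandas as pd

# Keep the float values in a column rather than in the index
df = pd.DataFrame({
    "measurement": [0.0, 1.1, 2.2, 3.3, 4.4, 5.5],
    "value": [141.125, 142.250, 143.375, 143.375, 144.500, 145.125],
})

# Bin the continuous column into three labelled intervals:
# [0, 2], (2, 4], (4, 6]
df["bin"] = pd.cut(
    df["measurement"],
    bins=[0, 2, 4, 6],
    labels=["low", "mid", "high"],
    include_lowest=True,
)

# Group on the categorical bin instead of looking up exact float labels
means = df.groupby("bin", observed=True)["value"].mean()
print(means)
```

Grouping or selecting on the bin label avoids exact float comparisons entirely. pd.qcut works the same way but picks the edges from quantiles of the data.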
