
How does one determine which columns to set as an index in a Pandas DataFrame?

Let's say I have a DataFrame of financial securities, which often have multiple identifiers:

[image: enter image description here]

Should I choose only one column to set as the index? Should I set all potential identifiers as the index? Should I set all text data as an index, and leave all numeric data as columns? What is the best practice?

This is more about database design than pandas.

The decision should be based on the business meaning of the DataFrame (the equivalent of a table in a relational database) and its columns. For example, if 'Internal Security ID' is what the business uses to identify this kind of data, then it should be set as the index.
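As a rough sketch of what that looks like in practice: the column names other than 'Internal Security ID' and all of the values below are made up for illustration, since the original table is not available.

```python
import pandas as pd

# Hypothetical data: identifiers, names, and prices are invented for this example.
securities = pd.DataFrame(
    {
        "Internal Security ID": ["SEC001", "SEC002", "SEC003"],
        "Ticker": ["AAA", "BBB", "CCC"],
        "Name": ["Alpha Corp", "Beta Inc", "Gamma Ltd"],
        "Price": [10.5, 20.0, 30.25],
    }
)

# Use the business key as the index; the other identifiers stay as ordinary columns.
by_id = securities.set_index("Internal Security ID")

# Label-based lookup by the business key.
row = by_id.loc["SEC002"]
ticker = row["Ticker"]

# reset_index() moves the key back into a column and restores the default integer index.
restored = by_id.reset_index()
```

Because `reset_index()` undoes the choice, picking an index is cheap to revisit if the business key changes.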

However, if you are not sure, just stick with the default integer index.

I tend to stick with the default index unless I need one of the columns as an index. If you do, I strongly recommend using a column with unique values. Duplicates in the index will cause you a lot of headaches, because a lookup by label can then return several rows instead of one.
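Here is a small sketch of the duplicate problem and one way to guard against it. The `trades` frame and its values are hypothetical; `is_unique` and the `verify_integrity` argument of `set_index` are standard pandas.

```python
import pandas as pd

# Hypothetical data in which "Ticker" is not unique.
trades = pd.DataFrame(
    {
        "Ticker": ["AAA", "BBB", "AAA"],
        "Price": [10.5, 20.0, 11.0],
    }
)

# Check a candidate column before promoting it to the index.
is_unique_key = trades["Ticker"].is_unique

# With duplicates in the index, .loc returns a DataFrame for a repeated label
# but a Series for a unique one, so downstream code has to handle both shapes.
dup = trades.set_index("Ticker")
result = dup.loc["AAA"]
lookup_is_frame = isinstance(result, pd.DataFrame)

# verify_integrity=True makes set_index raise instead of silently accepting duplicates.
try:
    trades.set_index("Ticker", verify_integrity=True)
    raised = False
except ValueError:
    raised = True
```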
