
How does one determine which columns to set as an index in a Pandas DataFrame?

Let's say I have a DataFrame of financial securities, which often have multiple identifiers:

[image: enter image description here]

Should I choose only one column to set as the index? Should I set all potential identifiers as the index? Should I set all text data as an index, and leave all numeric data as columns? What is the best practice?

This is more about database design than pandas.

The decision should be based on the business meaning of the DataFrame (a table, in relational database terms) and its columns. E.g., if 'Internal Security ID' is what the business uses to identify this kind of data, then it should be set as the index.
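As a rough sketch of that advice, the snippet below sets a business identifier as the index and looks a row up by it. Only the column name 'Internal Security ID' comes from the question; the other columns and all the values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: only 'Internal Security ID' is named in the question.
securities = pd.DataFrame(
    {
        "Internal Security ID": ["SEC001", "SEC002", "SEC003"],
        "Ticker": ["AAA", "BBB", "CCC"],
        "Price": [10.5, 20.0, 7.25],
    }
)

# Use the business identifier as the index; other identifiers stay as columns.
by_id = securities.set_index("Internal Security ID")

# Label-based lookup by that identifier.
row = by_id.loc["SEC002"]             # a Series holding the whole row
price = by_id.loc["SEC002", "Price"]  # a single value

# reset_index() moves the identifier back into a column and
# restores the default integer index.
restored = by_id.reset_index()
```

The other identifiers remain ordinary columns, so they can still be filtered on or merged against without being part of the index.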

However, if you are not sure, just stick with the default integer index.
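For what it's worth, keeping the default index does not stop you from looking rows up by an identifier. A minimal sketch, with hypothetical column names and values:

```python
import pandas as pd

# Hypothetical data; no index is specified on construction.
df = pd.DataFrame(
    {
        "Ticker": ["AAA", "BBB", "CCC"],
        "Price": [10.5, 20.0, 7.25],
    }
)

# pandas assigns a RangeIndex (0, 1, 2, ...) by default.
default_index = df.index

# Look rows up by identifier with a boolean mask on the column.
match = df[df["Ticker"] == "BBB"]

# Positional access still works through iloc.
first_row = df.iloc[0]
```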

I tend to stick with the default index unless there is a real need to use one of the columns as the index. If there is, I strongly recommend using a column with unique values. Duplicates in the index will cause you a lot of headaches.
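One example of the kind of headache duplicates cause: the same `.loc` lookup returns a different type depending on whether the label is repeated. The sketch below uses made-up data and shows two ways to check for this, `Index.is_unique` and the `verify_integrity` argument of `set_index`.

```python
import pandas as pd

# Hypothetical data in which the 'Ticker' value "AAA" appears twice.
df = pd.DataFrame(
    {
        "Ticker": ["AAA", "AAA", "BBB"],
        "Price": [10.5, 11.0, 20.0],
    }
)

by_ticker = df.set_index("Ticker")

# Check uniqueness before relying on the index for lookups.
index_is_unique = by_ticker.index.is_unique

# The same kind of lookup yields different types:
dup_result = by_ticker.loc["AAA"]     # DataFrame (two matching rows)
single_result = by_ticker.loc["BBB"]  # Series (one matching row)

# verify_integrity=True makes set_index raise on duplicate keys.
try:
    df.set_index("Ticker", verify_integrity=True)
    raised = False
except ValueError:
    raised = True
```

Code written against the unique case (expecting a Series) tends to break in confusing ways once a duplicate slips in, so checking up front is cheap insurance.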


 