
Identify ID columns in a data frame

Is there any way to automatically identify columns such as Account_Number, Employee_ID, or Transaction_ID in a data frame, which are usually excluded from model building? Note that there might be more than one record for the same employee across different dates. In short, how can useless columns be identified when their values are not unique?

There are several ways to recognize the least important columns (features) in a dataset. Correlation is one of them. To follow the example below, first download the movies dataset (tmdb_5000_movies.csv) from Kaggle.

import pandas as pd

# Load the dataset and keep the id plus a few numeric columns
df = pd.read_csv("tmdb_5000_movies.csv")
df = df[["id", "budget", "popularity", "vote_average"]]
df.head()

This is how the dataframe looks:

    id       budget     popularity  vote_average
0   19995   237000000   150.437577  7.2
1   285     300000000   139.082615  6.9
2   206647  245000000   107.376788  6.3
3   49026   250000000   112.312950  7.6
4   49529   260000000   43.926995   6.1

We are looking for an automatic way of detecting that "id" is a useless column.

Let's find the correlation between every pair of columns:

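# Build the pairwise correlation matrix one cell at a time
corr_df = pd.DataFrame(columns=list(df.columns))
for col_from in df.columns:
    for col_to in df.columns:
        # Series.corr computes the Pearson correlation by default
        corr_df.loc[col_from, col_to] = df[col_from].corr(df[col_to])
print(corr_df.head())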

Correlation is a measure between -1 and 1. Values close to zero indicate that the two columns are uncorrelated; the further a value is from zero (in either direction), the more strongly the two columns are coupled. Observe how id has a very small correlation with budget and popularity:

                     id     budget popularity vote_average
id                    1 -0.0893767   0.031202    -0.270595
budget       -0.0893767          1   0.505414    0.0931457
popularity     0.031202   0.505414          1     0.273952
vote_average  -0.270595  0.0931457   0.273952            1
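As a side note, the double loop above can be sketched more compactly with the built-in `DataFrame.corr`, which also computes pairwise Pearson correlations by default. The snippet below uses a small synthetic frame (the values and the name `demo` are made up for illustration, not taken from the Kaggle data) and checks that both approaches agree.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the movies data; all values are invented.
rng = np.random.default_rng(0)
n = 200
budget = rng.uniform(1e6, 3e8, n)
demo = pd.DataFrame({
    "id": rng.permutation(n),
    "budget": budget,
    "popularity": budget / 1e6 + rng.normal(0, 50, n),
    "vote_average": rng.uniform(1, 10, n),
})

# Pairwise Pearson correlation of all numeric columns in one call.
corr_matrix = demo.corr()

# The same matrix built cell by cell with Series.corr, as in the loop above.
loop_matrix = pd.DataFrame(
    {c2: {c1: demo[c1].corr(demo[c2]) for c1 in demo.columns}
     for c2 in demo.columns}
).loc[corr_matrix.index, corr_matrix.columns]

print(corr_matrix.round(3))
```

One practical difference is that `DataFrame.corr` returns a float-typed frame, whereas filling an empty frame cell by cell leaves object-typed columns, which is why the printed output above is formatted unevenly.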

Let's go one step further: take the absolute value of every correlation and sum them per column. The column with the lowest correlation score is considered the least useful:

corr_df = corr_df.abs()
# Sum each column's correlations and subtract 1 to drop its self-correlation
corr_df["sum"] = corr_df.sum(axis=0) - 1
print(corr_df.head())

Result:

                     id     budget popularity vote_average       sum
id                    1  0.0893767   0.031202     0.270595  0.391173
budget        0.0893767          1   0.505414    0.0931457  0.687936
popularity     0.031202   0.505414          1     0.273952  0.810568
vote_average   0.270595  0.0931457   0.273952            1  0.637692
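The scoring step can be wrapped into a small helper that ranks columns from lowest to highest score. This is only a sketch under assumptions: the function name `correlation_scores`, the frame `toy`, and its column names are hypothetical, and the data is synthetic, with a deliberately non-unique `record_id` column to mirror the repeated ids in the question.

```python
import numpy as np
import pandas as pd

def correlation_scores(frame: pd.DataFrame) -> pd.Series:
    """Sum of absolute correlations with every other column, lowest first."""
    abs_corr = frame.corr().abs()
    # Each row includes the self-correlation of 1 on the diagonal; remove it.
    scores = abs_corr.sum(axis=1) - 1.0
    return scores.sort_values()

# Synthetic data: three related features and one repeated, unrelated id.
rng = np.random.default_rng(42)
n = 500
signal = rng.normal(size=n)
toy = pd.DataFrame({
    "record_id": rng.integers(0, 100, size=n),  # non-unique ids
    "feature_a": signal + rng.normal(scale=0.5, size=n),
    "feature_b": 2 * signal + rng.normal(scale=0.5, size=n),
    "feature_c": -signal + rng.normal(scale=0.5, size=n),
})

scores = correlation_scores(toy)
print(scores)

# The lowest-scoring column is the first candidate for removal.
candidate = scores.index[0]
print(candidate)
```

The ranking only suggests candidates; whether the lowest-scoring column really is an identifier still has to be confirmed by inspection.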

Note that this method has several issues. For example, if ids increase from 0 to N and another column also increases across the rows at a constant rate, their correlation will be high. Moreover, some column X might have a smaller correlation with column Y than Y has with id. Nevertheless, the absolute sum is good enough in most cases.
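The first caveat is easy to reproduce. In the sketch below (synthetic data, hypothetical column names), the ids follow the row order and a second column grows at a constant rate, so the two are almost perfectly correlated and the id column no longer gets the lowest score.

```python
import numpy as np
import pandas as pd

n = 100
rng = np.random.default_rng(7)
rows = pd.DataFrame({
    "id": np.arange(n),                          # ids assigned in row order
    "running_total": 10.0 * np.arange(n) + 5.0,  # grows at a constant rate
    "noise": rng.normal(size=n),                 # unrelated measurement
})

# Correlation between the id and the steadily increasing column.
r = rows["id"].corr(rows["running_total"])
print(r)

# Summed absolute correlation per column, excluding the diagonal.
scores = rows.corr().abs().sum(axis=1) - 1.0
print(scores.sort_values())

# The column flagged as least useful here is not the id.
flagged = scores.idxmin()
print(flagged)
```

In a case like this the score would point at the wrong column, so the result is best treated as a hint rather than a decision.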
