(pandas) How to create a unique identifier from three similar data columns, where order does not matter?

Question

(Python/Pandas) I'm doing some analysis on UK house price data, looking at whether house prices respond to the quality of nearby schools. I have matched the URN (Unique Reference Number) of the three nearest schools to each house price transaction. These are columns URN_1, URN_2, URN_3 in the data.

I would like to estimate a fixed effects model on the data, where the fixed effects are based on the three nearest schools. I therefore want to create a unique ID for each cluster of three schools, and I want this to be unaffected by the order of the schools. For example, Property A and Property B should have the same ID, despite the different order of the schools.

Property    URN_1   URN_2   URN_3
A         100053   100052   100054
B         100052   100054   100053

Does anyone know how I can create unique cluster IDs using Python?

I've tried using .groupby() to create the ID with the code below, but this gives different cluster IDs when the order of the schools is different.

Here is what I have tried:

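import pandas as pd

URN1 = [1, 2, 3, 4, 5]
URN2 = [5, 4, 3, 2, 1]
URN3 = [1, 2, 3, 2, 1]
lst = ['a', 'b', 'c', 'd', 'e']
df = pd.DataFrame(list(zip(URN1, URN2, URN3)),
                  columns=['URN_1', 'URN_2', 'URN_3'], index=lst)
# ngroup() numbers each distinct (URN_1, URN_2, URN_3) combination, so column order matters
df['clusterid'] = df.groupby(['URN_1', 'URN_2', 'URN_3']).ngroup()
print(df)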

I want observations 'a' and 'e' to have the same cluster ID, but this method gives them different IDs.

Answer 1

This works if your data is not too long:

# we sort the values of each row
# and turn them to tuples
markers = (df[['URN_1','URN_2','URN_3']]
             .apply(lambda x: tuple(sorted(x.values)), axis=1)
          )

df['clisterid'] = df.groupby(markers).ngroup()

Output:

  Property   URN_1   URN_2   URN_3  clisterid
0        A  100053  100052  100054          0
1        B  100052  100054  100053          0
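As a possible middle ground, the row-wise sort can be done in one vectorized NumPy call instead of apply. The sketch below is not part of the original answer; it assumes the three URN columns are numeric and uses the sample frame from the question, with the helper names (urn_cols, sorted_urns) made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question: rows 'a' and 'e' hold the same values in a different order.
df = pd.DataFrame(
    {"URN_1": [1, 2, 3, 4, 5], "URN_2": [5, 4, 3, 2, 1], "URN_3": [1, 2, 3, 2, 1]},
    index=list("abcde"),
)

urn_cols = ["URN_1", "URN_2", "URN_3"]  # illustrative helper name

# Sort the values within each row in a single vectorized call.
sorted_urns = pd.DataFrame(
    np.sort(df[urn_cols].to_numpy(), axis=1),
    index=df.index,
    columns=urn_cols,
)

# Group on the sorted columns, so the original column order no longer matters.
df["clusterid"] = sorted_urns.groupby(urn_cols).ngroup()
print(df)
```

With this data, rows 'a' and 'e' share one ID and rows 'b' and 'd' share another.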

Option 2: the solution above uses apply, which might not be ideal in some cases. Here's a little math trick: a triple (a, b, c) is uniquely determined, up to a permutation, by (a+b+c, a**2+b**2+c**2, a*b*c), because these three values fix the coefficients of the cubic polynomial whose roots are a, b and c. So we can compute those values and group by them:

tmp_df = df[['URN_1','URN_2','URN_3']]

s = tmp_df.sum(1)         # sums
sq = (tmp_df**2).sum(1)   # sum of squares
p = tmp_df.prod(1)        # products

# groupby
df['clisterid'] = df.groupby([s,sq,p]).ngroup()

Performance: the first approach takes 14 seconds to process 2 million rows, while the second takes less than 1 second.
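One way to gain confidence in the symmetric-sums trick is to check that it produces the same partition of rows as the sort-based approach. The following is only a sketch on made-up random data (the seed, value range and row count are arbitrary assumptions), not a benchmark or a proof.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
# Hypothetical test data: small integers so that many rows share the same three values.
df = pd.DataFrame(rng.integers(1, 20, size=(10_000, 3)),
                  columns=["URN_1", "URN_2", "URN_3"])
urns = df[["URN_1", "URN_2", "URN_3"]].astype("int64")

# Reference partition: group on the row-wise sorted values.
sorted_urns = pd.DataFrame(np.sort(urns.to_numpy(), axis=1),
                           index=df.index, columns=["s1", "s2", "s3"])
by_sort = sorted_urns.groupby(["s1", "s2", "s3"]).ngroup()

# Partition from the sum, the sum of squares and the product of each row.
s = urns.sum(axis=1)
sq = (urns ** 2).sum(axis=1)
p = urns.prod(axis=1)
by_sums = df.groupby([s, sq, p]).ngroup()

# The partitions agree if every group of one method maps to exactly one group of the other.
same_partition = bool(
    by_sums.groupby(by_sort).nunique().eq(1).all()
    and by_sort.groupby(by_sums).nunique().eq(1).all()
)
print(same_partition)
```

The product is the term to watch: three six-digit identifiers multiply to less than 1e18, which fits in int64, but identifiers with seven or more digits could overflow, in which case the sort-based approach is the safer choice.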

Answer 2

Use factorize on a string built from each combination. Since the order does not matter, we sort the values first and then join them.

df['clusterid'] = pd.factorize(df[['URN_1','URN_2','URN_3']].apply(lambda x: ','.join([str(y) for y in sorted(x)]),1))[0]

Output:

       URN_1  URN_2  URN_3  clusterid  clisterid
a      1      5      1          0          0
b      2      4      2          1          1
c      3      3      3          2          2
d      4      2      2          1          1
e      5      1      1          0          0
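A point worth knowing when comparing the answers: pd.factorize numbers the keys in order of first appearance unless sort=True is passed, whereas groupby(...).ngroup() numbers groups in sorted key order by default. Both give valid cluster IDs, but the labels themselves can differ. A small sketch with made-up keys (the strings below stand in for already sorted and joined URNs):

```python
import pandas as pd

# Made-up keys, deliberately not in sorted order.
keys = pd.Series(["2,2,4", "1,1,5", "2,2,4"])

codes_first_seen, uniques = pd.factorize(keys)    # numbered by first appearance
codes_sorted, _ = pd.factorize(keys, sort=True)   # numbered by sorted key
codes_ngroup = keys.groupby(keys).ngroup()        # groupby sorts keys by default

print(list(codes_first_seen))  # [0, 1, 0]
print(list(codes_sorted))      # [1, 0, 1]
print(codes_ngroup.tolist())   # [1, 0, 1]
```

For a fixed effects model only the grouping matters, not the particular integer assigned to each group.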

Answer 3

You can create a string for each row from the 3 URNs, sorted.
Then group by this new variable and use ngroup() as you tried before:

df['URN_join'] = df[['URN_1','URN_2','URN_3']].apply(lambda x: '_'.join([str(nb) for nb in sorted(x)]), axis=1)
df['clusterid'] = df.groupby(['URN_join']).ngroup()
df

Output:

    URN_1   URN_2   URN_3   clusterid   URN_join
a   1       5       1       0           1_1_5
b   2       4       2       1           2_2_4
c   3       3       3       2           3_3_3
d   4       2       2       1           2_2_4
e   5       1       1       0           1_1_5
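The separator in the join is doing real work when identifiers can have different lengths. The sketch below uses made-up numbers and a hypothetical helper, joined_key, to show two different sets of values that produce the same string without a separator and different strings with one. URNs of a fixed length would not hit this case, so treat it as a general caution rather than a problem with the answer.

```python
def joined_key(urns, sep):
    """Sort the identifiers and join them into one string key."""
    return sep.join(str(u) for u in sorted(urns))

set_a = (1, 12, 345)  # made-up identifiers of different lengths
set_b = (11, 23, 45)  # a different set of identifiers

# Without a separator both sets collapse to "112345".
collides = joined_key(set_a, "") == joined_key(set_b, "")
# With "_" the keys are "1_12_345" and "11_23_45".
distinct = joined_key(set_a, "_") != joined_key(set_b, "_")
print(collides, distinct)  # True True
```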

3 answers

Answer 1
1 2019-06-17 15:02:40

Answer 2
0 2019-06-17 14:49:32

Answer 3
0 2019-06-17 15:07:54
