(pandas) How can I create a unique identifier based on three similar columns of data, where order doesn't matter?
I'm doing some analysis on UK house price data, looking at whether house prices respond to the quality of nearby schools. I have matched the URN (Unique Reference Number) of the three nearest schools to each house price transaction. These are columns URN_1, URN_2, URN_3 in the data.
I would like to estimate a fixed effects model on the data, where the fixed effects are based on the three nearest schools. I therefore want to create a unique ID for each cluster of three schools, and I want this to be unaffected by the order of the schools. For example, Property A and Property B should have the same ID, despite the different order of the schools:
Property URN_1 URN_2 URN_3
A 100053 100052 100054
B 100052 100054 100053
Does anyone know how I can create unique cluster ids using Python?
I've tried using .groupby() to create the ID with the code below, but this gives different cluster ids when the order of the schools is different.
Here is what I have tried:
import pandas as pd
URN1=[1,2,3,4,5]
URN2=[5,4,3,2,1]
URN3=[1,2,3,2,1]
lst=['a','b','c','d','e']
df=pd.DataFrame(list(zip(URN1,URN2,URN3)),
                columns=['URN_1','URN_2','URN_3'],index=lst)
df['clusterid']=df.groupby(['URN_1','URN_2','URN_3']).ngroup()
print(df)
I want observations 'a' and 'e' to have the same cluster id, but this method gives them different ids.
This works if your data is not too long:
# we sort the values of each row
# and turn them to tuples
markers = (df[['URN_1','URN_2','URN_3']]
.apply(lambda x: tuple(sorted(x.values)), axis=1)
)
df['clisterid'] = df.groupby(markers).ngroup()
Output:
Property URN_1 URN_2 URN_3 clisterid
0 A 100053 100052 100054 0
1 B 100052 100054 100053 0
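The same sort-then-group idea can also be sketched without a row-wise apply, by sorting every row in a single NumPy call. This is only an illustrative variant, not part of the original answer; it assumes the URN columns are numeric and uses the sample data from the question.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    {"URN_1": [1, 2, 3, 4, 5], "URN_2": [5, 4, 3, 2, 1], "URN_3": [1, 2, 3, 2, 1]},
    index=list("abcde"),
)
urn_cols = ["URN_1", "URN_2", "URN_3"]

# Sort the values within each row, so the column order no longer matters
sorted_urns = np.sort(df[urn_cols].to_numpy(), axis=1)

# Group by the sorted columns; rows with the same set of schools share an id
keys = [pd.Series(sorted_urns[:, i], index=df.index) for i in range(sorted_urns.shape[1])]
df["clusterid"] = df.groupby(keys).ngroup()
print(df)
```

Rows 'a' and 'e' end up in the same group here, as do 'b' and 'd'.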
Option 2: the first solution uses apply, which might not be ideal in some cases. Here's a little math trick: a triple (a,b,c) is uniquely defined (up to a permutation) by (a+b+c, a**2+b**2+c**2, abc), because these three values fix the coefficients of the cubic polynomial whose roots are a, b and c. So we can compute those values and group by them:
tmp_df = df[['URN_1','URN_2','URN_3']]
s = tmp_df.sum(axis=1) # row sums
sq = (tmp_df**2).sum(axis=1) # row sums of squares
p = tmp_df.prod(axis=1) # row products
# groupby
df['clisterid'] = df.groupby([s,sq,p]).ngroup()
Performance: the first approach takes 14s to process 2 million rows, while the second takes less than 1 second.
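As a sanity check, the grouping from the (sum, sum of squares, product) key can be compared with the grouping from the sorted-tuple key. The sketch below uses the question's sample data and assumes integer URNs stored as int64. With six-digit URNs the product stays well inside the int64 range, but much larger identifiers could overflow silently, so this is worth checking on real data.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    {"URN_1": [1, 2, 3, 4, 5], "URN_2": [5, 4, 3, 2, 1], "URN_3": [1, 2, 3, 2, 1]},
    index=list("abcde"),
)
cols = ["URN_1", "URN_2", "URN_3"]
tmp = df[cols].astype("int64")

# Order-independent key: sum, sum of squares and product of each row
s = tmp.sum(axis=1)
sq = (tmp ** 2).sum(axis=1)
p = tmp.prod(axis=1)
ids_math = df.groupby([s, sq, p]).ngroup()

# Reference key: the sorted tuple of each row
sorted_keys = df[cols].apply(lambda row: tuple(sorted(row)), axis=1)
ids_sorted = df.groupby(sorted_keys).ngroup()

# The two labelings agree if every group of one maps to exactly one group of the other
check = pd.DataFrame({"math": ids_math, "sorted": ids_sorted})
same_partition = bool(
    check.groupby("math")["sorted"].nunique().eq(1).all()
    and check.groupby("sorted")["math"].nunique().eq(1).all()
)
print(same_partition)
```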
Use pd.factorize on a string key built from each combination. Since the order does not matter, sort the values first and then join them.
# build a sorted, comma-separated key per row, then factorize it into integer ids
keys = df[['URN_1','URN_2','URN_3']].apply(lambda x: ','.join(str(y) for y in sorted(x)), axis=1)
df['clusterid'] = pd.factorize(keys)[0]
Output:
URN_1 URN_2 URN_3 clusterid clisterid
a 1 5 1 0 0
b 2 4 2 1 1
c 3 3 3 2 2
d 4 2 2 3 1
e 5 1 1 4 0
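One practical difference between pd.factorize and groupby(...).ngroup() is the numbering: with default settings, factorize assigns ids in order of first appearance, while ngroup numbers the groups in sorted key order. Either is fine for fixed effects, since only group membership matters. A small illustration with made-up keys:

```python
import pandas as pd

# Hypothetical pre-built keys (sorted URNs joined by "_")
keys = pd.Series(["2_2_4", "1_1_5", "2_2_4"])

# factorize: ids follow the order in which keys first appear
codes, uniques = pd.factorize(keys)

# ngroup: ids follow the sorted order of the keys (groupby sorts by default)
ngroup_ids = keys.groupby(keys).ngroup()

print(codes.tolist())
print(ngroup_ids.tolist())
```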
You can create a string for each row from its 3 URNs, sorted.
Then group by this new variable and use ngroup() as you tried before:
df['URN_join'] = df[['URN_1','URN_2','URN_3']].apply(lambda x: '_'.join([str(nb) for nb in sorted(x)]), axis=1)
df['clusterid'] = df.groupby(['URN_join']).ngroup()
df
Output:
URN_1 URN_2 URN_3 clusterid URN_join
a 1 5 1 0 1_1_5
b 2 4 2 1 2_2_4
c 3 3 3 2 3_3_3
d 4 2 2 1 2_2_4
e 5 1 1 0 1_1_5
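The string-key approach can be wrapped in a small helper and applied to the Property table from the question. The function name add_cluster_id and its parameters are made up for this sketch; it is one possible packaging of the idea, not a pandas API.

```python
import pandas as pd

def add_cluster_id(frame, cols, name="clusterid"):
    """Return a copy of frame with an order-insensitive group id over cols.

    Hypothetical helper: sorts each row's values, joins them into a string
    key, and numbers the resulting groups with ngroup().
    """
    key = frame[cols].apply(lambda row: "_".join(str(v) for v in sorted(row)), axis=1)
    out = frame.copy()
    out[name] = out.groupby(key).ngroup()
    return out

# The two properties from the question: same schools, different order
houses = pd.DataFrame(
    {
        "Property": ["A", "B"],
        "URN_1": [100053, 100052],
        "URN_2": [100052, 100054],
        "URN_3": [100054, 100053],
    }
)
result = add_cluster_id(houses, ["URN_1", "URN_2", "URN_3"])
print(result)
```

The separator in the key matters: without it, values such as (1, 23) and (12, 3) would produce the same string.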