[英]Pandas: How to assign random number based on unique column values
I have a large dataset with the columns 'group' and 'postcode'.我有一个包含“组”和“邮政编码”列的大型数据集。 An example of the df is given below:
下面给出了一个 df 的示例:
group postcode
group_1 WC2E 8BU
group_1 WC2E 8BU
group_1 WC2E 8BU
group_2 WC2E 8BU
group_2 WC2E 8BU
group_2 WC2E 8BU
group_2 WC1A 1DD
group_2 WC1A 1DD
group_2 WC1A 1DD
group_2 WC1A 1DD
1488087 WC1A 1DD
1488087 WC1A 1DD
I am trying to create a new column called 'random_val' to assign a random uniform number to each matching postcode in a unique group, for rows where there are no digits in the 'group' column.我正在尝试创建一个名为“random_val”的新列,为唯一组中的每个匹配邮政编码分配一个随机统一编号,用于“组”列中没有数字的行。 My code is shown below:
我的代码如下所示:
df.loc[~df['group'].astype(str).str.isdigit(), 'random_val'] = df['postcode'].map(dict(zip(df['postcode'].unique(), np.random.uniform(0, 1, size=len(self.data['postcode'].unique())))))
Currently, this code assigns a unique random number to a unique postcode, regardless of the group it is in:目前,此代码为唯一的邮政编码分配一个唯一的随机数,无论它属于哪个组:
group postcode random_val
group_1 WC2E 8BU 0.210917735
group_1 WC2E 8BU 0.210917735
group_1 WC2E 8BU 0.210917735
group_2 WC2E 8BU 0.210917735
group_2 WC2E 8BU 0.210917735
group_2 WC2E 8BU 0.210917735
group_2 WC1A 1DD 0.55733542
group_2 WC1A 1DD 0.55733542
group_2 WC1A 1DD 0.55733542
group_2 WC1A 1DD 0.55733542
1488087 WC1A 1DD
1488087 WC1A 1DD
However, I would like the random number to be unique to the postcode and the group:但是,我希望随机数对邮政编码和组是唯一的:
group postcode random_val
group_1 WC2E 8BU 0.210917735
group_1 WC2E 8BU 0.210917735
group_1 WC2E 8BU 0.210917735
group_2 WC2E 8BU 0.494920676
group_2 WC2E 8BU 0.494920676
group_2 WC2E 8BU 0.494920676
group_2 WC1A 1DD 0.55733542
group_2 WC1A 1DD 0.55733542
group_2 WC1A 1DD 0.55733542
group_2 WC1A 1DD 0.55733542
1488087 WC1A 1DD
1488087 WC1A 1DD
Struggling to figure out how to do this.正在努力弄清楚如何做到这一点。 Any help appreciated.
任何帮助表示赞赏。 Thanks
谢谢
You can take advantage of pandas alignment here.您可以在此处利用 pandas alignment。
df.set_index('group',inplace=True)
unique_idx = df.index[~df.index.str.isdigit()].unique()
s = pd.Series(np.random.uniform(0,1,len(unique_idx)) , index =unique_idx)
df['random_value'] = s
df.reset_index()
group postcode random_value
0 group_1 WC2E 8BU 0.232501
1 group_1 WC2E 8BU 0.232501
2 group_1 WC2E 8BU 0.232501
3 group_2 WC2E 8BU 0.242696
4 group_2 WC2E 8BU 0.242696
5 group_2 WC2E 8BU 0.242696
6 group_2 WC1A 1DD 0.242696
7 group_2 WC1A 1DD 0.242696
8 group_2 WC1A 1DD 0.242696
9 group_2 WC1A 1DD 0.242696
10 1488087 WC1A 1DD NaN
11 1488087 WC1A 1DD NaN
Here's a solution:这是一个解决方案:
def random_val(x):
return pd.Series([np.random.uniform(0, 1)] * x.size)
df["dummy"] = 1
df["random_val"] = df.groupby(["group", "postcode"])["dummy"].transform(random_val)
df.loc[df['group'].astype(str).str.isdigit(), "random_val"] = None
The result is:结果是:
group postcode dummy random_val
0 group_1 WC2E 8BU 1 0.781711
1 group_1 WC2E 8BU 1 0.781711
2 group_1 WC2E 8BU 1 0.781711
3 group_2 WC2E 8BU 1 0.107743
4 group_2 WC2E 8BU 1 0.107743
5 group_2 WC2E 8BU 1 0.107743
6 group_2 WC1A 1DD 1 0.103295
7 group_2 WC1A 1DD 1 0.103295
8 group_2 WC1A 1DD 1 0.103295
9 group_2 WC1A 1DD 1 0.103295
10 1488087 WC1A 1DD 1 NaN
11 1488087 WC1A 1DD 1 NaN
Hashing the two columns might be the simplest solution:散列两列可能是最简单的解决方案:
df['hash'] = pd.Series((hash(tuple(row)) for _, row in df.iterrows()))
group postcode hash
0 group_1 WC2E 8BU -8918045538474016779
1 group_1 WC2E 8BU -8918045538474016779
2 group_1 WC2E 8BU -8918045538474016779
3 group_2 WC2E 8BU -6943464964421442707
4 group_2 WC2E 8BU -6943464964421442707
5 group_2 WC2E 8BU -6943464964421442707
6 group_2 WC1A 1DD -357652478068898330
7 group_2 WC1A 1DD -357652478068898330
8 group_2 WC1A 1DD -357652478068898330
9 group_2 WC1A 1DD -357652478068898330
10 1488087 WC1A 1DD 1701757393872926575
11 1488087 WC1A 1DD 1701757393872926575
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.