I have a large dataset with the columns 'group' and 'postcode'. An example of the df is given below:
group postcode
group_1 WC2E 8BU
group_1 WC2E 8BU
group_1 WC2E 8BU
group_2 WC2E 8BU
group_2 WC2E 8BU
group_2 WC2E 8BU
group_2 WC1A 1DD
group_2 WC1A 1DD
group_2 WC1A 1DD
group_2 WC1A 1DD
1488087 WC1A 1DD
1488087 WC1A 1DD
I am trying to create a new column called 'random_val' that assigns the same random uniform number to every row sharing a (group, postcode) pair, but only for rows where the 'group' value is not purely numeric. My code is shown below:
df.loc[~df['group'].astype(str).str.isdigit(), 'random_val'] = df['postcode'].map(dict(zip(df['postcode'].unique(), np.random.uniform(0, 1, size=len(df['postcode'].unique())))))
Currently, this code assigns a unique random number to a unique postcode, regardless of the group it is in:
group postcode random_val
group_1 WC2E 8BU 0.210917735
group_1 WC2E 8BU 0.210917735
group_1 WC2E 8BU 0.210917735
group_2 WC2E 8BU 0.210917735
group_2 WC2E 8BU 0.210917735
group_2 WC2E 8BU 0.210917735
group_2 WC1A 1DD 0.55733542
group_2 WC1A 1DD 0.55733542
group_2 WC1A 1DD 0.55733542
group_2 WC1A 1DD 0.55733542
1488087 WC1A 1DD
1488087 WC1A 1DD
However, I would like the random number to be unique to the postcode and the group:
group postcode random_val
group_1 WC2E 8BU 0.210917735
group_1 WC2E 8BU 0.210917735
group_1 WC2E 8BU 0.210917735
group_2 WC2E 8BU 0.494920676
group_2 WC2E 8BU 0.494920676
group_2 WC2E 8BU 0.494920676
group_2 WC1A 1DD 0.55733542
group_2 WC1A 1DD 0.55733542
group_2 WC1A 1DD 0.55733542
group_2 WC1A 1DD 0.55733542
1488087 WC1A 1DD
1488087 WC1A 1DD
I'm struggling to figure out how to do this. Any help appreciated. Thanks
You can take advantage of pandas alignment here. Set a ('group', 'postcode') MultiIndex so that a Series of random draws, indexed by the unique pairs, broadcasts to every matching row:
df.set_index(['group', 'postcode'], inplace=True)
unique_idx = df.index[~df.index.get_level_values('group').str.isdigit()].unique()
s = pd.Series(np.random.uniform(0, 1, len(unique_idx)), index=unique_idx)
df['random_val'] = s
df.reset_index(inplace=True)
group postcode random_val
0 group_1 WC2E 8BU 0.232501
1 group_1 WC2E 8BU 0.232501
2 group_1 WC2E 8BU 0.232501
3 group_2 WC2E 8BU 0.242696
4 group_2 WC2E 8BU 0.242696
5 group_2 WC2E 8BU 0.242696
6 group_2 WC1A 1DD 0.588465
7 group_2 WC1A 1DD 0.588465
8 group_2 WC1A 1DD 0.588465
9 group_2 WC1A 1DD 0.588465
10 1488087 WC1A 1DD NaN
11 1488087 WC1A 1DD NaN
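For completeness, here is a minimal end-to-end sketch of the alignment approach. The sample frame and column names are reconstructed from the question; the random values will of course differ on each run:

```python
import numpy as np
import pandas as pd

# Sample frame mirroring the question's data
df = pd.DataFrame({
    "group": ["group_1"] * 3 + ["group_2"] * 7 + ["1488087"] * 2,
    "postcode": ["WC2E 8BU"] * 6 + ["WC1A 1DD"] * 6,
})

# One random draw per unique (group, postcode) pair, skipping all-digit groups
df = df.set_index(["group", "postcode"])
mask = ~df.index.get_level_values("group").str.isdigit()
unique_idx = df.index[mask].unique()
s = pd.Series(np.random.uniform(0, 1, len(unique_idx)), index=unique_idx)

# Assignment aligns s on the MultiIndex, broadcasting each draw to its rows;
# pairs absent from s (the all-digit groups) come out as NaN
df["random_val"] = s
df = df.reset_index()
print(df)
```

Rows belonging to the same (group, postcode) pair share one value, and the numeric-group rows are left as NaN.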
Here's a solution:
def random_val(x):
    # one draw per (group, postcode) block; transform broadcasts the scalar
    return np.random.uniform(0, 1)
df["dummy"] = 1
df["random_val"] = df.groupby(["group", "postcode"])["dummy"].transform(random_val)
df.loc[df['group'].astype(str).str.isdigit(), "random_val"] = np.nan
The result is:
group postcode dummy random_val
0 group_1 WC2E 8BU 1 0.781711
1 group_1 WC2E 8BU 1 0.781711
2 group_1 WC2E 8BU 1 0.781711
3 group_2 WC2E 8BU 1 0.107743
4 group_2 WC2E 8BU 1 0.107743
5 group_2 WC2E 8BU 1 0.107743
6 group_2 WC1A 1DD 1 0.103295
7 group_2 WC1A 1DD 1 0.103295
8 group_2 WC1A 1DD 1 0.103295
9 group_2 WC1A 1DD 1 0.103295
10 1488087 WC1A 1DD 1 NaN
11 1488087 WC1A 1DD 1 NaN
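A self-contained, reproducible variant of the same idea: it rebuilds the sample frame from the question, seeds a NumPy generator so reruns give identical draws, and groups on an existing column so the dummy column isn't needed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": ["group_1"] * 3 + ["group_2"] * 7 + ["1488087"] * 2,
    "postcode": ["WC2E 8BU"] * 6 + ["WC1A 1DD"] * 6,
})

rng = np.random.default_rng(42)  # seeded generator instead of global state

# transform calls the lambda per (group, postcode) block and broadcasts
# the returned scalar back to every row of that block
df["random_val"] = (
    df.groupby(["group", "postcode"])["postcode"]
      .transform(lambda _: rng.uniform(0, 1))
)
df.loc[df["group"].str.isdigit(), "random_val"] = np.nan
print(df)
```

Using `np.random.default_rng` is the currently recommended NumPy API; the legacy `np.random.uniform` in the answer above works just as well if reproducibility is not a concern.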
Hashing the two columns might be the simplest solution:
df['hash'] = df[['group', 'postcode']].apply(lambda row: hash(tuple(row)), axis=1)
(Selecting just the two columns keeps this working if the frame gains extra columns; iterating the whole row with iterrows would silently change the hashes. Note also that Python's built-in hash is salted per interpreter run, so the values below are not reproducible across sessions.)
group postcode hash
0 group_1 WC2E 8BU -8918045538474016779
1 group_1 WC2E 8BU -8918045538474016779
2 group_1 WC2E 8BU -8918045538474016779
3 group_2 WC2E 8BU -6943464964421442707
4 group_2 WC2E 8BU -6943464964421442707
5 group_2 WC2E 8BU -6943464964421442707
6 group_2 WC1A 1DD -357652478068898330
7 group_2 WC1A 1DD -357652478068898330
8 group_2 WC1A 1DD -357652478068898330
9 group_2 WC1A 1DD -357652478068898330
10 1488087 WC1A 1DD 1701757393872926575
11 1488087 WC1A 1DD 1701757393872926575
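If reproducible values in [0, 1) are the goal, a cryptographic digest can stand in for the salted built-in hash. A sketch of that idea, where the helper name stable_unit_hash is illustrative and not from the post:

```python
import hashlib

import pandas as pd

df = pd.DataFrame({
    "group": ["group_1", "group_2", "group_2", "1488087"],
    "postcode": ["WC2E 8BU", "WC2E 8BU", "WC1A 1DD", "WC1A 1DD"],
})

def stable_unit_hash(group: str, postcode: str) -> float:
    """Deterministic pseudo-uniform value in [0, 1) for a (group, postcode) pair."""
    digest = hashlib.sha256(f"{group}|{postcode}".encode()).digest()
    # interpret the first 8 bytes as an unsigned integer, scaled to [0, 1)
    return int.from_bytes(digest[:8], "big") / 2**64

df["random_val"] = [
    stable_unit_hash(g, p) for g, p in zip(df["group"], df["postcode"])
]
print(df)
```

Unlike a true random draw, the same (group, postcode) pair always maps to the same value, across runs and across machines, which can be useful when the assignment needs to be repeatable.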