简体   繁体   中英

Pandas: How to assign random number based on unique column values

I have a large dataset with the columns 'group' and 'postcode'. An example of the df is given below:

group   postcode
group_1 WC2E 8BU
group_1 WC2E 8BU
group_1 WC2E 8BU
group_2 WC2E 8BU
group_2 WC2E 8BU
group_2 WC2E 8BU
group_2 WC1A 1DD
group_2 WC1A 1DD
group_2 WC1A 1DD
group_2 WC1A 1DD
1488087 WC1A 1DD
1488087 WC1A 1DD

I am trying to create a new column called 'random_val' to assign a random uniform number to each matching postcode in a unique group, for rows where there are no digits in the 'group' column. My code is shown below:

df.loc[~df['group'].astype(str).str.isdigit(), 'random_val'] = df['postcode'].map(dict(zip(df['postcode'].unique(), np.random.uniform(0, 1, size=len(self.data['postcode'].unique())))))

Currently, this code assigns a unique random number to a unique postcode, regardless of the group it is in:

group   postcode    random_val
group_1 WC2E 8BU    0.210917735
group_1 WC2E 8BU    0.210917735
group_1 WC2E 8BU    0.210917735
group_2 WC2E 8BU    0.210917735
group_2 WC2E 8BU    0.210917735
group_2 WC2E 8BU    0.210917735
group_2 WC1A 1DD    0.55733542
group_2 WC1A 1DD    0.55733542
group_2 WC1A 1DD    0.55733542
group_2 WC1A 1DD    0.55733542
1488087 WC1A 1DD    
1488087 WC1A 1DD

However, I would like the random number to be unique to the postcode and the group:

group   postcode    random_val
group_1 WC2E 8BU    0.210917735
group_1 WC2E 8BU    0.210917735
group_1 WC2E 8BU    0.210917735
group_2 WC2E 8BU    0.494920676
group_2 WC2E 8BU    0.494920676
group_2 WC2E 8BU    0.494920676
group_2 WC1A 1DD    0.55733542
group_2 WC1A 1DD    0.55733542
group_2 WC1A 1DD    0.55733542
group_2 WC1A 1DD    0.55733542
1488087 WC1A 1DD    
1488087 WC1A 1DD    

Struggling to figure out how to do this. Any help appreciated. Thanks

You can take advantage of pandas alignment here.

df.set_index('group',inplace=True)
unique_idx = df.index[~df.index.str.isdigit()].unique()
s = pd.Series(np.random.uniform(0,1,len(unique_idx)) , index =unique_idx)
df['random_value'] = s
df.reset_index()

      group  postcode  random_value
0   group_1  WC2E 8BU      0.232501
1   group_1  WC2E 8BU      0.232501
2   group_1  WC2E 8BU      0.232501
3   group_2  WC2E 8BU      0.242696
4   group_2  WC2E 8BU      0.242696
5   group_2  WC2E 8BU      0.242696
6   group_2  WC1A 1DD      0.242696
7   group_2  WC1A 1DD      0.242696
8   group_2  WC1A 1DD      0.242696
9   group_2  WC1A 1DD      0.242696
10  1488087  WC1A 1DD           NaN
11  1488087  WC1A 1DD           NaN

Here's a solution:

def random_val(x):
    return pd.Series([np.random.uniform(0, 1)] * x.size)

df["dummy"] = 1

df["random_val"] = df.groupby(["group", "postcode"])["dummy"].transform(random_val)
df.loc[df['group'].astype(str).str.isdigit(), "random_val"] = None

The result is:

      group  postcode  dummy  random_val
0   group_1  WC2E 8BU      1    0.781711
1   group_1  WC2E 8BU      1    0.781711
2   group_1  WC2E 8BU      1    0.781711
3   group_2  WC2E 8BU      1    0.107743
4   group_2  WC2E 8BU      1    0.107743
5   group_2  WC2E 8BU      1    0.107743
6   group_2  WC1A 1DD      1    0.103295
7   group_2  WC1A 1DD      1    0.103295
8   group_2  WC1A 1DD      1    0.103295
9   group_2  WC1A 1DD      1    0.103295
10  1488087  WC1A 1DD      1         NaN
11  1488087  WC1A 1DD      1         NaN

Hashing the two columns might be the simplest solution:

df['hash'] = pd.Series((hash(tuple(row)) for _, row in df.iterrows()))

    group   postcode    hash
0   group_1 WC2E 8BU    -8918045538474016779
1   group_1 WC2E 8BU    -8918045538474016779
2   group_1 WC2E 8BU    -8918045538474016779
3   group_2 WC2E 8BU    -6943464964421442707
4   group_2 WC2E 8BU    -6943464964421442707
5   group_2 WC2E 8BU    -6943464964421442707
6   group_2 WC1A 1DD    -357652478068898330
7   group_2 WC1A 1DD    -357652478068898330
8   group_2 WC1A 1DD    -357652478068898330
9   group_2 WC1A 1DD    -357652478068898330
10  1488087 WC1A 1DD    1701757393872926575
11  1488087 WC1A 1DD    1701757393872926575

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM