I have a csv file and I would like to create a 2d histogram where the value in each bin depends on the unique ID. For example (see below), for the range 0<x<1 and 1<y<2, the value is 2 (A, B) not 3 (A, A, B) because A appears twice. Thanks!
ID | x | y |
---|---|---|
A | 0.5 | 1.4 |
A | 0.6 | 1.6 |
A | 1.2 | 2.2 |
B | 0.7 | 1.7 |
C | 4.4 | 3.5 |
C | 3.1 | 3.7 |
A bin i_x < x < j_x, i_y < y < j_y can be uniquely identified by the tuple (i_x, i_y), since this tuple is different for every bin. With unit-width bins, i_x and i_y are simply the floor values of x and y. For example, for the row (x, y) = (0.5, 1.4) the bin is 0 < 0.5 < 1, 1 < 1.4 < 2; here i_x = 0 = floor(0.5) and i_y = 1 = floor(1.4).
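As a quick check of the floor idea, here is a minimal sketch using four of the points from the question's table. The variable names `points` and `bin_ids` are only illustrative, and unit-width bins are assumed.

```python
import numpy as np

# Four (x, y) points taken from the table in the question.
points = np.array([[0.5, 1.4], [0.6, 1.6], [1.2, 2.2], [0.7, 1.7]])

# With unit-width bins, the lower edge (i_x, i_y) of each bin is the
# floor of the corresponding coordinate.
bin_ids = np.floor(points).astype(int)
print(bin_ids.tolist())  # [[0, 1], [0, 1], [1, 2], [0, 1]]
```

Three of the four points land in bin (0, 1), which is why that bin needs de-duplication by ID.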
Approach:
1. Calculate i_x and i_y for the x and y columns.
2. Group by (i_x, i_y) and count the unique IDs in each group.
Code:
>>> df
ID x y
0 A 0.5 1.4
1 A 0.6 1.6
2 A 1.2 2.2
3 B 0.7 1.7
4 C 4.4 3.5
5 C 3.1 3.7
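import numpy as np
# Unit-width bins: the bin index is the floor of each coordinate.
df['bin_x'] = np.floor(df.x).astype(int)
df['bin_y'] = np.floor(df.y).astype(int)
# Named aggregation; the dict-renaming form fails on pandas >= 1.0.
df = (df.groupby(['bin_x', 'bin_y'], as_index=False)
        .agg(cnt=('ID', 'nunique')))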
>>> df
bin_x bin_y cnt
0 0 1 2
1 1 2 1
2 3 3 1
3 4 3 1
If you represent the histogram as a NumPy array of shape (5, 5), you can assign the cnt values to it, using bin_x and bin_y as the row and column indices, to get the desired histogram.
histogram = np.zeros((5, 5))
histogram[df.bin_x, df.bin_y] = df.cnt
>>> histogram
array([[0., 2., 0., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 1., 0.],
[0., 0., 0., 1., 0.]])
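If the bins are not unit-width, the floor trick no longer applies directly. The sketch below is one possible generalization, not part of the original answer: the helper name `unique_id_hist2d` and its `x_edges` / `y_edges` parameters are hypothetical, and it assumes half-open bins [lo, hi) with points outside the edges being ignored.

```python
import numpy as np
import pandas as pd

def unique_id_hist2d(df, x_edges, y_edges):
    """Count distinct IDs per (x, y) bin; edges define half-open bins [lo, hi)."""
    # np.digitize returns 1-based bin numbers; subtract 1 for 0-based indices.
    ix = np.digitize(df["x"].to_numpy(), x_edges) - 1
    iy = np.digitize(df["y"].to_numpy(), y_edges) - 1
    binned = pd.DataFrame({"ID": df["ID"].to_numpy(), "ix": ix, "iy": iy})
    # Keep only points that fall inside the range covered by the edges.
    inside = ((ix >= 0) & (ix < len(x_edges) - 1)
              & (iy >= 0) & (iy < len(y_edges) - 1))
    # One row per (ID, bin), so each ID counts at most once per bin.
    binned = binned[inside].drop_duplicates()
    hist = np.zeros((len(x_edges) - 1, len(y_edges) - 1), dtype=int)
    # np.add.at accumulates correctly when the same bin index repeats.
    np.add.at(hist, (binned["ix"].to_numpy(), binned["iy"].to_numpy()), 1)
    return hist

df = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "C", "C"],
    "x": [0.5, 0.6, 1.2, 0.7, 4.4, 3.1],
    "y": [1.4, 1.6, 2.2, 1.7, 3.5, 3.7],
})
edges = np.arange(0, 6)  # unit-width bins covering [0, 5)
hist = unique_id_hist2d(df, edges, edges)
print(hist)
```

With unit-width edges this reproduces the array shown above (as integers), and passing different edge arrays changes the binning without touching the counting logic.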