I wanted a Python alternative to discretize2d in R. One alternative I found on Stack Overflow is to combine pandas.crosstab and pandas.cut, like so:
pandas.crosstab(pandas.cut(list1,bins=bin1,include_lowest=True),pandas.cut(list2,bins=bin2,include_lowest=True))
The above is quite slow when dealing with a large number of lists (say around 20 million). Is there a faster Python alternative?
I think you are looking for numpy.histogram2d.
Example:
import numpy as np
import pandas as pd

list1 = np.random.normal(2, 1, 100)
list2 = np.random.normal(1, 1, 100)
bin1 = [0, 1, 3, 5]
bin2 = [0, 2, 3, 4, 6]
# H[i, j] counts the points with list1 in x-bin i and list2 in y-bin j
H, xedges, yedges = np.histogram2d(list1, list2, bins=(bin1, bin2))
# Label each bin by its left and right edge
xlabels = [f"{l}-{r}" for l, r in zip(xedges, xedges[1:])]
ylabels = [f"{l}-{r}" for l, r in zip(yedges, yedges[1:])]
df = pd.DataFrame(H, index=xlabels, columns=ylabels)
Output:
>>> df
      0-2   2-3  3-4  4-6
0-1  13.0   2.0  0.0  0.0
1-3  38.0  10.0  1.0  0.0
3-5  10.0   2.0  0.0  0.0
>>> xedges # How to interpret
array([0, 1, 3, 5]) # [0, 1), [1, 3), [3, 5] <- the last bin includes 5
>>> yedges # How to interpret
array([0, 2, 3, 4, 6]) # [0, 2), [2, 3), [3, 4), [4, 6] <- the last bin includes 6
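One caveat worth checking before swapping this in for the pandas version: the interval convention described above is not the one pandas.cut uses by default. With right=True (the default), pandas.cut closes each bin on the right, and include_lowest=True only closes the first bin on the left. A minimal sketch, using made-up values that sit exactly on the bin edges, shows where the two approaches can disagree:

```python
import numpy as np
import pandas as pd

# Made-up values that sit exactly on the bin edges (illustration only).
x = np.array([0, 1, 3, 5])
y = np.array([0, 2, 3, 6])
bin1 = [0, 1, 3, 5]
bin2 = [0, 2, 3, 4, 6]

# numpy convention for x: [0, 1), [1, 3), [3, 5]
H, _, _ = np.histogram2d(x, y, bins=(bin1, bin2))
numpy_x_counts = H.sum(axis=1).astype(int)

# pandas convention for x (right=True, include_lowest=True): [0, 1], (1, 3], (3, 5]
codes = pd.cut(x, bins=bin1, include_lowest=True, labels=False)
pandas_x_counts = np.bincount(codes, minlength=len(bin1) - 1)

print(numpy_x_counts)   # [1 1 2]
print(pandas_x_counts)  # [2 1 1]
```

For continuous data such as the normal samples in the example, values landing exactly on an edge are rare, so the two tables should usually agree. For integer or rounded data the difference can matter.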
Update: helper function
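def labels(edges):
    # Half-open intervals [l, r) for every bin ...
    names = [f"[{l}, {r})" for l, r in zip(edges, edges[1:])]
    # ... except the last one, which numpy closes on the right
    names[-1] = names[-1].replace(')', ']')
    return names

df = pd.DataFrame(H, index=labels(xedges), columns=labels(yedges))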
Usage:
>>> df
        [0, 2)  [2, 3)  [3, 4)  [4, 6]
[0, 1)     9.0     1.0     0.0     0.0
[1, 3)    41.0    17.0     2.0     0.0
[3, 5]    12.0     4.0     0.0     0.0
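If this is needed in more than one place, the pieces can be bundled into a single function. The following is only a sketch: the names discretize2d and interval_labels are hypothetical (chosen here to echo the R function, not taken from any library), and it assumes unweighted counts, so casting the float result of numpy.histogram2d to integers loses nothing.

```python
import numpy as np
import pandas as pd


def interval_labels(edges):
    """Label bins as [l, r), with the last bin closed on the right."""
    names = [f"[{l}, {r})" for l, r in zip(edges, edges[1:])]
    names[-1] = names[-1][:-1] + "]"
    return names


def discretize2d(x, y, bins_x, bins_y):
    """Hypothetical wrapper: 2-D bin counts as a labelled integer DataFrame."""
    H, xedges, yedges = np.histogram2d(x, y, bins=(bins_x, bins_y))
    return pd.DataFrame(
        H.astype(np.int64),  # counts are whole numbers when no weights are used
        index=interval_labels(xedges),
        columns=interval_labels(yedges),
    )


# Example with seeded random data, so reruns give the same table.
rng = np.random.default_rng(0)
x = rng.normal(2, 1, 1_000)
y = rng.normal(1, 1, 1_000)
table = discretize2d(x, y, [0, 1, 3, 5], [0, 2, 3, 4, 6])
print(table)
```

Values that fall outside the outermost edges are not counted, so the table total can be smaller than the number of input points.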