Faster python alternative for discretize2d in R?

I am looking for a Python alternative to discretize2d in R. One option I found on Stack Overflow is to combine pandas.crosstab and pandas.cut, like so:

pandas.crosstab(pandas.cut(list1,bins=bin1,include_lowest=True),pandas.cut(list2,bins=bin2,include_lowest=True))

The above is quite slow when dealing with a large number of lists (say around 20 million). Is there a faster Python alternative?

I think you are looking for numpy.histogram2d:

Example:

import numpy as np
import pandas as pd

# Sample data: 100 normally distributed values per variable
list1 = np.random.normal(2, 1, 100)
list2 = np.random.normal(1, 1, 100)

bin1 = [0, 1, 3, 5]
bin2 = [0, 2, 3, 4, 6]

H, xedges, yedges = np.histogram2d(list1, list2, bins=(bin1, bin2))

xlabels = [f"{l}-{r}" for l, r in zip(xedges, xedges[1:])]
ylabels = [f"{l}-{r}" for l, r in zip(yedges, yedges[1:])]

df = pd.DataFrame(H, index=xlabels, columns=ylabels)

Output:

>>> df
      0-2   2-3  3-4  4-6
0-1  13.0   2.0  0.0  0.0
1-3  38.0  10.0  1.0  0.0
3-5  10.0   2.0  0.0  0.0

>>> xedges  # How to interpret
array([0, 1, 3, 5])  # [0, 1), [1, 3), [3, 5]  <- the last bin includes 5

>>> yedges  # How to interpret
array([0, 2, 3, 4, 6])  # [0, 2), [2, 3), [3, 4), [4, 6]  <- the last bin includes 6
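
To make the edge convention above concrete, here is a minimal sketch with hand-picked values that sit exactly on the bin edges. The arrays `x` and `y` are made up for illustration; only the `np.histogram2d` call comes from the answer.

```python
import numpy as np

# Values placed on the x edges, plus two values outside the range.
x = np.array([0.0, 1.0, 3.0, 5.0, 5.5, -0.1])
y = np.full(x.shape, 0.5)  # every y value falls in the single y bin [0, 1]

H, xedges, yedges = np.histogram2d(x, y, bins=([0, 1, 3, 5], [0, 1]))

# 0.0 -> [0, 1), 1.0 -> [1, 3), 3.0 and 5.0 -> [3, 5];
# 5.5 and -0.1 lie outside the edges and are silently dropped.
print(H[:, 0])  # [1. 1. 2.]
```

Note that out-of-range values are not reported anywhere, so `H.sum()` can be smaller than the input length.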

Update: helper function

def labels(edges):
    """Build interval labels; only the last bin is closed on the right."""
    names = [f"[{l}, {r})" for l, r in zip(edges, edges[1:])]
    names[-1] = names[-1].replace(')', ']')
    return names

df = pd.DataFrame(H, index=labels(xedges), columns=labels(yedges))

Usage:

>>> df
        [0, 2)  [2, 3)  [3, 4)  [4, 6]
[0, 1)     9.0     1.0     0.0     0.0
[1, 3)    41.0    17.0     2.0     0.0
[3, 5]    12.0     4.0     0.0     0.0
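
One caveat when swapping the pandas.cut version from the question for numpy.histogram2d: the two do not treat values on a bin edge the same way. The sketch below is only an illustration with made-up values; it uses the one-dimensional `np.histogram`, which follows the same edge convention as `np.histogram2d`.

```python
import numpy as np
import pandas as pd

bins = [0, 1, 3, 5]
values = np.array([1.0, 5.0])  # both sit exactly on a bin edge

# pandas.cut as used in the question: bins are (0, 1], (1, 3], (3, 5],
# and include_lowest=True additionally admits 0 into the first bin.
cut_codes = pd.cut(values, bins=bins, include_lowest=True).codes
print(cut_codes.tolist())       # [0, 2]

# pandas.cut with right=False: bins are [0, 1), [1, 3), [3, 5).
# 5.0 falls outside every bin and gets code -1 (NaN).
cut_left_codes = pd.cut(values, bins=bins, right=False).codes
print(cut_left_codes.tolist())  # [1, -1]

# NumPy: bins are [0, 1), [1, 3), [3, 5], so 1.0 lands in the second bin
# and 5.0 is still counted in the last one.
counts, _ = np.histogram(values, bins=bins)
print(counts.tolist())          # [0, 1, 1]
```

For continuous data such as the random samples above, exact ties with an edge are rare, so the two approaches should normally give the same table; with integer or rounded data the counts can differ.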

