Subsetting numpy array by hour and day of week

I have a numpy array containing millions of hourly xy points with the "columns" of the array being x, y, hour, and day of week (all ints). Here is an example of what the array looks like:

array([[1, 2, 0, 0],
       [3, 5, 0, 0],
       [6, 3, 1, 0],
       [6, 2, 3, 0],
       [4, 3, 3, 1]])

I have created a grid of zeros that I can increment for all values in the array:

import numpy as np

grid = np.zeros((8, 8))
# Count one hit per row; the first two columns of xy_new are x and y.
for x, y in xy_new[:, :2]:
    grid[y, x] += 1

but I need to be able to do this for each hour by day of week (i.e. Sun at hour 0, Sun at hour 1, etc.).

How do I subset the array by hour and day of week?

I have attempted modifying the answers in "Make subset of array, based on values of two other arrays in Python" and "Subsetting data in Python", but have not been successful. Any help would be greatly appreciated!

Presumably you want to wind up with 24 times 7, or 168, sets of accumulated counts for pairs of x and y. Suppose you have your data in an N by 4 array gdat. First, make a week-hour index that runs from 0 (hour 0 of day 0) to 167 (hour 23 of day 6):

whr = 24*gdat[:,3] + gdat[:,2]   # 24 * day of week (column 3) + hour (column 2)

You can now select the gdat rows for each hour in your week. For example, for hour zero of Sunday:

gdat0 = gdat[whr == 0]

Do whatever summing you need with gdat0 and move on to the next hour.
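
A minimal sketch of that loop, assuming the column order from the question (x, y, hour, day of week), an 8 by 8 grid as in the question, and that day 0 is Sunday. The name grids is made up for illustration:

```python
import numpy as np

# Sample rows from the question: columns are x, y, hour, day of week.
gdat = np.array([[1, 2, 0, 0],
                 [3, 5, 0, 0],
                 [6, 3, 1, 0],
                 [6, 2, 3, 0],
                 [4, 3, 3, 1]])

# Week-hour index: 0 is hour 0 of day 0, 167 is hour 23 of day 6.
whr = 24 * gdat[:, 3] + gdat[:, 2]

# One 8x8 count grid per hour of the week (the grid size is an assumption).
grids = np.zeros((168, 8, 8), dtype=int)
for h in range(168):
    gdat0 = gdat[whr == h]  # rows that fall in this week-hour
    # np.add.at accumulates correctly even when the same (y, x) pair repeats.
    np.add.at(grids[h], (gdat0[:, 1], gdat0[:, 0]), 1)
```

Under the same assumptions, the loop can probably be collapsed into a single call, np.add.at(grids, (whr, gdat[:, 1], gdat[:, 0]), 1), which indexes all rows at once instead of masking 168 times.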

Note that np.unique is probably a faster way to count occurrences of x, y pairs. You can play the same game of making a composite index for x and y, but you have to know how they are bounded. Supposing x runs from 0 to 120 and y runs from 0 to 5, you could make a composite index using bit fields, with y in the low 3 bits and x in the bits above them:

xy = (gdat0[:,0] << 3) | gdat0[:,1]   # bitwise OR combines the two fields

Obviously, if y has a larger range you need to shift more than 3 bits, and you may need to offset x and y to avoid negative values.

Then, use np.unique to return the unique values in xy and the count for each:

xyval, xycnt = np.unique(xy, return_counts=True)

You then retrieve the x and y value pairs from xyval using bitwise operators: xyval >> 3 gives x and xyval & 7 gives y.
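
Here is a small sketch of the pack, count, and unpack steps together. It assumes y fits in 3 bits (0 to 7) and that both x and y are non-negative; the rows in gdat0 are made up so that one pair repeats:

```python
import numpy as np

# Hypothetical rows for one week-hour: columns are x, y, hour, day of week.
gdat0 = np.array([[1, 2, 0, 0],
                  [3, 5, 0, 0],
                  [1, 2, 0, 0]])

# Pack x into the high bits and y into the low 3 bits (assumes 0 <= y <= 7).
xy = (gdat0[:, 0] << 3) | gdat0[:, 1]

# Sorted unique packed values and how often each one occurs.
xyval, xycnt = np.unique(xy, return_counts=True)

# Unpack the x and y values again.
x = xyval >> 3
y = xyval & 7

for xi, yi, ci in zip(x, y, xycnt):
    print(f"x={xi} y={yi} count={ci}")
```

For this input the pair (1, 2) is counted twice and (3, 5) once.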

Repeat for every hour in the week. Since storage will be an issue if N is huge, you probably want to re-use gdat0 on each iteration.

EDIT: The short data sample you posted is time-sequential. If all your data are time-sequential, you don't need to "select" for each hour. All you need is the index of the first row for each new value in whr. np.unique(whr, return_index=True) will find those for you as well!
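
A rough sketch of that idea, assuming the rows are already ordered so that whr never decreases (for example, a single week of data). If the data span several weeks, the same week-hour shows up in more than one run and this shortcut no longer applies as is. The names starts, ends, and sizes are illustrative only:

```python
import numpy as np

# Sample rows from the question: columns are x, y, hour, day of week.
gdat = np.array([[1, 2, 0, 0],
                 [3, 5, 0, 0],
                 [6, 3, 1, 0],
                 [6, 2, 3, 0],
                 [4, 3, 3, 1]])
whr = 24 * gdat[:, 3] + gdat[:, 2]

# First row index of each week-hour that occurs in the data.
hours, starts = np.unique(whr, return_index=True)
# Each block ends where the next one starts; the last block ends at N.
ends = np.append(starts[1:], len(whr))

sizes = {}
for h, lo, hi in zip(hours, starts, ends):
    block = gdat[lo:hi]  # a slice is a view, so no rows are copied
    sizes[int(h)] = len(block)
```

Because each block is a plain slice, this avoids building a boolean mask over all N rows for every hour.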
