Starting off with a structured numpy array that has 4 fields, I am trying to return an array with just the latest dates, by ID, containing the same 4 fields. I found a solution using itertools.groupby that almost works here: Numpy Mean Structured Array
The problem is I don't understand how to adapt this when you have 4 fields instead of 2. I want to get the whole 'row' back, but only the rows for the latest dates for each ID. I understand that this kind of thing is simpler using pandas, but this is just a small piece of a larger process, and I can't add pandas as a dependency.
data = np.array([('2005-02-01', 1, 3, 8),
                 ('2005-02-02', 1, 4, 9),
                 ('2005-02-01', 2, 5, 10),
                 ('2005-02-02', 2, 6, 11),
                 ('2005-02-03', 2, 7, 12)],
                dtype=[('dt', 'datetime64[D]'), ('ID', '<i4'), ('f3', '<i4'),
                       ('f4', '<i4')])
For this example array, my desired output would be:
np.array([(datetime.date(2005, 2, 2), 1, 4, 9),
(datetime.date(2005, 2, 3), 2, 7, 12)],
dtype=[('dt', '<M8[D]'), ('ID', '<i4'), ('f3', '<i4'), ('f4', '<i4')])
This is what I've tried:
latest = np.array([(k, np.array(list(g), dtype=data.dtype).view(np.recarray)
['dt'].argmax()) for k, g in
groupby(np.sort(data, order='ID').view(np.recarray),
itemgetter('ID'))], dtype=data.dtype)
I get this error:
ValueError: size of tuple must match number of fields.
I think this is because the tuple has 2 fields but the array has 4. When I drop 'f3' and 'f4' from the array it works correctly.
How can I get it to return all 4 fields?
Let's figure out where your error is by peeling off one layer:
In [38]: from operator import itemgetter
In [39]: from itertools import groupby
In [41]: [(k, np.array(list(g), dtype=data.dtype).view(np.recarray)
['dt'].argmax()) for k, g in
groupby(np.sort(data, order='ID').view(np.recarray),
itemgetter('ID'))]
Out[41]: [(1, 1), (2, 2)]
What is this list of tuples supposed to represent? It clearly isn't rows from data. And since each tuple has only 2 items, it can't be mapped onto an array with data.dtype, which has 4 fields. Hence the ValueError.
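The mismatch can be reproduced in isolation. This is only a minimal sketch: the name row_dtype is introduced here for illustration, and the exact wording of the error message may differ between NumPy versions.

```python
import numpy as np

# Same 4-field layout as data.dtype in the question (row_dtype is an illustrative name).
row_dtype = np.dtype([('dt', 'datetime64[D]'), ('ID', '<i4'),
                      ('f3', '<i4'), ('f4', '<i4')])

# 2-item tuples cannot fill a 4-field record, so this raises ValueError.
try:
    np.array([(1, 1), (2, 2)], dtype=row_dtype)
    raised = False
except ValueError:
    raised = True

# Tuples with one value per field are accepted.
ok = np.array([('2005-02-02', 1, 4, 9)], dtype=row_dtype)
```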
After playing around with this a bit, I think [(1, 1), (2, 2)] means: for ID==1, use item [1] from the group; for ID==2, use item [2] from the group. That would select:
[(datetime.date(2005, 2, 2), 1, 4, 9),
(datetime.date(2005, 2, 3), 2, 7, 12)]
You have found the maximum dates, but you have to translate those to either indexes in data, or select those items from the groups.
In [91]: groups = groupby(np.sort(data, order='ID'), itemgetter('ID'))  # don't need recarray
In [92]: G = [(k,list(g)) for k,g in groups]
In [93]: G
Out[93]:
[(1,
[(datetime.date(2005, 2, 1), 1, 3, 8),
(datetime.date(2005, 2, 2), 1, 4, 9)]),
(2,
[(datetime.date(2005, 2, 1), 2, 5, 10),
(datetime.date(2005, 2, 2), 2, 6, 11),
(datetime.date(2005, 2, 3), 2, 7, 12)])]
In [107]: I=[(1,1), (2,2)]
In [108]: [g[1][i[1]] for g,i in zip(G,I)]
Out[108]: [(datetime.date(2005, 2, 2), 1, 4, 9), (datetime.date(2005, 2, 3), 2, 7, 12)]
OK, this selection from G is clumsy, but it is a start.
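The other route mentioned earlier, translating the group maxima into indexes, might look like the sketch below. It groups row positions of the sorted array rather than the records themselves, so the result can be used for fancy indexing. The names sorted_data, ids, dates and rows are assumptions made for this example, not part of the original code.

```python
import numpy as np
from itertools import groupby

data = np.array([('2005-02-01', 1, 3, 8),
                 ('2005-02-02', 1, 4, 9),
                 ('2005-02-01', 2, 5, 10),
                 ('2005-02-02', 2, 6, 11),
                 ('2005-02-03', 2, 7, 12)],
                dtype=[('dt', 'datetime64[D]'), ('ID', '<i4'),
                       ('f3', '<i4'), ('f4', '<i4')])

sorted_data = np.sort(data, order='ID')
ids = sorted_data['ID']
dates = sorted_data['dt']

# Group row positions by ID, then keep the position with the latest date in each group.
rows = [max(g, key=lambda i: dates[i])
        for k, g in groupby(range(len(sorted_data)), key=lambda i: ids[i])]

# Indexing with a list of positions returns whole records with all 4 fields.
latest = sorted_data[rows]
```

Note that rows indexes sorted_data, not the original data; the two only coincide when data is already sorted by ID.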
If I define a simple function to pull the record with the latest date from a group, the processing is a lot simpler.
def maxdate_record(agroup):
    an_array = np.array(list(agroup))
    i = np.argmax(an_array['dt'])
    return an_array[i]
groups = groupby(np.sort(data, order='ID'),itemgetter('ID'))
np.array([maxdate_record(g) for k,g in groups])
producing:
array([(datetime.date(2005, 2, 2), 1, 4, 9),
(datetime.date(2005, 2, 3), 2, 7, 12)],
dtype=[('dt', '<M8[D]'), ('ID', '<i4'), ('f3', '<i4'), ('f4', '<i4')])
I don't need to specify dtype when I convert a list of records to an array, since the records have their own dtype.
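For comparison, here is a sketch of a variant that avoids groupby altogether. This is a different technique from the one above, not something taken from the linked solution: sort by ID and then by date, and keep the last row of each run of equal IDs. The names s and is_last are illustrative.

```python
import numpy as np

data = np.array([('2005-02-01', 1, 3, 8),
                 ('2005-02-02', 1, 4, 9),
                 ('2005-02-01', 2, 5, 10),
                 ('2005-02-02', 2, 6, 11),
                 ('2005-02-03', 2, 7, 12)],
                dtype=[('dt', 'datetime64[D]'), ('ID', '<i4'),
                       ('f3', '<i4'), ('f4', '<i4')])

# Sort by ID first, then by date within each ID.
s = np.sort(data, order=['ID', 'dt'])

# A row is the last of its ID run if the next row has a different ID;
# the final row of the array is always the last of its run.
is_last = np.ones(len(s), dtype=bool)
is_last[:-1] = s['ID'][1:] != s['ID'][:-1]

latest = s[is_last]
```

If an ID has several rows with the same latest date, this keeps one of them, chosen by the tie-breaking of the sort on the remaining fields.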