I have data that looks something like this:
df
Out[10]:
   ID1  ID2  Price       Date
0   11   21  10.99  3/15/2016
1   11   22  11.99  3/15/2016
2   12   23      5  3/15/2016
3   11   21  10.99  3/16/2016
4   11   22  12.99  3/16/2016
5   11   21  10.99  3/17/2016
6   11   22  11.99  3/17/2016
The goal is to assign a unique ID to each group of ID1 rows that has a particular set of prices for its ID2's, like so:
# Desired Result
df
Out[14]:
   ID1  ID2  Price       Date  UID
0   11   21  10.99  3/15/2016    1
1   11   22  11.99  3/15/2016    1
2   12   23      5  3/15/2016    7
3   11   21  10.99  3/16/2016    5
4   11   22  12.99  3/16/2016    5
5   11   21  10.99  3/17/2016    1
6   11   22  11.99  3/17/2016    1
Speed is an issue because of the size of the data. The best approach I've come up with is below, but it is still a fair amount slower than I'd like. If anyone has a way that should be naturally faster, I'd love to hear it. Or perhaps there is an easy way to do the within-group operations in parallel to speed things up?
My method basically concatenates the IDs and prices (after padding with zeros to ensure equal lengths) and then takes ranks to simplify the final ID. The bottleneck is the within-group concatenation done with .transform(np.sum).
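For reference, a minimal sketch of the padding step mentioned above; the widths here are illustrative, and it assumes the columns are cast to strings first:
# hypothetical padding step; widths chosen for illustration only
df['ID2'] = df['ID2'].astype(str).str.zfill(2)
df['Price'] = df['Price'].astype(str).str.zfill(7)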
# concatenate ID2 and Price (string columns, so + concatenates)
df['ID23'] = df['ID2'] + df['Price']
df
Out[12]:
   ID1  ID2  Price       Date     ID23
0   11   21  10.99  3/15/2016  2110.99
1   11   22  11.99  3/15/2016  2211.99
2   12   23      5  3/15/2016      235
3   11   21  10.99  3/16/2016  2110.99
4   11   22  12.99  3/16/2016  2212.99
5   11   21  10.99  3/17/2016  2110.99
6   11   22  11.99  3/17/2016  2211.99
# group by ID1 and Date, then concatenate the ID23's within each group
import numpy as np
grouped = df.groupby(['ID1','Date'])
df['summed'] = grouped['ID23'].transform(np.sum)  # string sum == concatenation
df
Out[16]:
   ID1  ID2    Price       Date      ID23            summed                UID
0    6    3  0010.99  3/15/2016  30010.99  30010.9960011.99  630010.9960011.99
1    6    6  0011.99  3/15/2016  60011.99  30010.9960011.99  630010.9960011.99
2    7    7  0000005  3/15/2016  70000005          70000005          770000005
3    6    3  0010.99  3/16/2016  30010.99  30010.9960012.99  630010.9960012.99
4    6    6  0012.99  3/16/2016  60012.99  30010.9960012.99  630010.9960012.99
5    6    3  0010.99  3/17/2016  30010.99  30010.9960011.99  630010.9960011.99
6    6    6  0011.99  3/17/2016  60011.99  30010.9960011.99  630010.9960011.99
(This output is from the real, already-padded data, so the values differ from the toy example above.)
# concatenate ID1 on the front and take ranks to get simpler IDs
df['UID'] = df['ID1'] + df['summed']
df['UID'] = df['UID'].rank(method='min')  # identical strings share a rank, hence a UID
# Drop unnecessary columns
df.drop(['ID23','summed'], axis=1, inplace=True)
UPDATE:
To clarify, consider the original data grouped as follows:
grouped = df.groupby(['ID1','Date'])
for name, group in grouped:
    print(group)
   ID1  ID2  Price       Date
0   11   21  10.99  3/15/2016
1   11   22  11.99  3/15/2016

   ID1  ID2  Price       Date
3   11   21  10.99  3/16/2016
4   11   22  12.99  3/16/2016

   ID1  ID2  Price       Date
5   11   21  10.99  3/17/2016
6   11   22  11.99  3/17/2016

   ID1  ID2  Price       Date
2   12   23      5  3/15/2016
UIDs should be assigned at the group level and should match whenever everything about a group is identical once the date is ignored. In this case the first and third printed groups are the same, so rows 0, 1, 5, and 6 should all get the same UID. Rows 3 and 4 belong to a different group because a price changed, and therefore need a different UID. Row 2 forms yet another group.
A slightly different way of looking at this problem: I want to group as I have here, drop the Date column (which was needed to form the groups initially), and then merge groups that are equal once the dates are removed.
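To make that criterion concrete, here is a minimal sketch (not my timed solution above) that builds an order-independent key per (ID1, Date) group and factorizes it; group_key is a name introduced purely for illustration:
def group_key(g):
    # order-independent fingerprint of one (ID1, Date) group, ignoring the date
    return (g['ID1'].iat[0], frozenset(zip(g['ID2'], g['Price'])))

keys = df.groupby(['ID1', 'Date']).apply(group_key)
uid_map = dict(zip(keys.index, pd.factorize(keys)[0]))  # key -> small int
df['UID'] = [uid_map[k] for k in zip(df['ID1'], df['Date'])]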
Edit: The code below is actually slower than OP's solution. I'm leaving it as it is for now in case someone uses it to write a better solution.
For visualization, I'll be using the following data:
df
Out[421]:
    ID1  ID2  Price       Date
0    11   21  10.99  3/15/2016
1    11   22  11.99  3/15/2016
2    12   23   5.00  3/15/2016
3    11   21  10.99  3/16/2016
4    11   22  12.99  3/16/2016
5    11   21  10.99  3/17/2016
6    11   22  11.99  3/17/2016
7    11   22  11.99  3/18/2016
8    11   21  10.99  3/18/2016
9    12   22  11.99  3/18/2016
10   12   21  10.99  3/18/2016
11   12   23   5.00  3/19/2016
12   12   23   5.00  3/19/2016
First, let's group the data by 'ID1' and 'Date' and aggregate each remaining column into a sorted tuple. I also reset the index, so there is a new column named 'index'.
gr = df.reset_index().groupby(['ID1','Date'], as_index=False)
df1 = gr.agg(lambda x: tuple(sorted(x)))  # each column collapses to a sorted tuple
df1
Out[425]:
   ID1       Date     index       ID2           Price
0   11  3/15/2016    (0, 1)  (21, 22)  (10.99, 11.99)
1   11  3/16/2016    (3, 4)  (21, 22)  (10.99, 12.99)
2   11  3/17/2016    (5, 6)  (21, 22)  (10.99, 11.99)
3   11  3/18/2016    (7, 8)  (21, 22)  (10.99, 11.99)
4   12  3/15/2016      (2,)     (23,)          (5.0,)
5   12  3/18/2016   (9, 10)  (21, 22)  (10.99, 11.99)
6   12  3/19/2016  (11, 12)  (23, 23)      (5.0, 5.0)
After all the grouping is done, I'll use the indices stored in the 'index' column to access rows of df (they'd better be unique). (Notice also that df1.index and df1['index'] are completely different things.)
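An optional sanity check on that uniqueness assumption might look like this (a sketch):
# every original row index should appear exactly once across the tuples
all_idx = [i for t in df1['index'] for i in t]
assert len(all_idx) == len(set(all_idx)) == len(df)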
Now, let's group again, this time skipping the dates, and collect the 'index' tuples:
# summing tuples concatenates them, collecting every original row index per group
df2 = df1.groupby(['ID1','ID2','Price'], as_index=False)['index'].sum()
df2
Out[427]:
   ID1       ID2           Price               index
0   11  (21, 22)  (10.99, 11.99)  (0, 1, 5, 6, 7, 8)
1   11  (21, 22)  (10.99, 12.99)              (3, 4)
2   12  (21, 22)  (10.99, 11.99)             (9, 10)
3   12     (23,)          (5.0,)                (2,)
4   12  (23, 23)      (5.0, 5.0)            (11, 12)
I believe this is the grouping needed for the problem, so we can now add labels to df, for example like this:
df['GID'] = -1                        # placeholder label
for i, t in enumerate(df2['index']):
    df.loc[t, 'GID'] = i              # label every row of group i at once
df
Out[430]:
    ID1  ID2  Price       Date  GID
0    11   21  10.99  3/15/2016    0
1    11   22  11.99  3/15/2016    0
2    12   23   5.00  3/15/2016    3
3    11   21  10.99  3/16/2016    1
4    11   22  12.99  3/16/2016    1
5    11   21  10.99  3/17/2016    0
6    11   22  11.99  3/17/2016    0
7    11   22  11.99  3/18/2016    0
8    11   21  10.99  3/18/2016    0
9    12   22  11.99  3/18/2016    2
10   12   21  10.99  3/18/2016    2
11   12   23   5.00  3/19/2016    4
12   12   23   5.00  3/19/2016    4
Or in a possibly faster but tricky way:
# EXPERIMENTAL CODE!
# expand each tuple of indices into one row per element; 'level_0' holds the
# df2 row number (the group label) and column 0 holds the original row index
df3 = df2['index'].apply(pd.Series).stack().reset_index()
df3.index = df3[0].astype(int)  # re-index by the original row indices
df['GID'] = df3['level_0']      # assignment aligns on that index
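On newer pandas versions (0.25+), the same unpacking could likely be written more directly with explode; this is an untested sketch, not part of the original answer:
# explode turns each tuple into one row per element; the resulting Series
# index is the df2 row number, i.e. the group label
s = df2['index'].explode()
df['GID'] = pd.Series(s.index, index=s.values.astype(int))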