
Python pandas:Fast way to create a unique identifier for groups

I have data that looks something like this:

df
Out[10]: 
  ID1 ID2  Price       Date
0  11  21  10.99  3/15/2016
1  11  22  11.99  3/15/2016
2  12  23      5  3/15/2016
3  11  21  10.99  3/16/2016
4  11  22  12.99  3/16/2016
5  11  21  10.99  3/17/2016
6  11  22  11.99  3/17/2016

The goal is to get a unique ID for each group of ID1 with particular prices for each of its ID2's, like so:

# Desired Result
df
Out[14]: 
  ID1 ID2  Price       Date  UID
0  11  21  10.99  3/15/2016    1
1  11  22  11.99  3/15/2016    1
2  12  23      5  3/15/2016    7
3  11  21  10.99  3/16/2016    5
4  11  22  12.99  3/16/2016    5
5  11  21  10.99  3/17/2016    1
6  11  22  11.99  3/17/2016    1

Speed is an issue because of the size of the data. The best way I've come up with is below, but it is still a fair amount slower than desirable. If anyone has an approach that should be naturally faster, I'd love to hear it. Or perhaps there is an easy way to do the within-group operations in parallel to speed things up?

My method basically concatenates the IDs and prices (after padding with zeros to ensure equal lengths) and then takes ranks to simplify the final ID. The bottleneck is the within-group concatenation done with .transform(np.sum).
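The padding step itself isn't shown in the snippets below. A minimal sketch of what it could look like, assuming the columns start out numeric (hypothetical code, not from the original post):

# Cast each column to strings and left-pad with zeros to a common width
for col in ['ID1', 'ID2', 'Price']:
    s = df[col].astype(str)
    df[col] = s.str.zfill(s.str.len().max())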

# concatenate ID2 and Price (string concatenation, so both columns
# must already hold strings)
df['ID23'] = df['ID2'] + df['Price']

df
Out[12]: 
  ID1 ID2  Price       Date     ID23
0  11  21  10.99  3/15/2016  2110.99
1  11  22  11.99  3/15/2016  2211.99
2  12  23      5  3/15/2016      235
3  11  21  10.99  3/16/2016  2110.99
4  11  22  12.99  3/16/2016  2212.99
5  11  21  10.99  3/17/2016  2110.99
6  11  22  11.99  3/17/2016  2211.99


# group by ID1 and Date, then concatenate the ID23 strings within each
# group (np.sum applied to strings concatenates them)
grouped = df.groupby(['ID1','Date'])
df['summed'] = grouped['ID23'].transform(np.sum)

df
Out[16]: 
  ID1 ID2    Price       Date      ID23            summed                UID
0   6   3  0010.99  3/15/2016  30010.99  30010.9960011.99  630010.9960011.99
1   6   6  0011.99  3/15/2016  60011.99  30010.9960011.99  630010.9960011.99
2   7   7  0000005  3/15/2016  70000005          70000005          770000005
3   6   3  0010.99  3/16/2016  30010.99  30010.9960012.99  630010.9960012.99
4   6   6  0012.99  3/16/2016  60012.99  30010.9960012.99  630010.9960012.99
5   6   3  0010.99  3/17/2016  30010.99  30010.9960011.99  630010.9960011.99
6   6   6  0011.99  3/17/2016  60011.99  30010.9960011.99  630010.9960011.99

# Concatenate ID1 on the front and take ranks to get simpler IDs
df['UID'] = df['ID1'] + df['summed']
df['UID'] = df['UID'].rank(method='min')

# Drop unnecessary columns
df.drop(['ID23','summed'], axis=1, inplace=True)
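
For intuition on the last step: rank(method='min') gives every duplicate value the same (minimum) rank, which is what collapses equal concatenated strings into equal IDs. A tiny standalone illustration (not from the original post):

import pandas as pd

s = pd.Series(['b', 'a', 'b', 'c'])
print(s.rank(method='min'))
# 0    2.0
# 1    1.0
# 2    2.0
# 3    4.0
# dtype: float64  <- duplicates share an ID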

UPDATE:

To clarify, consider the original data grouped as follows:

grouped = df.groupby(['ID1','Date'])
for name, group in grouped:
    print(group)

  ID1 ID2  Price       Date
0  11  21  10.99  3/15/2016
1  11  22  11.99  3/15/2016

  ID1 ID2  Price       Date
3  11  21  10.99  3/16/2016
4  11  22  12.99  3/16/2016

  ID1 ID2  Price       Date
5  11  21  10.99  3/17/2016
6  11  22  11.99  3/17/2016

  ID1 ID2 Price       Date
2  12  23     5  3/15/2016

UIDs should be assigned at the group level and match whenever everything about a group is identical, ignoring the date. In this case the first and third printed groups are the same, so rows 0, 1, 5, and 6 should all get the same UID. Rows 3 and 4 belong to a different group because a price changed, and therefore need a different UID. Row 2 is a different group as well.

A slightly different way of looking at this problem: I want to group as I have here, drop the date column (which was only needed to form the groups initially), and then merge groups that are identical once the dates are removed.
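One way to make that reformulation concrete is to build a hashable fingerprint per (ID1, Date) group and factorize it, so identical groups get identical integers regardless of date. This is only a sketch, not the method used above, and the names fp and uids are mine:

import pandas as pd

# Fingerprint each (ID1, Date) group by ID1 plus its sorted (ID2, Price) pairs
fp = df.groupby(['ID1', 'Date']).apply(
    lambda g: str((g.name[0], sorted(zip(g['ID2'], g['Price'])))))
# factorize assigns the same integer to identical fingerprints
uids = pd.Series(pd.factorize(fp)[0], index=fp.index, name='UID')
# Broadcast the group-level UID back onto the rows
df = df.merge(uids.reset_index(), on=['ID1', 'Date'])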

Edit: The code below is actually slower than OP's solution. I'm leaving it as it is for now in case someone uses it to write a better solution.


For visualization, I'll be using the following data:

df
Out[421]: 
    ID1  ID2  Price       Date
0    11   21  10.99  3/15/2016
1    11   22  11.99  3/15/2016
2    12   23   5.00  3/15/2016
3    11   21  10.99  3/16/2016
4    11   22  12.99  3/16/2016
5    11   21  10.99  3/17/2016
6    11   22  11.99  3/17/2016
7    11   22  11.99  3/18/2016
8    11   21  10.99  3/18/2016
9    12   22  11.99  3/18/2016
10   12   21  10.99  3/18/2016
11   12   23   5.00  3/19/2016
12   12   23   5.00  3/19/2016

First, let's group it by 'ID1' and 'Date' and aggregate the results as (sorted) tuples. I also reset the index, so there is a new column named 'index'.

gr = df.reset_index().groupby(['ID1','Date'], as_index=False)
df1 = gr.agg(lambda x: tuple(sorted(x)))
df1
Out[425]: 
   ID1       Date     index       ID2           Price
0   11  3/15/2016    (0, 1)  (21, 22)  (10.99, 11.99)
1   11  3/16/2016    (3, 4)  (21, 22)  (10.99, 12.99)
2   11  3/17/2016    (5, 6)  (21, 22)  (10.99, 11.99)
3   11  3/18/2016    (7, 8)  (21, 22)  (10.99, 11.99)
4   12  3/15/2016      (2,)     (23,)          (5.0,)
5   12  3/18/2016   (9, 10)  (21, 22)  (10.99, 11.99)
6   12  3/19/2016  (11, 12)  (23, 23)      (5.0, 5.0)

After all the grouping is done, I'll use the indices from column 'index' to access rows of df (they must be unique for this to work). (Notice also that df1.index and df1['index'] are completely different things.)

Now, let's group df1 by everything except the date and concatenate the 'index' tuples:

df2 = df1.groupby(['ID1','ID2','Price'], as_index=False)['index'].sum()
df2
Out[427]: 
   ID1       ID2           Price               index
0   11  (21, 22)  (10.99, 11.99)  (0, 1, 5, 6, 7, 8)
1   11  (21, 22)  (10.99, 12.99)              (3, 4)
2   12  (21, 22)  (10.99, 11.99)             (9, 10)
3   12     (23,)          (5.0,)                (2,)
4   12  (23, 23)      (5.0, 5.0)            (11, 12)

I believe this is the grouping needed for the problem, so we can now add labels to df, for example like this:

df['GID'] = -1
for i, t in enumerate(df2['index']):
    # pass a list: .loc would treat a bare tuple as a single label
    df.loc[list(t), 'GID'] = i

df
Out[430]: 
    ID1  ID2  Price       Date  GID
0    11   21  10.99  3/15/2016    0
1    11   22  11.99  3/15/2016    0
2    12   23   5.00  3/15/2016    3
3    11   21  10.99  3/16/2016    1
4    11   22  12.99  3/16/2016    1
5    11   21  10.99  3/17/2016    0
6    11   22  11.99  3/17/2016    0
7    11   22  11.99  3/18/2016    0
8    11   21  10.99  3/18/2016    0
9    12   22  11.99  3/18/2016    2
10   12   21  10.99  3/18/2016    2
11   12   23   5.00  3/19/2016    4
12   12   23   5.00  3/19/2016    4

Or in a possibly faster but tricky way:

# EXPERIMENTAL CODE!
# Expand each tuple of indices into its own column, then stack into one
# long Series; 'level_0' ends up holding the df2 row (group) number
df3 = df2['index'].apply(pd.Series).stack().reset_index()
# Re-index by the original row numbers so the assignment aligns with df
df3.index = df3[0].astype(int)
df['GID'] = df3['level_0']
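
On newer pandas (0.25 and later), Series.explode does the same tuple expansion more directly. A sketch of that variant (not part of the original answer):

# One row per original df index; the exploded Series' index is the
# df2 row number, i.e. the group label
exploded = df2['index'].explode()
df['GID'] = pd.Series(dict(zip(exploded, exploded.index)))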
