
Pandas: left merge on two DF using columns, but take only the last (or mean) of another column

I have two DFs as follows:

| shopID | itemID |
|--------|--------|
|      2 |     30 |
|      2 |     31 |
|      2 |     32 |
|      2 |     33 |
|      2 |     38 |

| date | shopID | itemID | price  | cnt |
|------|--------|--------|--------|-----|
|  0.0 |    2.0 |   33.0 |  499.0 | 1.0 |
|  0.0 |    2.0 |  482.0 | 3300.0 | 1.0 |
|  0.0 |    2.0 |  491.0 |  600.0 | 1.0 |
|  0.0 |    2.0 |  839.0 | 3300.0 | 1.0 |
|  0.0 |    2.0 | 1007.0 |  449.0 | 3.0 |
...

The second one is a time-series DF in which date is the month index (for simplicity, it starts at 0 and ends at 33). A given combination of shopID and itemID is not guaranteed to appear in both DFs. I want to left-merge DF1 with DF2 on shopID and itemID, so I did:

pd.merge(df1, df2, how="left", on=["shopID", "itemID"])

As expected, this gives me the following DF, with one row per matching month:

| shopID | itemID | date | price  | cnt |
|--------|--------|------|--------|-----|
|      2 |     30 |  2.0 | 359.00 | 1.0 |
|      2 |     30 |  5.0 | 399.00 | 1.0 |
|      2 |     30 | 15.0 | 169.00 | 1.0 |
|      2 |     30 | 16.0 | 169.00 | 1.0 |
|      2 |     31 |  1.0 | 699.00 | 4.0 |
|      2 |     31 |  2.0 | 698.50 | 1.0 |
|      2 |     31 |  3.0 | 699.00 | 1.0 |
|      2 |     31 | 16.0 | 415.92 | 1.0 |
|      2 |     31 | 33.0 | 399.00 | 1.0 |
|      2 |     32 | 12.0 | 119.00 | 1.0 |
...

My question is: I want to merge them but keep only the latest price for each shopID-itemID combination (the row where date is largest), or alternatively the mean price. How can I do this?

EDIT: Expected output (last month only)

| shopID | itemID | date | price  | cnt |
|--------|--------|------|--------|-----|
|      2 |     30 | 16.0 |  169.0 | 1.0 |
|      2 |     31 | 33.0 | 399.00 | 1.0 |
|      2 |     32 | 31.0 | 149.00 | 1.0 |
...

It is hard to answer without more information. Is this just the row with the maximum date for each itemID? If so, you can sort by date and then use drop_duplicates, like so:

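import pandas as pd

# Sample data copied from the merged frame shown in the question
df = pd.DataFrame({'shopID': [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
 'itemID': [30, 30, 30, 30, 31, 31, 31, 31, 31, 32],
 'date': [2.0, 5.0, 15.0, 16.0, 1.0, 2.0, 3.0, 16.0, 33.0, 12.0],
 'price': [359.0, 399.0, 169.0, 169.0, 699.0, 698.5, 699.0, 415.92, 399.0, 119.0],
 'cnt': [1.0, 1.0, 1.0, 1.0, 4.0, 1.0, 1.0, 1.0, 1.0, 1.0]})

# Sort so the largest date comes last within each itemID, then keep only that row.
# Use subset=['shopID', 'itemID'] if the same itemID can occur in more than one shop.
df.sort_values(by=['itemID', 'date']).drop_duplicates(subset=['itemID'], keep='last')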
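
If the goal is the left merge itself, one possible approach is to reduce the time-series frame to one row per shopID-itemID pair first and merge afterwards, so the merge cannot fan out into several rows per pair. The sketch below is only an illustration: the small `df1` and `df2` are toy stand-ins for the frames in the question (the unmatched item 38 and the exact rows are assumptions), and the names `keys`, `latest`, `mean_price`, `merged_latest` and `merged_mean` are made up for the example. It also assumes the key columns in `df2` contain no missing values, so they can be cast from float to int to match `df1`.

```python
import pandas as pd

# Toy stand-ins for DF1 and DF2 from the question (illustrative values only).
df1 = pd.DataFrame({"shopID": [2, 2, 2, 2],
                    "itemID": [30, 31, 32, 38]})
df2 = pd.DataFrame({
    "date":   [2.0, 5.0, 15.0, 16.0, 1.0, 33.0, 12.0, 31.0],
    "shopID": [2.0] * 8,
    "itemID": [30.0, 30.0, 30.0, 30.0, 31.0, 31.0, 32.0, 32.0],
    "price":  [359.0, 399.0, 169.0, 169.0, 699.0, 399.0, 119.0, 149.0],
    "cnt":    [1.0] * 8,
})

keys = ["shopID", "itemID"]

# Assumption: the float keys in df2 hold whole numbers and no NaN,
# so they can be cast to int to match the key dtype of df1.
df2 = df2.astype({"shopID": "int64", "itemID": "int64"})

# Variant 1: keep only the row with the largest date for each key pair,
# then left-merge so every row of df1 survives exactly once.
latest = df2.sort_values("date").drop_duplicates(subset=keys, keep="last")
merged_latest = df1.merge(latest, how="left", on=keys)

# Variant 2: average the price over all months for each key pair instead.
mean_price = df2.groupby(keys, as_index=False)["price"].mean()
merged_mean = df1.merge(mean_price, how="left", on=keys)

print(merged_latest)
print(merged_mean)
```

Pairs that exist only in `df1` (item 38 in this toy data) keep their row and get NaN in the merged columns, which is the usual left-merge behaviour.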


 