
Concatenate multiple rows into one row pandas

I have a dataframe in which each row contains one specific product:

import pandas as pd

data = [['Alpha', '#10', 'Apple', '2020-10-01', 4],
        ['Alpha', '#10', 'Tomatoes', '2020-10-15', 1.5],
        ['Beta', '#12', 'Banana', '2019-03-06', 2],
        ['Beta', '#14', 'Dragonfruit', '2020-04-05', 3],
        ['Charlie', '#16', 'Watermelon', '2019-01-02', 5]]
df = pd.DataFrame(data, columns=['customer_name', 'order_number', 'product_variant', 'date', 'net_sales'])

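I want to merge the rows so that each customer ends up in a single row, with the order numbers, products, and dates spread across numbered columns and net_sales summed. Expected df: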

import numpy as np

# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
data_expected = [['Alpha', '#10', np.nan, 'Apple', 'Tomatoes', '2020-10-01', '2020-10-15', 5.5],
                 ['Beta', '#12', '#14', 'Banana', 'Dragonfruit', '2019-03-06', '2020-04-05', 5],
                 ['Charlie', '#16', np.nan, 'Watermelon', np.nan, '2019-01-02', np.nan, 5]]
df_expected = pd.DataFrame(data_expected, columns=['customer_name', 'order_number_1', 'order_number_2', 'product_variant_1', 'product_variant_2', 'date_1', 'date_2', 'net_sales'])

In the real dataframe, one customer may have more than 2 products within the same order number, and may have more than 2 order numbers, and more than 2 dates as well (as in the real world).

  1. You can first create a cc column that holds the cumulative count of rows per customer (1, 2, ...), stored as strings.
  2. Then, use .groupby to calculate the sum of net sales per customer, which you will add back to the dataframe later.
  3. pivot the dataframe and flatten the resulting multi-index columns into single names joined with _ . Note: pivot had a major bug in earlier pandas versions, so if this step fails you can upgrade with pip install pandas --upgrade
  4. Create the new aggregated net_sales column by setting it to s, the series you created earlier, before reshaping the dataframe.

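# 1. number each customer's rows 1, 2, ... (as strings, so they can be joined into column names)
df['cc'] = (df.groupby('customer_name').cumcount() + 1).astype(str)
# 2. total net sales per customer, computed before reshaping
s = df.groupby('customer_name')['net_sales'].sum()
# 3. one row per customer, then flatten ('order_number', '1') into 'order_number_1'
df = df.pivot(index=['customer_name'], columns='cc', values=['order_number','product_variant','date'])
df.columns = ['_'.join(col) for col in df.columns]
# 4. attach the totals; s aligns on the customer_name index
df['net_sales'] = s
df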

Out[1]: 
              order_number_1 order_number_2 product_variant_1  \
customer_name                                                   
Alpha                    #10            #10             Apple   
Beta                     #12            #14            Banana   
Charlie                  #16            NaN        Watermelon   

              product_variant_2      date_1      date_2  net_sales  
customer_name                                                       
Alpha                  Tomatoes  2020-10-01  2020-10-15        5.5  
Beta                Dragonfruit  2019-03-06  2020-04-05        5.0  
Charlie                     NaN  2019-01-02         NaN        5.0  
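Since the question mentions customers with more than two rows, here is a minimal sketch of the same four steps wrapped in a function and run on a customer with three rows. The helper name widen_by_customer and the extra 'Pear' row are hypothetical and only there for illustration; they are not part of the original data.

```python
import pandas as pd


def widen_by_customer(df):
    """Hypothetical helper: one row per customer, numbered columns per item."""
    out = df.copy()  # leave the caller's dataframe untouched
    # Number each customer's rows 1, 2, 3, ... as strings for the column names.
    # Caveat: string labels sort as '1', '10', '2', so with ten or more rows
    # per customer the column order will not be numeric.
    out['cc'] = (out.groupby('customer_name').cumcount() + 1).astype(str)
    # Total net sales per customer, computed before reshaping.
    totals = out.groupby('customer_name')['net_sales'].sum()
    wide = out.pivot(index='customer_name', columns='cc',
                     values=['order_number', 'product_variant', 'date'])
    # Flatten ('order_number', '1') into 'order_number_1'.
    wide.columns = ['_'.join(col) for col in wide.columns]
    wide['net_sales'] = totals  # aligns on the customer_name index
    return wide


data = [['Alpha', '#10', 'Apple', '2020-10-01', 4],
        ['Alpha', '#10', 'Tomatoes', '2020-10-15', 1.5],
        ['Alpha', '#11', 'Pear', '2020-11-02', 2],  # made-up third row
        ['Beta', '#12', 'Banana', '2019-03-06', 2]]
df = pd.DataFrame(data, columns=['customer_name', 'order_number',
                                 'product_variant', 'date', 'net_sales'])
result = widen_by_customer(df)
print(result)
```

The number of _1, _2, _3 columns follows whichever customer has the most rows; customers with fewer rows get NaN in the unused columns.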

I appreciate that an excellent accepted answer already exists, but here is my 'one-liner':

df2 = df.groupby('customer_name').apply(lambda x:pd.DataFrame(x.reset_index().unstack()).transpose())
df2

gives you this

|                | ('customer_name', 0)   | ('customer_name', 1)   | ('date', 0)   | ('date', 1)   |   ('index', 0) |   ('index', 1) |   ('net_sales', 0) |   ('net_sales', 1) | ('order_number', 0)   | ('order_number', 1)   | ('product_variant', 0)   | ('product_variant', 1)   |
|:---------------|:-----------------------|:-----------------------|:--------------|:--------------|---------------:|---------------:|-------------------:|-------------------:|:----------------------|:----------------------|:-------------------------|:-------------------------|
| ('Alpha', 0)   | Alpha                  | Alpha                  | 2020-10-01    | 2020-10-15    |              0 |              1 |                  4 |                1.5 | #10                   | #10                   | Apple                    | Tomatoes                 |
| ('Beta', 0)    | Beta                   | Beta                   | 2019-03-06    | 2020-04-05    |              2 |              3 |                  2 |                3   | #12                   | #14                   | Banana                   | Dragonfruit              |
| ('Charlie', 0) | Charlie                | nan                    | 2019-01-02    | nan           |              4 |            nan |                  5 |              nan   | #16                   | nan                   | Watermelon               | nan                      |

which is almost as required except for some aggregation and cleanup, along the lines of

del df2['customer_name']
del df2['index']
df2['net_sales_total'] = df2['net_sales'].sum(axis=1)
del df2['net_sales']
df2.columns = [c[0] + '_' + str(c[1]) for c in df2.columns]
df2.rename(columns={'net_sales_total_':'net_sales'}, inplace=True)

so we get

|                | date_0     | date_1     | order_number_0   | order_number_1   | product_variant_0   | product_variant_1   |   net_sales |
|:---------------|:-----------|:-----------|:-----------------|:-----------------|:--------------------|:--------------------|------------:|
| ('Alpha', 0)   | 2020-10-01 | 2020-10-15 | #10              | #10              | Apple               | Tomatoes            |         5.5 |
| ('Beta', 0)    | 2019-03-06 | 2020-04-05 | #12              | #14              | Banana              | Dragonfruit         |         5   |
| ('Charlie', 0) | 2019-01-02 | nan        | #16              | nan              | Watermelon          | nan                 |         5   |
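The row labels in the table above are still tuples such as ('Alpha', 0). As a possible variation, the same wide layout can be sketched with set_index and unstack instead of groupby().apply, which leaves customer_name as a plain index. This swaps in a different technique from the one-liner above; the names n, wide, and total are assumptions chosen for the example.

```python
import pandas as pd

data = [['Alpha', '#10', 'Apple', '2020-10-01', 4],
        ['Alpha', '#10', 'Tomatoes', '2020-10-15', 1.5],
        ['Beta', '#12', 'Banana', '2019-03-06', 2],
        ['Beta', '#14', 'Dragonfruit', '2020-04-05', 3],
        ['Charlie', '#16', 'Watermelon', '2019-01-02', 5]]
df = pd.DataFrame(data, columns=['customer_name', 'order_number',
                                 'product_variant', 'date', 'net_sales'])

# Position of each row within its customer (0, 1, ...), then move that
# position from the row index into the columns.
wide = (df.assign(n=df.groupby('customer_name').cumcount())
          .set_index(['customer_name', 'n'])
          .unstack('n'))

# Sum the per-row net_sales columns (NaN is skipped), then drop them.
total = wide['net_sales'].sum(axis=1)
wide = wide.drop(columns='net_sales', level=0)

# Flatten ('date', 0) into 'date_0' and attach the total.
wide.columns = [f'{name}_{i}' for name, i in wide.columns]
wide['net_sales'] = total
print(wide)
```

The numbering starts at 0 here to match the column names in the table above; adding 1 to the cumcount gives the _1, _2 names from the question.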
