How do I concatenate columns in two pandas dataframes with different indexes and non-unique keys

Question

I have one dataframe called products that looks like this:

   order_number  sku  units revenue
1  5000          754  1     20.0
2  5000          900  4     30.0
3  5001          754  2     40.0
4  5002          754  10    200.0
.  ...           ...  ..    ...

and another called orders that looks like this:

   date    order_number  units revenue  country new_customer ...
1  1-jan   5000          5     50.0     russia  yes          
2  1-jan   5001          2     40.0     china   yes          
3  2-jan   5002          10    200.0    france  no
4  2-jan   5003          1     70.0     brazil  yes
.  ....    ...           ..    ...      ...

I would like to create a single dataframe that has the rows from the products dataframe plus the columns from the orders dataframe, matched where the order number in orders equals the order number in products.

I've tried to express this with both pandas.concat and pandas.merge, but I can't get around the problem that the key I'm joining on (order_number) is unique in the orders dataframe but not in the products dataframe.

How do I do a many-to-one join like this in pandas?

Answer 1

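I think you are looking for join. It matches the order_number column of products against the index of the other frame, which is why orders is re-indexed with set_index first. You have to provide a suffix because both frames have a revenue column: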

>>> import pandas as pd
>>> products = pd.DataFrame({'order_number': [5000, 5000, 5001, 5002, 5004],
...                          'sku':          [ 754,  900,  754,  754,  900],
...                          'revenue':      [20.0, 30.0, 40.0,200.0, 90.0]})
>>> orders   = pd.DataFrame({'order_number': [5000, 5001, 5002, 5003],
...                          'units':        [   5,    2,   10,    1],
...                          'revenue':      [50.0, 40.0,200.0, 70.0]})
>>> products.join(orders.set_index('order_number'), 'order_number', rsuffix='_o')
   order_number  revenue  sku  revenue_o  units
0          5000       20  754         50      5
1          5000       30  900         50      5
2          5001       40  754         40      2
3          5002      200  754        200     10
4          5004       90  900        NaN    NaN

Edit: the same result can be achieved with products.merge(orders, how='left', on='order_number', suffixes=('', '_o'))
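For anyone who wants to run the merge variant as a script, here is a minimal sketch using the same sample data as above. It assumes a pandas version that accepts the validate argument of merge (0.21 or later, as far as I know). The validate='many_to_one' check is my addition and not part of the original answer. It makes pandas raise an error if order_number turns out not to be unique in orders, rather than silently duplicating product rows.

```python
import pandas as pd

# Same sample data as in the answer above.
products = pd.DataFrame({'order_number': [5000, 5000, 5001, 5002, 5004],
                         'sku':          [754, 900, 754, 754, 900],
                         'revenue':      [20.0, 30.0, 40.0, 200.0, 90.0]})
orders = pd.DataFrame({'order_number': [5000, 5001, 5002, 5003],
                       'units':        [5, 2, 10, 1],
                       'revenue':      [50.0, 40.0, 200.0, 70.0]})

# Left join: keep every row of products and attach the matching order's columns.
# suffixes=('', '_o') leaves products' revenue untouched and renames the
# overlapping column coming from orders to revenue_o.
# validate='many_to_one' (an addition, not in the original answer) raises
# pandas.errors.MergeError if order_number is duplicated in orders.
merged = products.merge(orders, how='left', on='order_number',
                        suffixes=('', '_o'), validate='many_to_one')

print(merged)
```

Order 5004 has no match in orders, so its units and revenue_o come back as NaN. Order 5003 exists only in orders and is dropped by the left join.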

solution1
3 ACCEPTED 2016-05-24 13:09:10