I have a dataframe with sales information in a supermarket. Each row in the dataframe represents an item, with several characteristics as columns. The original DataFrame is something like this:
In [1]: import pandas as pd
my_data = [{'ticket_number' : '001', 'item' : 'tomato', 'ticket_price' : '21'},
{'ticket_number' : '001', 'item' : 'candy', 'ticket_price' : '21'},
{'ticket_number' : '001', 'item' : 'soup', 'ticket_price' : '21'},
{'ticket_number' : '002', 'item' : 'soup', 'ticket_price' : '12'},
{'ticket_number' : '002', 'item' : 'cola', 'ticket_price' : '12'},
{'ticket_number' : '003', 'item' : 'beef', 'ticket_price' : '56'},
{'ticket_number' : '003', 'item' : 'tomato', 'ticket_price' : '56'},
{'ticket_number' : '003', 'item' : 'pork', 'ticket_price' : '56'}]
df = pd.DataFrame(my_data)
In [2]: df
Out [2]:
ticket_number ticket_price item
0 001 21 tomato
1 001 21 candy
2 001 21 soup
3 002 12 soup
4 002 12 cola
5 003 56 beef
6 003 56 tomato
7 003 56 pork
I need a DataFrame where each row represents a ticket with all the items bought and the ticket price as columns. In this example:
ticket_number ticket_price item1 item2 item3
0 001 21 tomato candy soup
1 002 12 soup cola
2 003 56 beef tomato pork
I tried using df.groupby(ticket_number).item.value_counts()
, but that does not create new columns. I have never used pivot_table
, maybe it is useful.
Any help would be very appreciated.
Thanks!
One possible way to use groupby to make lists of it that can then be turned into columns:
In [24]: res = df.groupby(['ticket_number', 'ticket_price'])['item'].apply(list).apply(pd.Series)
In [25]: res
Out[25]:
0 1 2
ticket_number ticket_price
001 21 tomato candy soup
002 12 soup cola NaN
003 56 beef tomato pork
Then, after cleaning up this result a bit:
In [27]: res.columns = ['item' + str(i + 1) for i in res.columns]
In [29]: res.reset_index()
Out[29]:
ticket_number ticket_price item1 item2 item3
0 001 21 tomato candy soup
1 002 12 soup cola NaN
2 003 56 beef tomato pork
Another possible way to create a new column which numbers the items in each group with groupby.cumcount
:
In [38]: df['item_number'] = df.groupby('ticket_number').cumcount()
In [39]: df
Out[39]:
item ticket_number ticket_price item_number
0 tomato 001 21 0
1 candy 001 21 1
2 soup 001 21 2
3 soup 002 12 0
4 cola 002 12 1
5 beef 003 56 0
6 tomato 003 56 1
7 pork 003 56 2
And then do some reshaping:
In [40]: df.set_index(['ticket_number', 'ticket_price', 'item_number']).unstack(-1)
Out[40]:
item
item_number 0 1 2
ticket_number ticket_price
001 21 tomato candy soup
002 12 soup cola NaN
003 56 beef tomato pork
From here, with some cleaning of the columns names, you can achieve the same as above.
The reshaping step with set_index
and untack
could also be done with pivot_table
: df.pivot_table(columns=['item_number'], index=['ticket_number', 'ticket _price'], values='item', aggfunc='first')
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.