Pandas pivot or groupby for dynamically generated columns
I have a DataFrame with sales information from a supermarket. Each row in the DataFrame represents an item, with several characteristics as columns. The original DataFrame is something like this:
In [1]: import pandas as pd
my_data = [{'ticket_number' : '001', 'item' : 'tomato', 'ticket_price' : '21'},
{'ticket_number' : '001', 'item' : 'candy', 'ticket_price' : '21'},
{'ticket_number' : '001', 'item' : 'soup', 'ticket_price' : '21'},
{'ticket_number' : '002', 'item' : 'soup', 'ticket_price' : '12'},
{'ticket_number' : '002', 'item' : 'cola', 'ticket_price' : '12'},
{'ticket_number' : '003', 'item' : 'beef', 'ticket_price' : '56'},
{'ticket_number' : '003', 'item' : 'tomato', 'ticket_price' : '56'},
{'ticket_number' : '003', 'item' : 'pork', 'ticket_price' : '56'}]
df = pd.DataFrame(my_data)
In [2]: df
Out [2]:
ticket_number ticket_price item
0 001 21 tomato
1 001 21 candy
2 001 21 soup
3 002 12 soup
4 002 12 cola
5 003 56 beef
6 003 56 tomato
7 003 56 pork
I need a DataFrame where each row represents a ticket, with all the items bought and the ticket price as columns. In this example:
ticket_number ticket_price item1 item2 item3
0 001 21 tomato candy soup
1 002 12 soup cola
2 003 56 beef tomato pork
I tried using df.groupby('ticket_number').item.value_counts(), but that does not create new columns. I have never used pivot_table; maybe it is useful.
Any help would be much appreciated.
Thanks!
One possible way is to use groupby to collect the items of each ticket into a list, and then turn those lists into columns:
In [24]: res = df.groupby(['ticket_number', 'ticket_price'])['item'].apply(list).apply(pd.Series)
In [25]: res
Out[25]:
0 1 2
ticket_number ticket_price
001 21 tomato candy soup
002 12 soup cola NaN
003 56 beef tomato pork
Then, after cleaning up this result a bit:
In [27]: res.columns = ['item' + str(i + 1) for i in res.columns]
In [29]: res.reset_index()
Out[29]:
ticket_number ticket_price item1 item2 item3
0 001 21 tomato candy soup
1 002 12 soup cola NaN
2 003 56 beef tomato pork
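For reference, the steps above can be put together as one self-contained script. This is only a sketch of the same idea: it builds the wide frame with pd.DataFrame(lists.tolist(), ...) instead of .apply(pd.Series), which is a substitution of mine rather than part of the answer, and the names lists and wide are made up for the illustration.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'ticket_number': ['001', '001', '001', '002', '002', '003', '003', '003'],
    'item': ['tomato', 'candy', 'soup', 'soup', 'cola', 'beef', 'tomato', 'pork'],
    'ticket_price': ['21', '21', '21', '12', '12', '56', '56', '56'],
})

# One list of items per (ticket_number, ticket_price) pair.
lists = df.groupby(['ticket_number', 'ticket_price'])['item'].apply(list)

# Expand the lists into columns; shorter lists are padded with missing values.
wide = pd.DataFrame(lists.tolist(), index=lists.index)

# Rename the integer columns 0, 1, 2, ... to item1, item2, item3, ...
wide.columns = [f'item{i + 1}' for i in wide.columns]
wide = wide.reset_index()
print(wide)
```

The number of item columns follows the longest ticket, so nothing has to be hard-coded when a ticket contains more items.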
Another possible way is to create a new column that numbers the items within each group, using groupby.cumcount:
In [38]: df['item_number'] = df.groupby('ticket_number').cumcount()
In [39]: df
Out[39]:
item ticket_number ticket_price item_number
0 tomato 001 21 0
1 candy 001 21 1
2 soup 001 21 2
3 soup 002 12 0
4 cola 002 12 1
5 beef 003 56 0
6 tomato 003 56 1
7 pork 003 56 2
And then do some reshaping:
In [40]: df.set_index(['ticket_number', 'ticket_price', 'item_number']).unstack(-1)
Out[40]:
item
item_number 0 1 2
ticket_number ticket_price
001 21 tomato candy soup
002 12 soup cola NaN
003 56 beef tomato pork
From here, with some cleaning of the column names, you can get the same result as above.
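The column clean-up is not spelled out above, so here is one way it might look. This sketch selects the 'item' column before unstacking, which avoids the extra 'item' level in the column header; that choice and the name wide are my assumptions, not part of the original answer.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'ticket_number': ['001', '001', '001', '002', '002', '003', '003', '003'],
    'item': ['tomato', 'candy', 'soup', 'soup', 'cola', 'beef', 'tomato', 'pork'],
    'ticket_price': ['21', '21', '21', '12', '12', '56', '56', '56'],
})

# Number the items within each ticket: 0, 1, 2, ...
df['item_number'] = df.groupby('ticket_number').cumcount()

# Taking the 'item' Series first gives flat (single-level) columns after unstack.
wide = df.set_index(['ticket_number', 'ticket_price', 'item_number'])['item'].unstack(-1)

# Turn the item_number labels 0, 1, 2 into item1, item2, item3.
wide.columns = [f'item{n + 1}' for n in wide.columns]
wide = wide.reset_index()
print(wide)
```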
The reshaping step with set_index and unstack could also be done with pivot_table: df.pivot_table(columns=['item_number'], index=['ticket_number', 'ticket_price'], values='item', aggfunc='first')
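A runnable sketch of the pivot_table variant is below. Replacing the missing values with empty strings at the end is my addition, to match the blank cell in the desired output from the question; the name wide is also just for illustration.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'ticket_number': ['001', '001', '001', '002', '002', '003', '003', '003'],
    'item': ['tomato', 'candy', 'soup', 'soup', 'cola', 'beef', 'tomato', 'pork'],
    'ticket_price': ['21', '21', '21', '12', '12', '56', '56', '56'],
})

# Number the items within each ticket, as in the cumcount step.
df['item_number'] = df.groupby('ticket_number').cumcount()

# aggfunc='first' is needed because the values are strings; each cell
# holds exactly one item, so 'first' simply picks that item.
wide = df.pivot_table(columns='item_number',
                      index=['ticket_number', 'ticket_price'],
                      values='item',
                      aggfunc='first')

# Rename 0, 1, 2 to item1, item2, item3 and show blanks instead of NaN.
wide.columns = [f'item{n + 1}' for n in wide.columns]
wide = wide.fillna('').reset_index()
print(wide)
```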