With Pandas, get the value from one column in a dataframe (for each group), based on the minimum value of a second column
Let's assume we have a dataframe with 3 columns: the_customer, the_date, and the_amount. We need to create a dataframe that has, for each customer, the_amount associated with the earliest (minimum) value of the_date.
Here's what we're doing so far:
each_users_first_amount = our_data[['the_customer', 'the_date', 'the_amount']]\
.sort_values(by='the_date', ascending = True)\
.groupby('the_customer', as_index=False)\
.apply(lambda x: x.head(1))\
.rename(columns = { 'the_date': 'earliest_date', 'the_amount': 'first_amount' })
This approach technically works, but for some reason it runs very slowly on our data, and I'm not sure which method in the chain is responsible (.apply?). It also seems "hacky", in particular the line .apply(lambda x: x.head(1)), which uses head to grab the first row and only works because we sorted beforehand.
In particular, it would be helpful if this could be done using .agg() in some way, since we are already using .agg() in another method chain to group the data and compute grouped-by metrics.
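For reference, the .agg() form asked about here could look like the sketch below. It is only an illustration under stated assumptions: the sample rows are invented, and the chain relies on named aggregation with the built-in "first" aggregation after sorting by the_date. Note that "first" returns the first non-null value in each column, so if the_amount is missing on a customer's earliest row, the result would come from a later row.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
our_data = pd.DataFrame({
    "the_customer": ["a", "a", "b", "b", "b"],
    "the_date": pd.to_datetime(
        ["2021-03-01", "2021-01-15", "2021-02-10", "2021-02-01", "2021-05-05"]
    ),
    "the_amount": [10.0, 20.0, 30.0, 40.0, 50.0],
})

each_users_first_amount = (
    our_data[["the_customer", "the_date", "the_amount"]]
    # Sort so that the earliest date comes first within every customer.
    .sort_values(by="the_date", ascending=True)
    .groupby("the_customer", as_index=False)
    # Named aggregation: output column = (input column, aggregation).
    .agg(
        earliest_date=("the_date", "first"),
        first_amount=("the_amount", "first"),
    )
)

print(each_users_first_amount)
```

Because "first" is a built-in aggregation rather than a Python lambda, this avoids calling .apply once per group, and the renaming step is folded into the .agg() call.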
Using an aggregate function is not efficient on larger dataframes: it can take more time than iterating. In your code, though, I think applying a function or iterating is the only possible option, so you can't replace it. I think the time-consuming step in the code is the sorting. Sorting after the groupby might reduce the running time, since sorting many small subsets should be cheaper than sorting the whole dataframe.
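Claims like these are best checked by timing on the real data. If sorting does turn out to be the expensive step, one option worth measuring is to skip the sort altogether and look up the row with the minimum the_date per group using idxmin. The sketch below uses invented sample rows and assumes the dataframe has a unique index and no missing dates; a sort-then-drop_duplicates variant is included for comparison.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
our_data = pd.DataFrame({
    "the_customer": ["a", "a", "b", "b", "b"],
    "the_date": pd.to_datetime(
        ["2021-03-01", "2021-01-15", "2021-02-10", "2021-02-01", "2021-05-05"]
    ),
    "the_amount": [10.0, 20.0, 30.0, 40.0, 50.0],
})

renames = {"the_date": "earliest_date", "the_amount": "first_amount"}

# Variant 1: no sort. idxmin gives the index label of the earliest date
# in each customer group (assumes a unique index and no missing dates).
earliest_idx = our_data.groupby("the_customer")["the_date"].idxmin()
first_by_idxmin = (
    our_data.loc[earliest_idx, ["the_customer", "the_date", "the_amount"]]
    .rename(columns=renames)
    .reset_index(drop=True)
)

# Variant 2: sort once, then keep only the first row seen per customer.
first_by_dedup = (
    our_data[["the_customer", "the_date", "the_amount"]]
    .sort_values("the_date")
    .drop_duplicates(subset="the_customer", keep="first")
    .rename(columns=renames)
    .reset_index(drop=True)
)

print(first_by_idxmin)
print(first_by_dedup)
```

Neither variant calls .apply with a Python lambda. Which one is faster will depend on the number of groups and rows, so treat both as candidates to benchmark rather than as a guaranteed improvement.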