With Pandas, get value from one column in dataframe (for each group), based on minimum value of second column

Let's assume we have a dataframe with 3 columns: the_customer, the_date, and the_amount. We need to create a dataframe that has, for each customer, the value of the_amount associated with the earliest (minimum) value of the_date. Here's what we're doing so far:

each_users_first_amount = our_data[['the_customer', 'the_date', 'the_amount']]\
    .sort_values(by='the_date', ascending = True)\
    .groupby('the_customer', as_index=False)\
    .apply(lambda x: x.head(1))\
    .rename(columns = { 'the_date': 'earliest_date', 'the_amount': 'first_amount' })

This approach technically works, but for some reason it runs very slowly on our data, and I'm not sure which method in the chain is responsible (.apply?). It also seems "hacky", in particular the line .apply(lambda x: x.head(1)), which uses head to grab the first row and only works because we sorted beforehand.
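One way to keep the sort-then-take-first idea without the .apply call might look like the sketch below. It assumes the column names from the question; the sample frame our_data is made up purely for illustration. After sorting by the_date, drop_duplicates keeps the first row it sees for each customer, which is the earliest one.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
our_data = pd.DataFrame({
    "the_customer": ["a", "b", "a", "b", "c"],
    "the_date": pd.to_datetime(
        ["2021-03-01", "2021-01-15", "2021-01-10", "2021-02-01", "2021-05-05"]
    ),
    "the_amount": [30.0, 20.0, 10.0, 25.0, 50.0],
})

each_users_first_amount = (
    our_data[["the_customer", "the_date", "the_amount"]]
    .sort_values(by="the_date", ascending=True)
    # After sorting, the first row seen for each customer is the earliest one.
    .drop_duplicates(subset="the_customer", keep="first")
    .rename(columns={"the_date": "earliest_date", "the_amount": "first_amount"})
    .reset_index(drop=True)
)
print(each_users_first_amount)
```

If a customer has several rows with the same earliest date, which of them is kept depends on the sort order, so a tie-breaking column may be needed.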

In particular, it would be helpful if this could be done with .agg() in some way, since we already use .agg() in another method chain to group the data and compute per-group metrics.
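A possible .agg() version is sketched below, using pandas named aggregation with the built-in "first" aggregation on data that has already been sorted by the_date. The sample frame is hypothetical. One assumption to flag: "first" skips missing values by default, so if the_amount is NaN in a customer's earliest row, a later non-null amount would be returned instead.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
our_data = pd.DataFrame({
    "the_customer": ["x", "y", "x", "y"],
    "the_date": pd.to_datetime(
        ["2022-06-01", "2022-02-01", "2022-01-01", "2022-04-01"]
    ),
    "the_amount": [5.0, 7.0, 3.0, 9.0],
})

each_users_first_amount = (
    our_data.sort_values("the_date")
    .groupby("the_customer", as_index=False)
    # Named aggregation: output_column=(input_column, aggregation).
    # Rows keep their sorted order inside each group, so "first" is the earliest.
    .agg(
        earliest_date=("the_date", "first"),
        first_amount=("the_amount", "first"),
    )
)
print(each_users_first_amount)
```

Other per-group metrics could be added as extra keyword arguments in the same .agg() call, which may make it easier to merge this with an existing aggregation chain.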

A Python-level aggregation function is not efficient on larger DataFrames; it can take as long as iterating over the groups yourself. In your chain as written, though, .apply is the step that picks the first row of each group, so it can't simply be removed. I also think the time-consuming step in the code is the sorting. Sorting after the groupby might reduce the time complexity, since sorting many small subsets is less work than sorting the whole DataFrame.
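In the spirit of avoiding a sort of the whole DataFrame, a sketch that skips sorting altogether is shown below: it asks each group for the row label of its minimum the_date with idxmin and then selects those rows. The sample frame is hypothetical, and the sketch assumes the index labels are unique and that every customer has at least one non-missing the_date.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
our_data = pd.DataFrame({
    "the_customer": ["p", "q", "p", "q", "r"],
    "the_date": pd.to_datetime(
        ["2020-09-01", "2020-03-01", "2020-02-01", "2020-08-01", "2020-12-01"]
    ),
    "the_amount": [40.0, 15.0, 12.0, 18.0, 99.0],
})

# Row label of the earliest the_date within each customer; no sort involved.
earliest_rows = our_data.groupby("the_customer")["the_date"].idxmin()

each_users_first_amount = (
    our_data.loc[earliest_rows, ["the_customer", "the_date", "the_amount"]]
    .rename(columns={"the_date": "earliest_date", "the_amount": "first_amount"})
    .reset_index(drop=True)
)
print(each_users_first_amount)
```

Whether this is faster than sorting first depends on the data, so it is worth timing both on a realistic sample.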
