
Python - pandas - find most frequent combination with tie-resolution - performance

Data

I have a data set that looks something like this:

| id    | string_col_A | string_col_B | creation_date |
|-------|--------------|--------------|---------------|
| x12ga | STR_X1       | STR_Y1       | 2020-11-01    |
| x12ga | STR_X1       | STR_Y1       | 2020-10-10    |
| x12ga | STR_X2       | STR_Y2       | 2020-11-06    |
| x21ab | STR_X4       | STR_Y4       | 2020-11-06    |
| x21ab | STR_X5       | STR_Y5       | 2020-11-02    |
| x11aa | STR_X3       | STR_Y3       | None          |  

Goal

  1. I want to find the most frequent combination of values for each id.
  2. Further, in case of a tie I want to extract the most recent combination.

i.e. the result for the above table would be:

| id    | string_col_A | string_col_B |
|-------|--------------|--------------|
| x12ga | STR_X1       | STR_Y1       |
| x21ab | STR_X4       | STR_Y4       |
| x11aa | STR_X3       | STR_Y3       |

Explanation

  1. For x12ga, the explanation is straightforward: STR_X1, STR_Y1 occurs twice and STR_X2, STR_Y2 occurs only once (i.e. no tie resolution is needed).
  2. x11aa is straightforward as well; there is only one row.
  3. For x21ab, both combinations have one row each, but STR_X4, STR_Y4 is the most recent.

Code
Here is what I have so far:


def reducer(id_group):
    # Count rows and find the most recent date for each combination.
    id_with_sizes = id_group.groupby(
        ["id", "string_col_A", "string_col_B"], dropna=False
    ).agg({"creation_date": [len, max]}).reset_index()
    id_with_sizes.columns = [
        "id", "string_col_A", "string_col_B", "row_count", "recent_date"
    ]
    # Most frequent combination first; the most recent date breaks ties.
    id_with_sizes.sort_values(by=["row_count", "recent_date"],
                              ascending=[False, False],
                              inplace=True)
    return id_with_sizes.head(1).drop(["recent_date", "row_count"], axis=1)

I call the above method like so:

assignment = all_data.groupby("id").apply(reducer)
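For reference, the approach above can be exercised end-to-end on the sample table from the question. The DataFrame construction below is my own reconstruction of that table, and the final `pd.concat` over groups is used as a version-safe equivalent of the `groupby(...).apply(reducer)` call:

```python
import pandas as pd

# Reconstruction of the sample table from the question (assumed dtypes).
all_data = pd.DataFrame({
    "id": ["x12ga", "x12ga", "x12ga", "x21ab", "x21ab", "x11aa"],
    "string_col_A": ["STR_X1", "STR_X1", "STR_X2", "STR_X4", "STR_X5", "STR_X3"],
    "string_col_B": ["STR_Y1", "STR_Y1", "STR_Y2", "STR_Y4", "STR_Y5", "STR_Y3"],
    "creation_date": pd.to_datetime(
        ["2020-11-01", "2020-10-10", "2020-11-06", "2020-11-06", "2020-11-02", None]
    ),
})

def reducer(id_group):
    # Count rows and track the most recent date per combination.
    id_with_sizes = id_group.groupby(
        ["id", "string_col_A", "string_col_B"], dropna=False
    ).agg({"creation_date": [len, max]}).reset_index()
    id_with_sizes.columns = [
        "id", "string_col_A", "string_col_B", "row_count", "recent_date"
    ]
    # Most frequent first; the most recent date breaks ties.
    id_with_sizes.sort_values(["row_count", "recent_date"],
                              ascending=False, inplace=True)
    return id_with_sizes.head(1).drop(["recent_date", "row_count"], axis=1)

# Equivalent to all_data.groupby("id").apply(reducer), written so the
# grouping column stays visible to reducer on all pandas versions.
assignment = pd.concat(
    [reducer(g) for _, g in all_data.groupby("id")], ignore_index=True
)
```

On the sample data this yields one row per id, with STR_X1/STR_Y1 for x12ga (frequency) and STR_X4/STR_Y4 for x21ab (recency tie-break).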

The Problem
The above code works fine when tested on small data, but the actual dataset I am working with has more than 10M rows and ~3M ids. Processing 10K ids takes 5 minutes, so the full run would take about 25 hours. I would like to improve the performance.

The Solution
I have seen questions on Stack Overflow (and elsewhere) about finding frequent combinations (albeit without tie-resolution) and about vectorizing such operations to improve performance. I am not quite sure how to achieve both for my problem above.

Ideally, the solution would still be pandas-based (the code looks and reads better with pandas).

  1. You could create a series s that combines both columns.
  2. Return the index of the max count.
  3. Filter by that index. NOTE: If you are on an earlier version of pandas, take , sort=False out of the .groupby code and sort at the end.


s = df['string_col_A'] + df['string_col_B']
df['max'] = df.groupby(['id', s])['id'].transform('count')
# idxmax returns index labels, so use .loc rather than .iloc here.
df = (df.loc[df.groupby('id', sort=False)['max'].idxmax().values]
        .drop(['max', 'creation_date'], axis=1))
df
Out[1]: 
      id string_col_A string_col_B
0  x12ga       STR_X1       STR_Y1
3  x21ab       STR_X4       STR_Y4
5  x11aa       STR_X3       STR_Y3
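One caveat: `idxmax` returns the first row carrying the maximal count in row order, so on its own the snippet above does not break ties by recency. A tweak of my own (not part of the original answer) is to sort newest-first beforehand, so that among equally frequent combinations the most recent row is encountered first:

```python
import pandas as pd

# Sample data reconstructed from the question's table (an assumption).
df = pd.DataFrame({
    "id": ["x12ga", "x12ga", "x12ga", "x21ab", "x21ab", "x11aa"],
    "string_col_A": ["STR_X1", "STR_X1", "STR_X2", "STR_X4", "STR_X5", "STR_X3"],
    "string_col_B": ["STR_Y1", "STR_Y1", "STR_Y2", "STR_Y4", "STR_Y5", "STR_Y3"],
    "creation_date": pd.to_datetime(
        ["2020-11-01", "2020-10-10", "2020-11-06", "2020-11-06", "2020-11-02", None]
    ),
})

# Newest rows first, so idxmax lands on the most recent row among
# the equally frequent combinations of each id.
df = df.sort_values("creation_date", ascending=False, na_position="last")
s = df["string_col_A"] + df["string_col_B"]
df["max"] = df.groupby(["id", s])["id"].transform("count")
out = (df.loc[df.groupby("id", sort=False)["max"].idxmax().values]
         .drop(["max", "creation_date"], axis=1))
```

With this ordering, x21ab resolves to STR_X4/STR_Y4 (the 2020-11-06 row) rather than whichever row happened to come first in the file.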

You need to groupby only by the id column and find the most-frequent data (mode) based on it.

To make things easier you can create another column, combined_str:

df['combined_str'] = df['string_col_A'] + df['string_col_B']

Group by id and reduce using the pd.Series.mode function:

df = df.sort_values(by=['creation_date'])
df = df.groupby('id')['combined_str'].agg(most_common=pd.Series.mode)
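Note that `pd.Series.mode` returns every most-frequent value, so a tie leaves an array in the result rather than a single combination. One way to keep exactly one value per id (a sketch of my own, not part of the original answer) is to take the first mode explicitly:

```python
import pandas as pd

# Sample data reconstructed from the question's table (an assumption).
df = pd.DataFrame({
    "id": ["x12ga", "x12ga", "x12ga", "x21ab", "x21ab", "x11aa"],
    "string_col_A": ["STR_X1", "STR_X1", "STR_X2", "STR_X4", "STR_X5", "STR_X3"],
    "string_col_B": ["STR_Y1", "STR_Y1", "STR_Y2", "STR_Y4", "STR_Y5", "STR_Y3"],
    "creation_date": pd.to_datetime(
        ["2020-11-01", "2020-10-10", "2020-11-06", "2020-11-06", "2020-11-02", None]
    ),
})
df["combined_str"] = df["string_col_A"] + df["string_col_B"]

# mode() returns all most-frequent values in sorted order; iloc[0] keeps
# the first, so ties are resolved deterministically (alphabetically),
# but NOT by recency as the question requires.
most_common = (df.sort_values("creation_date")
                 .groupby("id")["combined_str"]
                 .agg(lambda x: x.mode().iloc[0]))
```

On the sample data this happens to match the expected output, but only because STR_X4STR_Y4 also sorts first alphabetically; the recency requirement needs one of the other approaches.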

Let us try groupby with transform, then get the count of the most common value, then sort_values with drop_duplicates:

df['help'] = df.groupby(['id', 'string_col_A', 'string_col_B'])['string_col_A'].transform('count')
out = (df.sort_values(['help', 'creation_date'], na_position='first')
         .drop_duplicates('id', keep='last')
         .drop(['help', 'creation_date'], axis=1))
out
out
Out[122]: 
      id string_col_A string_col_B
3  x21ab       STR_X4       STR_Y4
5  x11aa       STR_X3       STR_Y3
0  x12ga       STR_X1       STR_Y1
