How do I keep only the top n% of rows in each group of a pandas DataFrame?

Question

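I have seen a variant of this question that keeps the top n rows of each group in a pandas DataFrame, where the solutions use n as an absolute number rather than a percentage: "Pandas get topmost n records within each group". However, in my DataFrame each group has a different number of rows, and I want to keep the top n% of rows in each group. How would I approach this?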

Answer 1

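You can construct a Boolean series of flags and use it as a filter. First, let's create an example DataFrame and look at the number of rows for each unique value in the first column: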

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 3)))

print(df[0].value_counts())

0    6
1    4
Name: 0, dtype: int64

Then define a fraction, e.g. 50% as below, and construct a Boolean series for filtering:

n = 0.5

g = df.groupby(0)
flags = (g.cumcount() + 1) <= g[1].transform('size') * n
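To see why this comparison works, it can help to look at the two pieces side by side. The sketch below uses a small hypothetical frame (the column names `key` and `val` and the variable names are made up for illustration) and prints each row's 1-based position within its group next to that group's row budget.

```python
import pandas as pd

# Hypothetical toy data: group "a" has 4 rows, group "b" has 2 rows.
toy = pd.DataFrame({"key": ["a", "a", "a", "a", "b", "b"],
                    "val": [10, 20, 30, 40, 50, 60]})
frac = 0.5  # fraction of each group to keep

grouped = toy.groupby("key")
position = grouped.cumcount() + 1                  # 1-based row position within each group
budget = grouped["val"].transform("size") * frac   # rows allowed per group, repeated on every row
keep = position <= budget                          # True for rows that fall inside the budget

print(pd.DataFrame({"key": toy["key"], "position": position,
                    "budget": budget, "keep": keep}))
```

Because the budget is compared as a float, a fractional budget is effectively rounded down: a budget of 1.5 keeps only the row at position 1.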

Then apply the condition, set the index to the first column and (if required) sort the index:

df = df.loc[flags].set_index(0).sort_index()

print(df)

   1  2
0      
0  1  1
0  1  1
0  1  0
1  1  1
1  1  0

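As you can see, the resulting DataFrame has only three rows with index 0 and two rows with index 1, in each case half the number in the original DataFrame.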
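If this is needed in more than one place, the same flag-and-filter idea could be wrapped in a small helper. This is only a sketch: the function name `keep_top_fraction` and its parameters are assumptions for illustration, not part of pandas, and "top" here means first in the existing row order.

```python
import pandas as pd

def keep_top_fraction(frame, by, frac):
    """Keep the first `frac` share of rows in each group, in original row order.

    Hypothetical helper: `by` is a single column name, `frac` is between 0 and 1.
    """
    grouped = frame.groupby(by)
    position = grouped.cumcount() + 1            # 1-based position inside the group
    group_size = grouped[by].transform("size")   # group length, aligned to each row
    return frame.loc[position <= group_size * frac]

demo = pd.DataFrame({"id": [1, 1, 1, 2, 2, 2, 2, 3, 4],
                     "value": [1, 2, 3, 1, 2, 3, 4, 1, 1]})
result = keep_top_fraction(demo, "id", 0.5)
print(result)
```

Note that with this comparison a group whose budget is below one row (for example a single-row group at 50%) is dropped entirely, so the rounding rule may need adjusting if every group must be represented.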

Answer 2

Here is another option, which builds on some of the answers in the post you mentioned.

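First, here is a quick function to round either up or down. If we want the top 30% of the rows of a DataFrame that is 8 rows long, we would be trying to take 2.4 rows, so we need to round up or down.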

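My preferred option is to round up. This is because if, for example, we take 50% of the rows but one group has only one row, we still keep that row. I kept the function separate so that you can change the rounding as you wish.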

import math

def round_func(x, up=True):
    '''Round a float up (default) or down to a whole number of rows.'''
    if up:
        # math.ceil leaves whole numbers unchanged, unlike int(x + 1)
        return math.ceil(x)
    else:
        return int(x)
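One detail worth checking in any round-up helper is what happens when the value is already a whole number. The sketch below compares the add-one-and-truncate shortcut `int(x + 1)` with `math.ceil`; the name `ceil_rows` is made up here for illustration.

```python
import math

def ceil_rows(x):
    """Round a row count up to the next whole number (hypothetical helper)."""
    return math.ceil(x)

# Compare both ways of rounding up for a few candidate row counts.
for rows in (2.4, 0.5, 2.0):
    print(rows, int(rows + 1), ceil_rows(rows))
```

The two agree for fractional inputs, but for an exact whole number such as 2.0 the shortcut yields 3 while `math.ceil` yields 2, which matters when the fraction divides a group size evenly.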

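Next I make a DataFrame to work with and set a parameter p to the fraction of rows that should be kept from each group. Everything else follows from that, and I have commented the code so that hopefully you can follow it.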

import pandas as pd
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})

p = 0.30 # top fraction to keep. Currently set to 30%
df_top = df.groupby('id').apply(                        # group by the ids
    lambda x: x.reset_index()['value'].nlargest(        # in each group take the top rows by column 'value'
        round_func(x.count().max()*p)))        # calculate how many to keep from each group

df_top = df_top.reset_index().drop('level_1', axis=1)   # make the dataframe nice again

df looks like this

   id  value
0   1      1
1   1      2
2   1      3
3   2      1
4   2      2
5   2      3
6   2      4
7   3      1
8   4      1

df_top looks like this

   id  value
0   1      3
1   2      4
2   2      3
3   3      1
4   4      1
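For comparison, the same result could probably be reached without `apply` by sorting first and then filtering with a per-group position mask. This is a sketch under assumptions rather than either answerer's method: it rounds the per-group quota up with `math.ceil`, and the names `ranked`, `position`, `quota` and `top_rows` are made up for illustration.

```python
import math
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   "value": [1, 2, 3, 1, 2, 3, 4, 1, 1]})
p = 0.30  # fraction of each group to keep

# Put the largest values first within each id.
ranked = df.sort_values(["id", "value"], ascending=[True, False])
grouped = ranked.groupby("id")

position = grouped.cumcount() + 1                                   # 1-based rank within the group
quota = grouped["value"].transform("size").mul(p).apply(math.ceil)  # rows to keep, rounded up
top_rows = ranked.loc[position <= quota].reset_index(drop=True)

print(top_rows)
```

Rounding up means every group keeps at least one row, and, unlike the `nlargest` version, the other columns of each kept row stay attached without any extra index cleanup.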

Answer 1
2 Accepted 2018-11-17 22:49:08

Answer 2
1 2018-11-17 23:14:32
