How to set the values of the first few rows of each group in a Pandas DataFrame

Question

I am new to groupby methods in Pandas and can't quite get my head around them. I have data with ~2M records, and my current code will take 4 days to execute because of its inefficient use of 'append'.

I am analyzing manufacturing data with 2 flags that indicate problems with the test specimens. The first few flags for each Test_ID should be set to False. (Reason: there is not sufficient data to accurately analyze the first few rows of each group.)

My inefficient attempt (right result, but not fast enough for 2M rows):

import pandas as pd

df = pd.DataFrame({'Test_ID' : ['foo', 'foo', 'foo', 'foo', 
                                'bar', 'bar', 'bar'],
                'TEST_Date' : ['2020-01-09 09:49:31',
                                '2020-01-09 12:16:15',
                                '2020-01-09 12:47:44',
                                '2020-01-09 14:39:05',
                                '2020-01-09 17:39:47',
                                '2020-01-09 20:44:58',
                                '2020-01-10 18:40:47'],
                'Flag1' : [True, False, True, False, True, False, False],
                'Flag2' : [True, False, False, False, True, False, False],
                })
 
#generate a list of Test_IDs
Test_IDs = list(df['Test_ID'].unique())  

#generate a list of columns in the dataframe
cols = list(df)  

#generate a new dataframe with the same columns as the original
df_output = pd.DataFrame(columns = cols) 

for i in Test_IDs:
    #split the data into groups, iterate over each group
    df_2 = df[df['Test_ID'] == i].copy()   
    
    #set the first two rows of Flag1 to False for each group
    df_2.iloc[:2, df_2.columns.get_loc('Flag1')] = 0  
    
    #set the first three rows of Flag2 to False for each group
    df_2.iloc[:3, df_2.columns.get_loc('Flag2')] = 0
    
    df_output = df_output.append(df_2)   #add the latest group onto the output df
print(df_output)

Input:

   Flag1  Flag2            TEST_Date Test_ID
0   True   True  2020-01-09 09:49:31     foo
1  False  False  2020-01-09 12:16:15     foo
2   True  False  2020-01-09 12:47:44     foo
3  False  False  2020-01-09 14:39:05     foo
4   True   True  2020-01-09 17:39:47     bar
5  False  False  2020-01-09 20:44:58     bar
6  False  False  2020-01-10 18:40:47     bar

Output:

   Flag1  Flag2            TEST_Date Test_ID
0  False  False  2020-01-09 09:49:31     foo
1  False  False  2020-01-09 12:16:15     foo
2   True  False  2020-01-09 12:47:44     foo
3  False  False  2020-01-09 14:39:05     foo
4  False  False  2020-01-09 17:39:47     bar
5  False  False  2020-01-09 20:44:58     bar
6  False  False  2020-01-10 18:40:47     bar

Answer 1

Let's use groupby().cumcount(), which numbers the rows within each group starting from 0:

# enumeration of rows within each `Test_ID`
enum = df.groupby('Test_ID').cumcount()

# overwrite the Flags
df.loc[enum < 2, 'Flag1'] = False
df.loc[enum < 3, 'Flag2'] = False

Output:

  Test_ID            TEST_Date  Flag1  Flag2
0     foo  2020-01-09 09:49:31  False  False
1     foo  2020-01-09 12:16:15  False  False
2     foo  2020-01-09 12:47:44   True  False
3     foo  2020-01-09 14:39:05  False  False
4     bar  2020-01-09 17:39:47  False  False
5     bar  2020-01-09 20:44:58  False  False
6     bar  2020-01-10 18:40:47  False  False
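
To make the mechanics more concrete, here is a minimal, self-contained sketch of the same idea. It first prints what cumcount() returns for the sample data, then wraps the overwrite step in a small helper. The helper name clear_leading_flags and its n_by_flag parameter are hypothetical, introduced only for illustration; they are not part of pandas or of the original answer. Note that cumcount() follows the existing row order, so this assumes the rows of each Test_ID are already in the order you want (for example, sorted by TEST_Date).

```python
import pandas as pd

# Reduced copy of the sample data from the question (TEST_Date omitted for brevity)
df = pd.DataFrame({
    'Test_ID': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
    'Flag1': [True, False, True, False, True, False, False],
    'Flag2': [True, False, False, False, True, False, False],
})

# cumcount() numbers the rows of each group 0, 1, 2, ... in their existing order
enum = df.groupby('Test_ID').cumcount()
print(enum.tolist())  # [0, 1, 2, 3, 0, 1, 2]


def clear_leading_flags(frame, group_col, n_by_flag):
    """Hypothetical helper: return a copy of `frame` in which, for each
    (column, n) pair in `n_by_flag`, the first n rows of every group
    in `group_col` are set to False."""
    out = frame.copy()
    position = out.groupby(group_col).cumcount()
    for flag_col, n in n_by_flag.items():
        # Boolean mask selects the first n rows of every group at once
        out.loc[position < n, flag_col] = False
    return out


result = clear_leading_flags(df, 'Test_ID', {'Flag1': 2, 'Flag2': 3})
print(result)
```

Because the mask is computed once for the whole frame, there is no Python-level loop over groups and no repeated append, which is what should make this approach scale to a couple of million rows far better than the loop in the question.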
