
Pandas dataframe split based on a filter on groupby

I have a pandas DataFrame like the one below:

(image of the dataframe not available)

I want to split the dataframe into two separate dataframes, depending on whether each row's combination of 'O', 'A', 'N', 'value_next' is unique or not. So I did this:

mask = dft.groupby(['O', 'A', 'N', 'value_next']).filter(lambda x: len(x) <= 1)

df1 = dft[mask]
df2 = dft[~mask]

But the line df1 = dft[mask] raises this error:

ValueError: Boolean array expected for the condition, not int64

What am I missing?

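The error occurs because groupby(...).filter(...) returns the rows of the groups that pass the test, as a DataFrame, rather than a boolean mask, so dft[mask] receives integer data where it expects booleans.

Here is a slightly different approach using .duplicated instead of groupby/filter, which can be really slow if you have a large dft. Note keep=False, which marks every row that has a duplicate; the default (keep='first') leaves the first occurrence of each duplicate unmarked.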

import pandas as pd
import numpy as np

num_rows = 100

np.random.seed(1)

#Creating a test df
dft = pd.DataFrame({
    'time':np.random.randint(5,25,num_rows),
    'O':np.random.randint(1,4,num_rows),
    'A':np.random.randint(1,4,num_rows),
    'N':np.random.randint(1,4,num_rows),
    'value':np.random.randint(10,100,num_rows),
    'value_next':np.random.randint(-10,40,num_rows),
})

#Getting a mask of True if duplicated, False otherwise
is_dup = dft.duplicated(['O', 'A', 'N', 'value_next'],keep=False)

df1 = dft[~is_dup]
df2 = dft[is_dup]

print(df2)

#Quick check that a row in df2 was originally duplicated
#(wrapped in print so the result also shows when run as a script)
print(dft[
    dft['O'].eq(2) &
    dft['A'].eq(3) &
    dft['N'].eq(1) &
    dft['value_next'].eq(8)
])
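
If you would rather stay with groupby, the sketch below shows a couple of ways to get an actual boolean mask out of it. The small frame and the names keys, kept, mask_from_filter, mask_from_size and mask_from_dup are made up for illustration; only the column names come from the question. It is a minimal sketch under those assumptions, not a benchmark.

```python
import pandas as pd

# Small hypothetical frame; the column names follow the question,
# the values are invented so the result is easy to check by eye.
dft = pd.DataFrame({
    'O': [1, 1, 2, 2, 3],
    'A': [1, 1, 2, 2, 3],
    'N': [1, 1, 1, 1, 2],
    'value': [10, 20, 30, 40, 50],
    'value_next': [5, 5, 7, 8, 9],
})
keys = ['O', 'A', 'N', 'value_next']

# filter() returns the surviving rows as a DataFrame, not a boolean mask.
kept = dft.groupby(keys).filter(lambda g: len(g) <= 1)

# Option 1: turn the surviving rows into a mask via their index labels
# (assumes the index of dft is unique).
mask_from_filter = dft.index.isin(kept.index)

# Option 2: broadcast each group's size back onto its rows and compare.
mask_from_size = dft.groupby(keys)['value'].transform('size') <= 1

# Option 3: the duplicated() approach from this answer, negated.
mask_from_dup = ~dft.duplicated(keys, keep=False)

# With the default keep='first', the first row of a duplicated pair
# is NOT flagged, which is why keep=False is needed here.
default_flags = dft.duplicated(keys)

df1 = dft[mask_from_size]    # rows whose key combination is unique
df2 = dft[~mask_from_size]   # rows whose key combination repeats

print(df1)
print(df2)
```

On this toy data all three masks agree: rows 0 and 1 share a key combination and end up in df2, while rows 2, 3 and 4 go to df1.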
