
Split pandas df based on unique values

I have the following pandas DataFrame:

import pandas as pd

columns = ['question_id', 'answer', 'is_correct']
data = [['1', 'hello', '1.0'],
        ['1', 'hello', '1.0'],
        ['1', 'hello', '1.0'],
        ['2', 'dog', '0.0'],
        ['2', 'cat', '1.0'],
        ['2', 'dog', '0.0'],
        ['2', 'the answer is cat', '1.0'],
        ['3', 'Milan', '1.0'],
        ['3', 'Paris', '0.0'],
        ['3', 'The capital is Paris', '0.0'],
        ['3', 'MILAN', '1.0'],
        ['4', 'The capital is Paris', '1.0'],
        ['4', 'London', '0.0'],
        ['4', 'Paris', '1.0'],
        ['4', 'paris', '1.0'],
        ['5', 'lol', '0.0'],
        ['5', 'rofl', '0.0'],
        ['6', '5.5', '1.0'],
        ['6', '5.2', '0.0']]
df = pd.DataFrame(columns=columns, data=data)
df

I want to split it into two dfs based on question_id: 80% of the unique question_id values (rounded up) should go to df1 and the remaining 20% to df2.

Dummy example with the df above: df1 includes ids 1-5 and df2 includes id 6.

df1_data = [['1', 'hello', '1.0'],
            ['1', 'hello', '1.0'],
            ['1', 'hello', '1.0'],
            ['2', 'dog', '0.0'],
            ['2', 'cat', '1.0'],
            ['2', 'dog', '0.0'],
            ['2', 'the answer is cat', '1.0'],
            ['3', 'Milan', '1.0'],
            ['3', 'Paris', '0.0'],
            ['3', 'The capital is Paris', '0.0'],
            ['3', 'MILAN', '1.0'],
            ['4', 'The capital is Paris', '1.0'],
            ['4', 'London', '0.0'],
            ['4', 'Paris', '1.0'],
            ['4', 'paris', '1.0'],
            ['5', 'lol', '0.0'],
            ['5', 'rofl', '0.0']]

df2_data = [['6', '5.5', '1.0'],
            ['6', '5.2', '0.0']]
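As a quick check of the arithmetic behind this expected split: with 6 unique ids, 80% is 4.8, which rounds up to 5 ids for df1 and leaves 1 for df2. Below is a minimal sketch, assuming "rounding up" means taking the ceiling. The helper name n_ids_for_df1 is made up for illustration.

```python
import math

def n_ids_for_df1(n_unique, frac=0.8):
    # Ceiling of frac * n_unique; reading "rounding up" as ceiling is an assumption.
    return math.ceil(frac * n_unique)

n_first = n_ids_for_df1(6)   # ids that go to df1
n_second = 6 - n_first       # ids left for df2
print(n_first, n_second)
```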

First, get the unique question ids:

unique_qid = df['question_id'].unique()
unique_qid
# array(['1', '2', '3', '4', '5', '6'], dtype=object)

Then take the first 80% of the unique question ids and use the corresponding boolean mask to build the two output dfs:

df1_idx = df['question_id'].isin(unique_qid[:round(0.8 * len(unique_qid))])
df1_data = df.loc[df1_idx, :]
df2_data = df.loc[~df1_idx, :]
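The same idea can be wrapped in a small reusable function. This is only a sketch: the function name split_by_unique_id and the demo frame are hypothetical, and it swaps round() for math.ceil() on the assumption that "rounding up" in the question means the ceiling. Python's round() rounds to the nearest integer instead, so the two can differ (for 3 unique ids, round(0.8 * 3) gives 2 while math.ceil gives 3).

```python
import math
import pandas as pd

def split_by_unique_id(df, column, frac=0.8):
    # Unique ids, in order of first appearance in the column.
    unique_ids = df[column].unique()
    # Number of ids for the first part, rounded up (assumption: ceiling).
    n_first = math.ceil(frac * len(unique_ids))
    # Boolean mask: True for rows whose id belongs to the first part.
    in_first = df[column].isin(unique_ids[:n_first])
    return df[in_first], df[~in_first]

# Small made-up frame with 6 unique ids, just to exercise the function.
demo = pd.DataFrame({
    'question_id': ['1', '1', '2', '3', '3', '4', '5', '6', '6'],
    'answer': list('abcdefghi'),
})
part1, part2 = split_by_unique_id(demo, 'question_id')
print(part1['question_id'].unique())
print(part2['question_id'].unique())
```

All rows sharing a question_id end up in the same part, because the mask is built from the ids rather than from row positions. The split follows the order in which ids first appear. If a random split is wanted, the ids could be shuffled before slicing.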

df1_data


df2_data

