简体   繁体   中英

Split dataframe based on number of rows with a column value

I have a dataframe that has an animals column with different animals (say ["cat", "dog", "lion"]) as rows and a value corresponding to each animal. There are 10 unique animals and 50 entries of each. The animals are not in any particular order.

I want to split the dataframe into two with one containing 40 of each animal and another containing 10 of each animal. That is one dataframe should contain 40 cats, 40 dogs etc and the other dataframe with 10 cats, 10 dogs etc.

Any help would be greatly appreciated.

I have tried to sort by unique values but it did not work. I am not very familiar with Pandas yet and this is the first time I am using it.

Edit:

Adding a better example of what I need

Animal value
dog 12
cat 14
dog 10
cat 40
dog 90
dog 80
cat 30
dog 20
cat 20
cat 23

I want to separate this into 2 data frames. In this example the first dataframe would have 3 of each animal and the other one would have 2 of each animal.

Animal value
dog 12
dog 10
dog 90
cat 14
cat 40
cat 30
Animal value
dog 80
dog 20
cat 20
cat 23

Does this work? df.groupby('animal', group_keys=False).apply(lambda x: x.sample(frac=0.2)) You could then remove these rows from your original dataframe to create the one with 40 of each animal.

You can get the two dataframes the following way:

df_big = df.groupby('category').apply(lambda x: x.sample(frac=0.8)).reset_index('category', drop=True)
df_small = df.drop(df_big.index)

We can construct a custom function that returns two DataFrames based on the requirement that one should have 40 entries of each animal and the other should contain 10, and then apply the function to the DataFrame, grouped by the 'Animal' column:

def split_df(df):
    df1 = df[df['value'] < 30]  # select rows where 'value' is less than 30
    df2 = df[df['value'] >= 30]  # select rows where 'value' is greater than or equal to 30
    return df1, df2

# apply the custom function to the DataFrame, grouped by 'Animal'
df1, df2 = df.groupby('Animal').apply(split_df)

print(df1)
print(df2)

The two DataFrames that result will be printed to the console as a result. In this example, there will be four entries for each animal in the first DataFrame (df1), and the final 10 items for each animal will be in the second DataFrame (df2) (1 entry for each animal in this example).

Pandas is really powerfull as you can see from @jmendes16 proposal.

What you not mentiin and should think about is, if you want 40 arbritary picked values or the first/last etc. Additionally is the final order important.

If you want to get familar with pandas you can try to do it step by step, by selecting parts of it and combine them. Eg. if you want to get the fourty first dogs, you can do:

df_40 = df[df.Animal == "dog"].iloc[0:40]
df_10 = df[df.Animal == "dog"].iloc[40:50]

Edit: That is not an efficient, but rather educational solution;).

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM