Split dataframe based on number of rows with a column value

Question

I have a dataframe that has an animals column with different animals (say ["cat", "dog", "lion"]) as rows and a value corresponding to each animal. There are 10 unique animals and 50 entries of each. The animals are not in any particular order.

I want to split the dataframe into two with one containing 40 of each animal and another containing 10 of each animal. That is one dataframe should contain 40 cats, 40 dogs etc and the other dataframe with 10 cats, 10 dogs etc.

Any help would be greatly appreciated.

I have tried to sort by unique values but it did not work. I am not very familiar with Pandas yet and this is the first time I am using it.

Edit:

Adding a better example of what I need

Animal	value
dog	12
cat	14
dog	10
cat	40
dog	90
dog	80
cat	30
dog	20
cat	20
cat	23

I want to separate this into 2 data frames. In this example the first dataframe would have 3 of each animal and the other one would have 2 of each animal.

Animal	value
dog	12
dog	10
dog	90
cat	14
cat	40
cat	30

Animal	value
dog	80
dog	20
cat	20
cat	23

Answer 1

Does this work? df.groupby('animal', group_keys=False).apply(lambda x: x.sample(frac=0.2)) You could then remove these rows from your original dataframe to create the one with 40 of each animal.

Answer 2

You can get the two dataframes the following way:

df_big = df.groupby('category').apply(lambda x: x.sample(frac=0.8)).reset_index('category', drop=True)
df_small = df.drop(df_big.index)

Answer 3

We can construct a custom function that returns two DataFrames based on the requirement that one should have 40 entries of each animal and the other should contain 10, and then apply the function to the DataFrame, grouped by the 'Animal' column:

def split_df(df):
    df1 = df[df['value'] < 30]  # select rows where 'value' is less than 30
    df2 = df[df['value'] >= 30]  # select rows where 'value' is greater than or equal to 30
    return df1, df2

# apply the custom function to the DataFrame, grouped by 'Animal'
df1, df2 = df.groupby('Animal').apply(split_df)

print(df1)
print(df2)

The two DataFrames that result will be printed to the console as a result. In this example, there will be four entries for each animal in the first DataFrame (df1), and the final 10 items for each animal will be in the second DataFrame (df2) (1 entry for each animal in this example).

Answer 4

Pandas is really powerfull as you can see from @jmendes16 proposal.

What you not mentiin and should think about is, if you want 40 arbritary picked values or the first/last etc. Additionally is the final order important.

If you want to get familar with pandas you can try to do it step by step, by selecting parts of it and combine them. Eg. if you want to get the fourty first dogs, you can do:

df_40 = df[df.Animal == "dog"].iloc[0:40]
df_10 = df[df.Animal == "dog"].iloc[40:50]

Edit: That is not an efficient, but rather educational solution;).

Split dataframe based on number of rows with a column value

Question

4 answers

solution1
1 2022-12-22 17:05:14

solution2
0 2022-12-22 17:19:36

solution3
0 2022-12-22 17:20:14

solution4
0 2022-12-22 17:22:13

Split dataframe based on number of rows with a column value

Question

4 answers

solution1 1 2022-12-22 17:05:14

solution2 0 2022-12-22 17:19:36

solution3 0 2022-12-22 17:20:14

solution4 0 2022-12-22 17:22:13

solution1
1 2022-12-22 17:05:14

solution2
0 2022-12-22 17:19:36

solution3
0 2022-12-22 17:20:14

solution4
0 2022-12-22 17:22:13