
Split dataframe based on number of rows with a column value

I have a dataframe with an animals column containing different animals (say ["cat", "dog", "lion"]) as rows, and a value corresponding to each animal. There are 10 unique animals and 50 entries of each. The animals are not in any particular order.

I want to split the dataframe into two, one containing 40 of each animal and the other containing 10 of each animal. That is, one dataframe should contain 40 cats, 40 dogs, etc., and the other should contain 10 cats, 10 dogs, etc.

Any help would be greatly appreciated.

I have tried to sort by unique values, but it did not work. I am not very familiar with Pandas yet and this is the first time I am using it.

Edit:

Adding a better example of what I need:

Animal  value
dog     12
cat     14
dog     10
cat     40
dog     90
dog     80
cat     30
dog     20
cat     20
cat     23

I want to separate this into 2 data frames. In this example, the first dataframe would have 3 of each animal and the other would have 2 of each animal.

Animal  value
dog     12
dog     10
dog     90
cat     14
cat     40
cat     30

Animal  value
dog     80
dog     20
cat     20
cat     23

Does this work? It samples 20% of each animal's rows (10 of 50):

df.groupby('animal', group_keys=False).apply(lambda x: x.sample(frac=0.2))

You could then remove these rows from your original dataframe to create the one with 40 of each animal.

You can get the two dataframes the following way:

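# 'category' stands for the grouping column ('Animal' in the question).
# Sample 80% of each group, then drop the group level so the original index labels remain.
df_big = df.groupby('category').apply(lambda x: x.sample(frac=0.8)).reset_index('category', drop=True)
# The rows that were not sampled form the second dataframe.
df_small = df.drop(df_big.index)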
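
As a minimal sketch of the same sample-then-drop idea, the snippet below uses `DataFrameGroupBy.sample` (available in pandas 1.1 and later) instead of `apply`. The data is made up for illustration: the `animal_0` ... `animal_9` names, the `value` numbers and the `random_state` seed are assumptions, not part of the question. It also assumes the index is unique, because `drop` removes rows by index label.

```python
import pandas as pd

# Hypothetical data: 10 animals with 50 rows each, in shuffled order.
animals = [f"animal_{i}" for i in range(10)]
df = pd.DataFrame({
    "Animal": [a for a in animals for _ in range(50)],
    "value": range(500),
})
df = df.sample(frac=1, random_state=0).reset_index(drop=True)

# Draw 40 random rows per animal; the original index labels are kept.
df_big = df.groupby("Animal").sample(n=40, random_state=0)

# Every row that was not drawn goes to the second frame
# (this relies on the index being unique).
df_small = df.drop(df_big.index)

print(df_big["Animal"].value_counts())
print(df_small["Animal"].value_counts())
```

Passing `n=40` rather than `frac=0.8` states the requirement directly; with exactly 50 rows per animal the two select the same number of rows.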

We can write a small function that returns two DataFrames, one with the first 40 entries of each animal and one with the remaining 10. It numbers the rows inside each 'Animal' group with groupby().cumcount() and splits on that position:

def split_df(df, n_first=40):
    # Position of each row within its 'Animal' group: 0, 1, 2, ...
    pos = df.groupby('Animal').cumcount()
    df1 = df[pos < n_first]   # first n_first rows of each animal
    df2 = df[pos >= n_first]  # remaining rows of each animal
    return df1, df2

# Split the whole DataFrame; rows keep their original order and index
df1, df2 = split_df(df)

print(df1)
print(df2)

Both DataFrames are printed to the console. With 50 entries per animal, the first DataFrame (df1) holds the first 40 rows of each animal and the second DataFrame (df2) holds the final 10. For the ten-row example in the question, split_df(df, n_first=3) puts three rows of each animal in df1 and two in df2.

Pandas is really powerful, as you can see from @jmendes16's proposal.

What you did not mention, and should think about, is whether you want 40 arbitrarily picked values or the first/last 40, etc. Also consider whether the final order is important.
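
If the first or last rows of each animal are wanted rather than a random pick, one possible sketch uses `GroupBy.head` and `GroupBy.tail`, shown here on the ten-row example from the question. Both calls return rows in their original order; the stable sort at the end is an optional extra step (my assumption about the desired layout) that groups equal animals together without reordering rows inside a group.

```python
import pandas as pd

# The ten-row example from the question.
df = pd.DataFrame({
    "Animal": ["dog", "cat", "dog", "cat", "dog",
               "dog", "cat", "dog", "cat", "cat"],
    "value": [12, 14, 10, 40, 90, 80, 30, 20, 20, 23],
})

# First 3 rows of each animal, in original row order.
first = df.groupby("Animal").head(3)

# Last 2 rows of each animal, in original row order.
last = df.groupby("Animal").tail(2)

# Optional: put equal animals next to each other; a stable sort
# keeps the relative order of rows within each animal.
first_sorted = first.sort_values("Animal", kind="stable")

print(first_sorted)
print(last)
```

For the full data set the same calls would be `head(40)` and `tail(10)`.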

If you want to get familiar with pandas, you can try to do it step by step, by selecting parts of the dataframe and combining them. E.g. if you want to get the first forty dogs (and then the remaining ten), you can do:

df_40 = df[df.Animal == "dog"].iloc[0:40]
df_10 = df[df.Animal == "dog"].iloc[40:50]

Edit: That is not an efficient solution, but rather an educational one ;).
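
To carry the step-by-step idea through for every animal, a sketch along these lines could loop over the unique animals, slice each one with `iloc`, and glue the pieces together with `pd.concat`. It uses the ten-row example from the question, so the cut-off is 3 instead of 40; the names `big_parts` and `small_parts` are only illustrative.

```python
import pandas as pd

# The ten-row example from the question.
df = pd.DataFrame({
    "Animal": ["dog", "cat", "dog", "cat", "dog",
               "dog", "cat", "dog", "cat", "cat"],
    "value": [12, 14, 10, 40, 90, 80, 30, 20, 20, 23],
})

n_first = 3  # would be 40 for the full data set

big_parts, small_parts = [], []
# unique() returns the animals in order of first appearance.
for animal in df["Animal"].unique():
    rows = df[df["Animal"] == animal]
    big_parts.append(rows.iloc[:n_first])    # first n_first rows of this animal
    small_parts.append(rows.iloc[n_first:])  # whatever is left

# Combine the per-animal pieces into the two result frames.
df_big = pd.concat(big_parts)
df_small = pd.concat(small_parts)

print(df_big)
print(df_small)
```

Because the pieces are concatenated animal by animal, the result is grouped the same way as the expected output in the question (all dogs first, then all cats).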
