How to randomly select rows based on multiple conditions
I have two datasets with 20 rows each. I am looking to randomly select 10 rows from each dataset following the criteria below.
df1 group:
df2 group:
6 terrestrial and 4 aquatic ecosystems for both
df1.query("Class == 'Mammal'").sample(n=8)
df1.query("Class == 'Reptile'").sample(n=2)
I've seen solutions like this that should work, but I can't figure out how to include the ecosystems requirement. In other words, I want 8 mammals and 2 reptiles selected from group 1, ensuring that 6 of them come from terrestrial ecosystems and 4 from aquatic. I think there should be a way to do this with a groupby on the two columns, but I haven't figured that out yet.
Sample data:
Common name | Class | Ecosystem |
---|---|---|
Lion | Mammal | Terrestrial |
Humpback whale | Mammal | Aquatic |
Crocodile | Reptile | Aquatic |
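The groupby idea mentioned in the question can be sketched as follows, under one strong assumption: a count is fixed in advance for every (Class, Ecosystem) combination. The `df1_demo` frame and the `cell_counts` allocation are hypothetical stand-ins, not data from the question; the allocation shown is only one of several that add up to 8 mammals, 2 reptiles, 6 terrestrial and 4 aquatic.

```python
import pandas as pd

# Hypothetical stand-in for df1: 20 rows using the question's column names.
df1_demo = pd.DataFrame({
    "Common name": [f"animal_{i}" for i in range(20)],
    "Class": ["Mammal"] * 14 + ["Reptile"] * 6,
    "Ecosystem": ["Terrestrial"] * 9 + ["Aquatic"] * 5
                 + ["Terrestrial"] * 3 + ["Aquatic"] * 3,
})

# Assumed allocation per (Class, Ecosystem) cell.
# Row totals: 8 mammals, 2 reptiles. Column totals: 6 terrestrial, 4 aquatic.
cell_counts = {
    ("Mammal", "Terrestrial"): 5,
    ("Mammal", "Aquatic"): 3,
    ("Reptile", "Terrestrial"): 1,
    ("Reptile", "Aquatic"): 1,
}

# Sample the requested number of rows from each group, then stack the pieces.
parts = [
    group.sample(n=cell_counts[key], random_state=0)
    for key, group in df1_demo.groupby(["Class", "Ecosystem"])
    if cell_counts.get(key, 0) > 0
]
df1_selected = pd.concat(parts)
```

Fixing the cell counts means only the rows within each cell are random; the split of mammals and reptiles across ecosystems is not. If that split should vary too, the allocation itself would have to be drawn at random first.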
I don't know how to do it in a clean way with just the built-in pandas functions like `groupby`. That said, here's a solution using `random` and lists.
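import random
import pandas as pd

animal_class = ["Mammal"] * 8 + ["Reptile"] * 2
ecosystem = ["Terrestrial"] * 6 + ["Aquatic"] * 4
random.shuffle(ecosystem)  # randomly pair each class slot with an ecosystem

remaining = df1.copy()  # rows not yet picked, so no row is selected twice
picked = []
for cls, eco in zip(animal_class, ecosystem):
    # sample() raises an error if no unpicked row matches this pair
    row = remaining.query("Class == @cls and Ecosystem == @eco").sample(n=1)
    picked.append(row)
    remaining = remaining.drop(row.index)
df1_selected = pd.concat(picked)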
Just change `animal_class` to do the same thing for df2.
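Since the same steps are needed for both datasets, the loop above could be wrapped in a function. This is a minimal sketch, not part of the original answer: `sample_with_quotas`, its parameters, and the `demo` frame are hypothetical names, and the retry on an unsatisfiable pairing is an added assumption about the desired behavior.

```python
import random
import pandas as pd

def sample_with_quotas(df, class_counts, eco_counts, seed=None, max_tries=100):
    """Pick distinct rows that meet per-Class and per-Ecosystem quotas."""
    rng = random.Random(seed)
    classes = [c for c, n in class_counts.items() for _ in range(n)]
    ecos = [e for e, n in eco_counts.items() for _ in range(n)]
    if len(classes) != len(ecos):
        raise ValueError("class and ecosystem quotas must sum to the same total")
    for _ in range(max_tries):
        rng.shuffle(ecos)  # new random class/ecosystem pairing on each attempt
        remaining = df
        picked = []
        for cls, eco in zip(classes, ecos):
            pool = remaining[(remaining["Class"] == cls)
                             & (remaining["Ecosystem"] == eco)]
            if pool.empty:
                break  # this pairing cannot be satisfied; try another shuffle
            row = pool.sample(n=1, random_state=rng.randrange(2**32))
            picked.append(row)
            remaining = remaining.drop(row.index)  # never pick a row twice
        else:
            return pd.concat(picked)
    raise ValueError("no feasible selection found; check quotas against the data")

# Hypothetical 20-row dataset with the question's column names.
demo = pd.DataFrame({
    "Common name": [f"animal_{i}" for i in range(20)],
    "Class": ["Mammal"] * 14 + ["Reptile"] * 6,
    "Ecosystem": ["Terrestrial"] * 9 + ["Aquatic"] * 5
                 + ["Terrestrial"] * 3 + ["Aquatic"] * 3,
})
result = sample_with_quotas(
    demo,
    class_counts={"Mammal": 8, "Reptile": 2},
    eco_counts={"Terrestrial": 6, "Aquatic": 4},
    seed=0,
)
```

For df2, only the `class_counts` argument would change. The function assumes the DataFrame index is unique, because picked rows are removed by index label.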