Shortest way of splitting a pandas DataFrame column based on another column
Inspiration
In R, this is very easy:
data("iris")
bartlett.test(Sepal.Length ~ Species,data = iris)
The important thing about the data set is that the column Sepal.Length is numerical and Species is categorical.
Problem
In Python, scipy.stats.bartlett would need a separate array for each species (see docs).
What would be the easiest way to achieve this?
An easy way to get the dataset in Python:
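import numpy as np
import pandas as pd
import scipy.stats as ss  # the snippets below refer to scipy.stats as ss
from sklearn import datasets
iris = datasets.load_iris()
iris = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                    columns=["sepal.length", "sepal.width", "petal.length", "petal.width"] + ['species'])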
I really wanted this to work:
iris.groupby("species")["sepal.length"].apply(ss.bartlett)
but it didn't, because apply passes one group at a time while bartlett needs multiple sample vectors in a single call.
Following the groupby pattern, you can do a bit of manipulation and do this:
gb = iris.groupby('species')["sepal.length"]
ss.bartlett(*[gb.get_group(x).values for x in gb.groups])
The * unpacks the list into separate arguments to the function; the rest is just to get the groups into the right form for the function to take. As mentioned in the comments, the .values isn't needed here, so we can write it as:
gb = iris.groupby('species')["sepal.length"]
ss.bartlett(*[gb.get_group(x) for x in gb.groups])
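To make the unpacking concrete, here is a minimal sketch that checks the groupby version against passing each group by hand. It uses a small synthetic DataFrame as a stand-in for iris (the values are made up for illustration, and the name df is an assumption), so it runs without scikit-learn:

```python
import numpy as np
import pandas as pd
import scipy.stats as ss

# Synthetic stand-in for the iris data; the numbers are made up for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "species": np.repeat([0.0, 1.0, 2.0], 20),
    "sepal.length": np.concatenate([
        rng.normal(5.0, 0.3, 20),
        rng.normal(5.9, 0.5, 20),
        rng.normal(6.6, 0.6, 20),
    ]),
})

gb = df.groupby("species")["sepal.length"]

# One Series per group, unpacked into separate positional arguments.
unpacked = ss.bartlett(*[gb.get_group(x) for x in gb.groups])

# The same call written out by hand, one argument per species.
manual = ss.bartlett(
    df.loc[df["species"] == 0.0, "sepal.length"],
    df.loc[df["species"] == 1.0, "sepal.length"],
    df.loc[df["species"] == 2.0, "sepal.length"],
)

print(unpacked.statistic, manual.statistic)
```

Both calls should give the same statistic and p-value, since the list comprehension only automates writing out one argument per group.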
And just for completeness, if you really want to do it in one line:
ss.bartlett(*[x[1] for x in iris.groupby('species')["sepal.length"]])
But I personally find that less readable.
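If the one-liner is needed in several places, one option is to wrap it in a small function. The sketch below is only an illustration: bartlett_by_group and the demo DataFrame are hypothetical names and made-up data, not part of pandas or SciPy.

```python
import numpy as np
import pandas as pd
import scipy.stats as ss


def bartlett_by_group(df, value_col, group_col):
    """Hypothetical helper: Bartlett's test on value_col, split by group_col."""
    # Iterating over a grouped column yields (group key, Series) pairs.
    samples = [values.to_numpy() for _, values in df.groupby(group_col)[value_col]]
    return ss.bartlett(*samples)


# Made-up demo data: two groups with clearly different spread.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "group": ["a"] * 30 + ["b"] * 30,
    "value": np.concatenate([rng.normal(0.0, 1.0, 30), rng.normal(0.0, 3.0, 30)]),
})

result = bartlett_by_group(demo, "value", "group")
print(result.statistic, result.pvalue)
```

This keeps the unpacking in one place and leaves the call site readable.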