Shortest way of splitting a pandas DataFrame column based on another column

Inspiration

In R, this is very easy:

data("iris")
bartlett.test(Sepal.Length ~ Species,data = iris)

The important thing about the data set is that the column Sepal.Length is numerical, while Species is categorical.

Problem

In Python, scipy.stats.bartlett needs a separate array for each species (see the docs).
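To make the calling convention concrete, here is a minimal sketch. The three lists are made-up toy samples, not the iris data, and the variable names are only illustrative. Each sample goes in as its own positional argument:

```python
import scipy.stats as ss

# Hypothetical toy samples, one sequence per group (not the iris data)
group_a = [5.1, 4.9, 4.7, 4.6, 5.0]
group_b = [7.0, 6.4, 6.9, 5.5, 6.5]
group_c = [6.3, 5.8, 7.1, 6.3, 6.5]

# bartlett takes each sample as a separate positional argument
result = ss.bartlett(group_a, group_b, group_c)

# The result unpacks into the test statistic and the p-value
statistic, pvalue = result
print(statistic, pvalue)
```

So the task is to get one such sequence per species out of a single DataFrame column.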

What would be the easiest way to achieve this?

An easy way to get the data set in Python:

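import numpy as np
import pandas as pd
import scipy.stats as ss  # referred to as ss below
from sklearn import datasets
iris = datasets.load_iris()
# Put the four feature columns and the numeric species code into one DataFrame
iris = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                    columns=["sepal.length", "sepal.width", "petal.length", "petal.width"] + ['species'])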

I really wanted this to work:

iris.groupby("species")["sepal.length"].apply(ss.bartlett)

but it doesn't, because bartlett needs multiple sample vectors passed as separate arguments.
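A likely reason, sketched on a small made-up DataFrame (the values are hypothetical), is that apply calls the function once per group and passes it a single Series each time, so bartlett never receives more than one sample per call. Substituting len for bartlett makes this visible:

```python
import pandas as pd

# Hypothetical toy data with the same column names as above
df = pd.DataFrame({
    "species": ["a", "a", "b", "b", "c", "c"],
    "sepal.length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
})

# apply invokes the function once per group, each time with one Series,
# so the result has one entry per species
sizes = df.groupby("species")["sepal.length"].apply(len)
print(sizes)
```

Each call gets exactly one group, which is one argument fewer than bartlett needs at minimum.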

Following the groupby pattern, you can do a bit of manipulation and write this:

gb = iris.groupby('species')["sepal.length"]
ss.bartlett(*[gb.get_group(x).values for x in gb.groups])

The * unpacks the list into separate positional arguments; the rest just gets the groups into the form the function expects. As mentioned in the comments, the .values isn't needed here, so we can write it as:

gb = iris.groupby('species')["sepal.length"]
ss.bartlett(*[gb.get_group(x) for x in gb.groups])
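As a small check of what the * does, the following sketch uses a made-up three-group DataFrame (hypothetical values) and compares the unpacked call with one that passes the groups explicitly:

```python
import pandas as pd
import scipy.stats as ss

# Hypothetical toy data: three groups of four values each
df = pd.DataFrame({
    "species": ["a"] * 4 + ["b"] * 4 + ["c"] * 4,
    "sepal.length": [5.1, 4.9, 4.7, 4.6,
                     7.0, 6.4, 6.9, 5.5,
                     6.3, 5.8, 7.1, 6.3],
})

gb = df.groupby("species")["sepal.length"]
# One Series per group, in the order of the group keys
samples = [gb.get_group(name) for name in gb.groups]

# Unpacking the list is the same as writing the arguments out by hand
unpacked = ss.bartlett(*samples)
explicit = ss.bartlett(samples[0], samples[1], samples[2])
print(unpacked.statistic, explicit.statistic)
```

The unpacked form has the advantage that it does not depend on how many species there are.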

And just for completeness, if you really want to do it in one line:

ss.bartlett(*[x[1] for x in iris.groupby('species')["sepal.length"]])

But I personally find that less readable.
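The x[1] works because iterating over a groupby object yields (key, group) pairs. One possible way to make that more explicit, shown here on made-up toy data, is to unpack each pair by name instead of indexing it:

```python
import pandas as pd
import scipy.stats as ss

# Hypothetical toy data: three groups of four values each
df = pd.DataFrame({
    "species": ["a"] * 4 + ["b"] * 4 + ["c"] * 4,
    "sepal.length": [5.1, 4.9, 4.7, 4.6,
                     7.0, 6.4, 6.9, 5.5,
                     6.3, 5.8, 7.1, 6.3],
})

grouped = df.groupby("species")["sepal.length"]

# Iterating over the groupby yields (key, Series) tuples
keys = [key for key, _ in grouped]

# Keep only the Series part of each pair, then unpack into bartlett
samples = [values for _, values in grouped]
result = ss.bartlett(*samples)
print(keys, result.pvalue)
```

This is the same computation as the one-liner, only with the pair structure spelled out.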
