
Shortest way of splitting a pandas DataFrame column based on another column

Inspiration

In R, this is very easy:

data("iris")
bartlett.test(Sepal.Length ~ Species,data = iris)

The important thing about the data set is that the column Sepal.Length is numerical and the column Species is categorical.

Problem

In Python, scipy.stats.bartlett needs a separate array for each species (see the docs).

What would be the easiest way to achieve this?

An easy way to get the dataset in python:

import numpy as np
import pandas as pd
import scipy.stats as ss  # the `ss` alias used below
from sklearn import datasets
iris = datasets.load_iris()
iris = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                    columns=["sepal.length", "sepal.width", "petal.length", "petal.width"] + ['species'])

I really wanted this to work:

iris.groupby("species")["sepal.length"].apply(ss.bartlett)

but it didn't, because bartlett needs multiple sample vectors and apply passes it only one group at a time.

Following the groupby pattern you can do a bit of manipulation and do this:

gb = iris.groupby('species')["sepal.length"]
ss.bartlett(*[gb.get_group(x).values for x in gb.groups])

The * unpacks the list into separate positional arguments; the rest just gets the groups into the form the function expects. As mentioned in the comments, the .values isn't needed here, so we can write it as:

gb = iris.groupby('species')["sepal.length"]
ss.bartlett(*[gb.get_group(x) for x in gb.groups])
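
To make the role of the * concrete, here is a minimal sketch on a small made-up DataFrame (the column names "value" and "group" and the random data are assumptions for illustration, not part of the iris set). It compares the unpacked call with passing each group by hand:

```python
import numpy as np
import pandas as pd
import scipy.stats as ss

# Toy data, made up for illustration: one numeric and one categorical column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "value": rng.normal(size=30),
    "group": np.repeat(["a", "b", "c"], 10),
})

gb = df.groupby("group")["value"]
# gb.groups maps each group key to its row labels; iterating it yields the keys.
samples = [gb.get_group(name) for name in gb.groups]

# Unpacking the list with * ...
unpacked = ss.bartlett(*samples)
# ... should match passing each group as its own positional argument.
explicit = ss.bartlett(samples[0], samples[1], samples[2])

print(unpacked.statistic == explicit.statistic)
```

The unpacked form has the advantage that it does not need to know how many groups there are.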

And just for completeness, if you really want to do it in one line:

ss.bartlett(*[x[1] for x in iris.groupby('species')["sepal.length"]])

But I personally find that less readable.
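
If the one-liner is needed in several places, one option is to wrap it in a small function. The sketch below is only an illustration: `bartlett_by_group` is a hypothetical helper name, and the toy data and column names are made up. It relies on the fact that iterating over a grouped column yields (key, Series) pairs, which is what `x[1]` picks out in the one-liner above.

```python
import pandas as pd
import scipy.stats as ss


def bartlett_by_group(df, value_col, group_col):
    """Hypothetical helper: run Bartlett's test on value_col, split by group_col."""
    # Iterating over the grouped column yields (group_key, Series) pairs;
    # only the Series part is passed on to the test.
    samples = [values for _, values in df.groupby(group_col)[value_col]]
    return ss.bartlett(*samples)


# Toy data, made up for illustration.
df = pd.DataFrame({
    "value": [1.0, 2.0, 4.0, 3.0, 5.0, 9.0, 2.0, 2.5, 3.5],
    "group": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
})

result = bartlett_by_group(df, "value", "group")
print(result.statistic, result.pvalue)
```

With the iris frame from the question, the call would be `bartlett_by_group(iris, "sepal.length", "species")`.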
