
plyr or dplyr in Python

This is more of a conceptual question; I do not have a specific problem.

I am learning Python for data analysis, but I am very familiar with R. One of the great things about R is plyr (and of course ggplot2), and dplyr is even better. Pandas has split-apply-combine as well, but in R I can do things like the following (in dplyr; it is a bit different in plyr, and I can now see how dplyr mimics the `.` notation from object-oriented programming):

   data %.% group_by(c(.....)) %.% summarise(new1 = ...., new2 = ...., ..... newn=....)

in which I create multiple summary calculations at the same time.

How do I do that in Python? Because

   df[...].groupby(.....).sum()

only sums columns, while in R I can have one mean, one sum, one special function, etc., in a single call.

I realize I can do all my operations separately and merge them, and that is fine if I am committed to Python. But when it comes down to choosing a tool, every line of code you have to type, check, and validate adds up in time.

In addition, dplyr also lets you add mutate statements, so it seems much more powerful. So what am I missing about pandas or Python?

My goal is to learn. I have spent a lot of effort learning Python and it is a worthy investment, but the question remains.

I'm also a big fan of dplyr for R and am working to improve my knowledge of Pandas. Since you don't have a specific problem, I'd suggest checking out the post linked below, which breaks down the entire introductory dplyr vignette and shows how all of it can be done with Pandas.

For example, the author demonstrates chaining with the pipe operator in R:

 flights %>%
   group_by(year, month, day) %>%
   select(arr_delay, dep_delay) %>%
   summarise(
      arr = mean(arr_delay, na.rm = TRUE),
      dep = mean(dep_delay, na.rm = TRUE)
       ) %>%
   filter(arr > 30 | dep > 30)

And here is the Pandas implementation:

(flights.groupby(['year', 'month', 'day'])
    [['arr_delay', 'dep_delay']]
    .mean()
    .query('arr_delay > 30 | dep_delay > 30'))
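To see the chain run end to end, here is a self-contained sketch on a tiny made-up stand-in for the nycflights13 data (the values below are invented for illustration, not the real dataset):

```python
import pandas as pd

# Hypothetical miniature version of the flights table
flights = pd.DataFrame({
    'year':      [2013, 2013, 2013, 2013],
    'month':     [1, 1, 1, 2],
    'day':       [1, 1, 2, 1],
    'arr_delay': [40.0, 20.0, 5.0, 50.0],
    'dep_delay': [10.0, 8.0, 2.0, 45.0],
})

# group_by + select + summarise(mean) + filter, as one pandas chain;
# the outer parentheses let the chain span multiple lines
result = (flights
          .groupby(['year', 'month', 'day'])
          [['arr_delay', 'dep_delay']]
          .mean()
          .query('arr_delay > 30 | dep_delay > 30'))
print(result)
```

Only the (2013, 2, 1) group survives the filter here, since the (2013, 1, 1) group averages out to an arr_delay of exactly 30, which fails the strict `> 30` test.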

There are many more comparisons of how to implement dplyr-like operations with Pandas in the original post: http://nbviewer.ipython.org/gist/TomAugspurger/6e052140eaa5fdb6e8c0

One could simply use dplyr from Python.

There is an interface to dplyr in rpy2 (introduced with rpy2-2.7.0) that lets you write things like:

dataf = (DataFrame(mtcars).
         filter('gear>3').
         mutate(powertoweight='hp*36/wt').
         group_by('gear').
         summarize(mean_ptw='mean(powertoweight)'))

There is an example in the documentation. That part of the docs is also a Jupyter notebook; look for the links near the top of the page.

Another answer to this question compares R's dplyr and pandas (see @lgallen's answer). The same R one-liner chaining dplyr statements is written essentially the same way in rpy2's interface to dplyr.

R:

flights %>%
   group_by(year, month, day) %>%
   select(arr_delay, dep_delay) %>%
   summarise(
      arr = mean(arr_delay, na.rm = TRUE),
      dep = mean(dep_delay, na.rm = TRUE)
      ) %>%
   filter(arr > 30 | dep > 30)

Python+rpy2:

(DataFrame(flights).
 group_by('year', 'month', 'day').
 select('arr_delay', 'dep_delay').
 summarize(arr = 'mean(arr_delay, na.rm=TRUE)',
           dep = 'mean(dep_delay, na.rm=TRUE)').
 filter('arr > 30 | dep > 30'))

I think you're looking for the agg function, which is applied to groupby objects.

From the docs:

In [48]: grouped = df.groupby('A')

In [49]: grouped['C'].agg([np.sum, np.mean, np.std])
Out[49]: 
          sum      mean       std
A                                
bar  0.443469  0.147823  0.301765
foo  2.529056  0.505811  0.96
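Beyond passing a list of functions, `agg` also supports named aggregation (available since pandas 0.25), which is the closest analogue to dplyr's `summarise(new1 = ..., new2 = ...)`: one sum, one mean, and a custom function on one call, each with its own output name. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical data in the shape of the docs example
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'C': [1.0, 2.0, 3.0, 4.0],
    'D': [10.0, 20.0, 30.0, 40.0],
})

# Named aggregation: new_column=(source_column, function)
out = df.groupby('A').agg(
    c_sum=('C', 'sum'),
    c_mean=('C', 'mean'),
    d_range=('D', lambda s: s.max() - s.min()),  # a "special function"
)
print(out)
```

Each keyword becomes a column in the result, so different source columns and different functions can be mixed freely in one call.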

The most similar way to use dplyr-style syntax in Python is with the dfply package. Here is an example.

R dplyr

library(nycflights13)
library(dplyr)

flights %>%
  filter(hour > 10) %>% # step 1
  mutate(speed = distance / (air_time * 60)) %>% # step 2
  group_by(origin) %>% # step 3a
  summarize(mean_speed = sprintf("%0.6f",mean(speed, na.rm = T))) %>% # step 3b
  arrange(desc(mean_speed)) # step 4

# A tibble: 3 x 2
  origin mean_speed
  <chr>  <chr>     
1 EWR    0.109777  
2 JFK    0.109427  
3 LGA    0.107362 

Python dfply

from dfply import *
import pandas as pd

flight_data = pd.read_csv('nycflights13.csv')

(flight_data >>
  mask(X.hour > 10) >> # step 1
  mutate(speed = X.distance / (X.air_time * 60)) >> # step 2
  group_by(X.origin) >> # step 3a
  summarize(mean_speed = X.speed.mean()) >> # step 3b
  arrange(X.mean_speed, ascending=False) # step 4
)


Out[1]: 
  origin  mean_speed
0    EWR    0.109777
1    JFK    0.109427
2    LGA    0.107362

Python Pandas

flight_data.loc[flight_data['hour'] > 10, 'speed'] = flight_data['distance'] / (flight_data['air_time'] * 60)
result = flight_data.groupby('origin', as_index=False)['speed'].mean()
result.sort_values('speed', ascending=False)

Out[2]: 
  origin     speed
0    EWR  0.109777
1    JFK  0.109427
2    LGA  0.107362
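The three pandas lines above can also be written as a single chain with `query` and `assign`, which reads closer in spirit to the dplyr pipeline. A sketch on a small made-up table (the values are invented, not the real nycflights13 data):

```python
import pandas as pd

# Hypothetical miniature flights table
flight_data = pd.DataFrame({
    'origin':   ['EWR', 'JFK', 'LGA', 'EWR'],
    'hour':     [11, 12, 13, 5],
    'distance': [1000.0, 1100.0, 900.0, 800.0],
    'air_time': [150.0, 170.0, 140.0, 120.0],
})

result = (flight_data
          .query('hour > 10')                                      # filter()
          .assign(speed=lambda d: d.distance / (d.air_time * 60))  # mutate()
          .groupby('origin', as_index=False)['speed'].mean()       # group_by() + summarize()
          .sort_values('speed', ascending=False))                  # arrange(desc())
print(result)
```

The `lambda d: ...` inside `assign` receives the intermediate DataFrame, so the new `speed` column is computed only on the rows that survived the `query`, just as mutate runs after filter in the dplyr pipeline.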


For dplyr, I use dfply, which has the same syntax except that the pipe operator is `>>` instead of `%>%`. You can use plotnine as a ggplot2 equivalent. I am not sharing code for dfply since it is already shared above, but you can check the link below for plotnine:

https://plotnine.readthedocs.io/en/stable/gallery.html

Now we have a close port of dplyr and other related packages from R to python:

https://github.com/pwwang/datar

Disclaimer: I am the author of the package.

One more example of group-by aggregations in R dplyr and Python Pandas, using the iris dataset: group by Species and summarise the max, mean, median, and min of selected columns.

library(tidyverse)

iris %>% group_by(Species) %>% 
  summarise(max(Sepal.Length),mean(Sepal.Width),median(Petal.Width),min(Petal.Length))

# A tibble: 3 x 5
  Species    `max(Sepal.Length)` `mean(Sepal.Width)` `median(Petal.Width)` `min(Petal.Length)`
  <fct>                    <dbl>               <dbl>                 <dbl>               <dbl>
1 setosa                     5.8                3.43                   0.2                 1  
2 versicolor                 7                  2.77                   1.3                 3  
3 virginica                  7.9                2.97                   2                   4.5

write_csv(iris, "iris.csv")

The same thing with Pandas:

import pandas as pd
import numpy as np
df = pd.read_csv("iris.csv")

df_gb = pd.DataFrame()
df_gb['max Sepal.Length'] = df.groupby(['Species']).max()['Sepal.Length']
df_gb['mean Sepal.Width'] = df.groupby(['Species']).mean()['Sepal.Width']
df_gb['median Petal.Width'] = df.groupby(['Species']).median()['Petal.Width']
df_gb['min Petal.Length'] = df.groupby(['Species']).min()['Petal.Length']                        
df_gb                        

           max Sepal.Length  mean Sepal.Width   median Petal.Width  min Petal.Length
Species             
setosa                 5.8             3.428                    0.2              1.0
versicolor             7.0             2.770                    1.3              3.0
virginica              7.9             2.974                    2.0              4.5
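The four separate `groupby` calls above can be collapsed into a single pass with named aggregation (pandas >= 0.25). A sketch on a small iris-like sample (made-up values, same column names as the real dataset):

```python
import pandas as pd

# Hypothetical iris-like sample
df = pd.DataFrame({
    'Species':      ['setosa', 'setosa', 'virginica', 'virginica'],
    'Sepal.Length': [5.1, 4.9, 6.3, 5.8],
    'Sepal.Width':  [3.5, 3.0, 3.3, 2.7],
    'Petal.Width':  [0.2, 0.2, 2.5, 1.9],
    'Petal.Length': [1.4, 1.4, 6.0, 5.1],
})

# One groupby, four different summaries, mirroring the dplyr summarise() call
df_gb = df.groupby('Species').agg(
    max_sepal_length=('Sepal.Length', 'max'),
    mean_sepal_width=('Sepal.Width', 'mean'),
    median_petal_width=('Petal.Width', 'median'),
    min_petal_length=('Petal.Length', 'min'),
)
print(df_gb)
```

Besides being shorter, this scans the grouped data once instead of four times, and the output column names are chosen explicitly rather than derived from the source columns.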
