Pandas equivalent to dplyr dot

I apologize for the rather lengthy explanation, but I hope you will get the idea.

I'm an R user and I find the tidyverse's data-wrangling capabilities really powerful. Recently I started learning Python, and pandas in particular, to broaden my options in data analysis. Instinctively, I try to do things in pandas the way I did them in dplyr.

So my question is whether there is any equivalent to the dplyr dot when using method chaining in pandas.

The example below computes, within each group, the minimum of all values in test_df['data'] that are greater than the current value, and then repeats the same computation on the newly created column.

R example:

require(dplyr)
require(purrr)
test_df = data.frame(group = rep(c(1,2,3), each = 3),
                     data= c(1:9))
test_df %>%
group_by(group) %>%
mutate(., min_of_max = map_dbl(data, ~data[data > .x] %>% min())) %>%
mutate(., min_of_max_2 = map_dbl(min_of_max, ~min_of_max[min_of_max > .x] %>% min()))

Output:

# A tibble: 9 x 4
# Groups:   group [3]
  group  data min_of_max min_of_max_2
  <dbl> <int>      <dbl>        <dbl>
1     1     1          2            3
2     1     2          3          Inf
3     1     3        Inf          Inf
4     2     4          5            6
5     2     5          6          Inf
6     2     6        Inf          Inf
7     3     7          8            9
8     3     8          9          Inf
9     3     9        Inf          Inf

I know that dplyr doesn't even require the dot, but I included it to make the specifics of my question clearer.

Doing the same in Pandas

Invalid example:

import pandas as pd
import numpy as np
test_df = (
    pd.DataFrame({'A': np.array([1,2,3]*3), 'B': np.array(range(1,10))})
    .sort_values(by = ['A', 'B'])
)
# 'assume_dot_here' is a placeholder for the dplyr dot; a str has no .apply, so this raises AttributeError
(test_df.assign(min_of_max = test_df.apply(lambda x: test_df.B[(test_df.B > x.B) &
                                                               (test_df.A == x.A)].min(), axis = 1))
    .assign(min_of_max2 = 'assume_dot_here'.apply(lambda x: test_df.min_of_max[(test_df.min_of_max > x.min_of_max) &
                                                                                (test_df.A == x.A)].min(), axis = 1)))

In this example, being able to put a dot in the second .assign would be very useful, but it doesn't work in pandas.

Valid example, which breaks the chain:

test_df = test_df.assign(min_of_max = test_df.apply(lambda x:
    test_df.B[(test_df.B > x.B) & (test_df.A == x.A)].min(), axis = 1))

test_df = test_df.assign(min_of_max2 = test_df.apply(lambda x:
    test_df.min_of_max[(test_df.min_of_max > x.min_of_max) & (test_df.A == x.A)].min(), axis = 1))

Output:

   A  B  min_of_max  min_of_max2
0  1  1         4.0          7.0
3  1  4         7.0          NaN
6  1  7         NaN          NaN
1  2  2         5.0          8.0
4  2  5         8.0          NaN
7  2  8         NaN          NaN
2  3  3         6.0          9.0
5  3  6         9.0          NaN
8  3  9         NaN          NaN

So is there any convenient way to refer to the object from the previous part of the chain in the second .assign? Using test_df.apply() in the second .assign takes the initial test_df, which does not yet have the computed test_df['min_of_max'] column.

Sorry for the somewhat unreadable Python code; I'm still figuring out how to write it more clearly.

In pandas, you can chain the two assign calls as long as neither one relies on the original data frame, which is what a DataFrame.apply call on test_df does. The example below passes a lambda to each assign and uses a list comprehension over the row positions:

import numpy as np
import pandas as pd

test_df = pd.DataFrame({'group': np.repeat([1,2,3],3), 'data': np.arange(1,10)})

# Each lambda receives the data frame produced by the previous step of the chain as x
(
   test_df.assign(min_of_max = lambda x: [np.min(x["data"].loc[(x["data"] > x["data"].iloc[i]) &
                                                               (x["group"] == x["group"].iloc[i])]
                                                ) for i in range(len(x))])
          .assign(min_of_max_2 = lambda x: [np.min(x["min_of_max"].loc[(x["min_of_max"] > x["min_of_max"].iloc[i]) &
                                                                       (x["group"] == x["group"].iloc[i])]
                                                  ) for i in range(len(x))])
)

#    group  data  min_of_max  min_of_max_2
# 0      1     1         2.0           3.0
# 1      1     2         3.0           NaN
# 2      1     3         NaN           NaN
# 3      2     4         5.0           6.0
# 4      2     5         6.0           NaN
# 5      2     6         NaN           NaN
# 6      3     7         8.0           9.0
# 7      3     8         9.0           NaN
# 8      3     9         NaN           NaN
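
As a possible alternative to the row-by-row list comprehension, the per-group logic can be handed to groupby, which is closer in spirit to dplyr's group_by followed by mutate. The following is only a minimal sketch: the helper name next_larger is hypothetical (it is not part of pandas), and the data frame is the same one used above.

```python
import numpy as np
import pandas as pd


def next_larger(s: pd.Series) -> pd.Series:
    """Hypothetical helper: for each value in s, return the smallest value
    in s that is strictly greater, or NaN when there is none."""
    return s.map(lambda v: s[s > v].min())


test_df = pd.DataFrame({"group": np.repeat([1, 2, 3], 3), "data": np.arange(1, 10)})

result = (
    test_df
    # d is the frame from the previous step, so nothing refers back to test_df
    .assign(min_of_max=lambda d: d.groupby("group")["data"].transform(next_larger))
    .assign(min_of_max_2=lambda d: d.groupby("group")["min_of_max"].transform(next_larger))
)
print(result)
```

Because the grouping is done by groupby, the helper only has to deal with one group at a time, so the explicit group == group[i] mask is no longer needed.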

However, just as you can combine the assignments in a single dplyr::mutate call, you can combine them in a single DataFrame.assign call by passing a lambda for each keyword argument (not to be confused with the lambda passed to DataFrame.apply). Within one assign call, later keyword arguments can refer to columns created by earlier ones.

R

test_df <- data.frame(group = rep(c(1,2,3), each = 3), data = c(1:9))

test_df %>%
  group_by(group) %>%
  mutate(min_of_max = map_dbl(data, ~data[data > .x] %>% min()),
         min_of_max_2 = map_dbl(min_of_max, ~min_of_max[min_of_max > .x] %>% min()))

# # A tibble: 9 x 4
# # Groups:   group [3]
#   group  data min_of_max min_of_max_2
#   <dbl> <int>      <dbl>        <dbl>
# 1     1     1          2            3
# 2     1     2          3          Inf
# 3     1     3        Inf          Inf
# 4     2     4          5            6
# 5     2     5          6          Inf
# 6     2     6        Inf          Inf
# 7     3     7          8            9
# 8     3     8          9          Inf
# 9     3     9        Inf          Inf

Pandas

test_df = pd.DataFrame({'group': np.repeat([1,2,3],3), 'data': np.arange(1,10)})

# The second lambda already sees the min_of_max column created by the first one
test_df.assign(min_of_max = lambda x: [np.min(x["data"].loc[(x["data"] > x["data"].iloc[i]) &
                                                            (x["group"] == x["group"].iloc[i])]
                                             ) for i in range(len(x))],
               min_of_max_2 = lambda x: [np.min(x["min_of_max"].loc[(x["min_of_max"] > x["min_of_max"].iloc[i]) &
                                                                    (x["group"] == x["group"].iloc[i])]
                                               ) for i in range(len(x))])

#    group  data  min_of_max  min_of_max_2
# 0      1     1         2.0           3.0
# 1      1     2         3.0           NaN
# 2      1     3         NaN           NaN
# 3      2     4         5.0           6.0
# 4      2     5         6.0           NaN
# 5      2     6         NaN           NaN
# 6      3     7         8.0           9.0
# 7      3     8         9.0           NaN
# 8      3     9         NaN           NaN
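
To isolate the behaviour that the combined call relies on, here is a small sketch with made-up column names (x, doubled and plus_one are illustrative only). Keyword arguments to assign are evaluated in order, and each lambda receives the frame including the columns added so far:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

out = df.assign(
    doubled=lambda d: d["x"] * 2,         # d has only column x here
    plus_one=lambda d: d["doubled"] + 1,  # d already contains doubled
)
print(out)
```

The original df is left untouched; assign returns a new data frame.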

By the way, since pandas was arguably modeled after R many years ago by Wes McKinney (see paper), base R tends to translate more directly to pandas. Below, within mirrors the use of assign, and sapply mirrors the list comprehension.

Base R

test_df <- within(test_df, {      
  min_of_max <- sapply(1:nrow(test_df), 
                       function(i) min(data[data > data[i] & 
                                            group == group[i]]))

  min_of_max_2 <- sapply(1:nrow(test_df), 
                         function(i) min(min_of_max[min_of_max > min_of_max[i] & 
                                                    group == group[i]]))      
})

test_df[c("group", "data", "min_of_max", "min_of_max_2")]

#   group data min_of_max min_of_max_2
# 1     1    1          2            3
# 2     1    2          3          Inf
# 3     1    3        Inf          Inf
# 4     2    4          5            6
# 5     2    5          6          Inf
# 6     2    6        Inf          Inf
# 7     3    7          8            9
# 8     3    8          9          Inf
# 9     3    9        Inf          Inf

I think I have figured out a concise way to refer to the object from the previous part of the chain using lambda functions. When a lambda is passed to assign, its argument is the data frame produced by the previous part of the chain.

(test_df.assign(min_of_max = test_df.apply(lambda x: test_df.B[(test_df.B > x.B) &
                                                               (test_df.A == x.A)].min(), axis = 1))
        .assign(min_of_max2 = lambda y: y.apply(lambda x: y.min_of_max[(y.min_of_max > x.min_of_max) &
                                                                        (y.A == x.A)].min(), axis = 1)))

Passing 'lambda y' in the second .assign makes y the output of the previous part of the chain.
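
The same idea appears to carry over to other chainable pandas methods that accept a callable, such as .loc and DataFrame.pipe. Below is a small sketch with made-up data (the columns A, B and share are illustrative only), showing that the lambda sees the filtered intermediate frame rather than the original one:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [10, 20, 30, 40]})

out = (
    df
    .query("B > 10")                                # keeps B = 20, 30, 40
    .assign(share=lambda d: d["B"] / d["B"].sum())  # d is the filtered frame, so the sum is 90
    .loc[lambda d: d["share"] > 0.3]                # .loc also accepts a callable
    .pipe(lambda d: d.reset_index(drop=True))       # pipe passes the frame to any function
)
print(out)
```

Had the assign step used df["B"].sum() instead, the denominator would have been 100 (the unfiltered total), which is exactly the problem the dot solves in dplyr.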
