Pandas equivalent to dplyr dot
Sorry for the rather lengthy explanation, but I hope you will get the idea.
I'm an R user and I find the tidyverse's data-wrangling capabilities really powerful. Recently I started learning Python, and pandas in particular, to broaden my options in data analysis. Instinctively I try to do things in pandas the way I used to do them with dplyr.
So my question is whether there is any equivalent to the dplyr dot when using method chaining in pandas.
The example below computes, for each group, the minimum of all values greater than the current value in test_df['data'], and then repeats the same computation on the new column.
R example:
require(dplyr)
require(purrr)
test_df = data.frame(group = rep(c(1,2,3), each = 3),
                     data = c(1:9))
test_df %>%
  group_by(group) %>%
  mutate(., min_of_max = map_dbl(data, ~data[data > .x] %>% min())) %>%
  mutate(., min_of_max_2 = map_dbl(min_of_max, ~min_of_max[min_of_max > .x] %>% min()))
Output:
# A tibble: 9 x 4
# Groups: group [3]
group data min_of_max min_of_max_2
<dbl> <int> <dbl> <dbl>
1 1 1 2 3
2 1 2 3 Inf
3 1 3 Inf Inf
4 2 4 5 6
5 2 5 6 Inf
6 2 6 Inf Inf
7 3 7 8 9
8 3 8 9 Inf
9 3 9 Inf Inf
I know that dplyr doesn't even require the dot, but I included it to make the point of my question clearer.
Doing the same in Pandas
Invalid example:
import pandas as pd
import numpy as np
test_df = (
pd.DataFrame({'A': np.array([1,2,3]*3), 'B': np.array(range(1,10))})
.sort_values(by = ['A', 'B'])
)
(test_df.assign(min_of_max = test_df.apply(lambda x: (test_df.B[(test_df.B > x.B) &
(test_df.A[test_df.A == x.A])]).min(), axis = 1))
.assign(min_of_max2 = 'assume_dot_here'.apply(lambda x: (test_df.min_of_max[(test_df.min_of_max > x.min_of_max) &
(test_df.A[test_df.A == x.A])]).min(), axis = 1)))
In this example, being able to put a dot in the second .assign would be very useful, but it doesn't work in pandas.
Valid example, which breaks the chain:
# Restrict to the current row's group with a plain boolean comparison on A
test_df = test_df.assign(min_of_max = test_df.apply(lambda x:
    test_df.B[(test_df.B > x.B) & (test_df.A == x.A)].min(), axis = 1))
test_df = test_df.assign(min_of_max2 = test_df.apply(lambda x:
    test_df.min_of_max[(test_df.min_of_max > x.min_of_max) & (test_df.A == x.A)].min(), axis = 1))
Output:
A B min_of_max min_of_max2
0 1 1 4.0 7.0
3 1 4 7.0 NaN
6 1 7 NaN NaN
1 2 2 5.0 8.0
4 2 5 8.0 NaN
7 2 8 NaN NaN
2 3 3 6.0 9.0
5 3 6 9.0 NaN
8 3 9 NaN NaN
So is there any convenient way to refer to the object from the previous part of the chain in the second .assign? Using test_df.apply() in the second .assign takes the initial test_df, which does not yet have the computed test_df['min_of_max'] column.
Sorry for the somewhat unreadable Python code; I'm still figuring out how to write it more clearly.
In Pandas, chain the two assign calls, but do so in a way that does not rely on the original data frame, as the DataFrame.apply call on test_df does. The example below uses an equivalent list comprehension over the rows:
test_df = pd.DataFrame({'group': np.repeat([1,2,3],3), 'data': np.arange(1,10)})
# Each lambda receives the frame produced by the previous step as x,
# so the row positions are taken from x rather than from test_df.
(
    test_df.assign(min_of_max = lambda x: [np.min(x["data"].loc[(x["data"] > x["data"].iloc[i]) &
                                                               (x["group"] == x["group"].iloc[i])]
                                                 ) for i in range(len(x))])
           .assign(min_of_max_2 = lambda x: [np.min(x["min_of_max"].loc[(x["min_of_max"] > x["min_of_max"].iloc[i]) &
                                                                       (x["group"] == x["group"].iloc[i])]
                                                   ) for i in range(len(x))])
)
# group data min_of_max min_of_max_2
# 0 1 1 2.0 3.0
# 1 1 2 3.0 NaN
# 2 1 3 NaN NaN
# 3 2 4 5.0 6.0
# 4 2 5 6.0 NaN
# 5 2 6 NaN NaN
# 6 3 7 8.0 9.0
# 7 3 8 9.0 NaN
# 8 3 9 NaN NaN
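The repeated comprehension can also be factored out and applied per group, which is closer in spirit to the group_by(group) step in the dplyr version. The following is only a sketch: the helper name min_of_larger is made up for illustration, and it relies on groupby(...).transform accepting a function that returns a Series of the same length as each group.

```python
import numpy as np
import pandas as pd

def min_of_larger(values: pd.Series) -> pd.Series:
    """Hypothetical helper: for each element, the smallest value in the
    same Series that is strictly greater than it (NaN when there is none)."""
    return values.map(lambda v: values[values > v].min())

test_df = pd.DataFrame({"group": np.repeat([1, 2, 3], 3), "data": np.arange(1, 10)})

result = (
    test_df
    # d is the frame coming out of the previous step of the chain
    .assign(min_of_max=lambda d: d.groupby("group")["data"].transform(min_of_larger))
    .assign(min_of_max_2=lambda d: d.groupby("group")["min_of_max"].transform(min_of_larger))
)
print(result)
```

Because each lambda only touches its own argument d, the chain never refers back to test_df after the first line.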
However, just as you can combine the assignments in a single dplyr::mutate, you can combine the DataFrame.assign calls into one by passing each new column as a lambda keyword argument (not to be confused with the lambda in DataFrame.apply).
R
test_df <- data.frame(group = rep(c(1,2,3), each = 3), data = c(1:9))
test_df %>%
  group_by(group) %>%
  mutate(min_of_max = map_dbl(data, ~data[data > .x] %>% min()),
         min_of_max_2 = map_dbl(min_of_max, ~min_of_max[min_of_max > .x] %>% min()))
# # A tibble: 9 x 4
# # Groups: group [3]
# group data min_of_max min_of_max_2
# <dbl> <int> <dbl> <dbl>
# 1 1 1 2 3
# 2 1 2 3 Inf
# 3 1 3 Inf Inf
# 4 2 4 5 6
# 5 2 5 6 Inf
# 6 2 6 Inf Inf
# 7 3 7 8 9
# 8 3 8 9 Inf
# 9 3 9 Inf Inf
Pandas
test_df = pd.DataFrame({'group': np.repeat([1,2,3],3), 'data': np.arange(1,10)})
# Both lambdas receive the frame being built as x; the second one already sees min_of_max.
test_df.assign(min_of_max = lambda x: [np.min(x["data"].loc[(x["data"] > x["data"].iloc[i]) &
                                                           (x["group"] == x["group"].iloc[i])]
                                             ) for i in range(len(x))],
               min_of_max_2 = lambda x: [np.min(x["min_of_max"].loc[(x["min_of_max"] > x["min_of_max"].iloc[i]) &
                                                                   (x["group"] == x["group"].iloc[i])]
                                               ) for i in range(len(x))])
# group data min_of_max min_of_max_2
# 0 1 1 2.0 3.0
# 1 1 2 3.0 NaN
# 2 1 3 NaN NaN
# 3 2 4 5.0 6.0
# 4 2 5 6.0 NaN
# 5 2 6 NaN NaN
# 6 3 7 8.0 9.0
# 7 3 8 9.0 NaN
# 8 3 9 NaN NaN
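To isolate why the callable form works, here is a minimal sketch with made-up column names (x, y, z). A plain expression passed to assign is evaluated immediately against whatever frame the name points to, while a callable is evaluated by assign itself on the intermediate frame. As far as the pandas documentation describes it, keyword arguments within one assign call are processed in order (pandas 0.23+ on Python 3.6+), so a later callable can use a column created by an earlier keyword.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

# Eager: df["y"] is looked up on the original frame, which has no "y" column.
try:
    df.assign(y=df["x"] * 2).assign(z=df["y"] + 1)
    error = None
except KeyError as exc:
    error = exc

# Deferred: each lambda is called with the frame as it exists at that point.
chained = df.assign(y=lambda d: d["x"] * 2).assign(z=lambda d: d["y"] + 1)

# Same idea in a single call; z is computed after y has been added.
combined = df.assign(y=lambda d: d["x"] * 2, z=lambda d: d["y"] + 1)

print(chained.equals(combined))
```

In both working forms the original df is left unchanged, since assign returns a new frame.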
By the way, since Pandas was arguably modeled after R many years ago by Wes McKinney (see paper), base R tends to translate more directly to Pandas. Below, within mirrors the use of assign, and sapply mirrors the list comprehension.
Base R
test_df <- within(test_df, {
  min_of_max <- sapply(1:nrow(test_df),
                       function(i) min(data[data > data[i] &
                                            group == group[i]]))
  min_of_max_2 <- sapply(1:nrow(test_df),
                         function(i) min(min_of_max[min_of_max > min_of_max[i] &
                                                    group == group[i]]))
})
test_df[c("group", "data", "min_of_max", "min_of_max_2")]
# group data min_of_max min_of_max_2
# 1 1 1 2 3
# 2 1 2 3 Inf
# 3 1 3 Inf Inf
# 4 2 4 5 6
# 5 2 5 6 Inf
# 6 2 6 Inf Inf
# 7 3 7 8 9
# 8 3 8 9 Inf
# 9 3 9 Inf Inf
I think I have figured out a brief way to refer to the object from the previous part of the chain using lambda functions. When a lambda is passed to assign, its argument is the data frame produced by the previous part of the chain.
(test_df.assign(min_of_max = test_df.apply(lambda x: test_df.B[(test_df.B > x.B) &
                                                               (test_df.A == x.A)].min(), axis = 1))
        # y is the frame returned by the first assign, so it already has min_of_max
        .assign(min_of_max2 = lambda y: y.apply(lambda x: y.min_of_max[(y.min_of_max > x.min_of_max) &
                                                                       (y.A == x.A)].min(), axis = 1)))
Passing 'lambda y' in the second .assign makes y the output of the previous part of the chain.
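For chain steps that are not column assignments, DataFrame.pipe seems to be the nearest counterpart to the dplyr dot: it calls a function with the intermediate frame as its first argument. A small sketch follows; the helper add_group_mean and the column name group_mean are invented for illustration and are not part of the question.

```python
import numpy as np
import pandas as pd

test_df = pd.DataFrame({"group": np.repeat([1, 2, 3], 3), "data": np.arange(1, 10)})

def add_group_mean(df: pd.DataFrame, column: str, new_name: str) -> pd.DataFrame:
    """Hypothetical helper: append the per-group mean of `column` as `new_name`."""
    return df.assign(**{new_name: df.groupby("group")[column].transform("mean")})

result = (
    test_df
    .query("data > 1")
    # pipe passes the filtered frame as the first argument of add_group_mean
    .pipe(add_group_mean, column="data", new_name="group_mean")
    # a lambda works too; d is the frame from the previous step
    .pipe(lambda d: d[d["data"] > d["group_mean"]])
)
print(result)
```

The lambda in the last step plays the same role as `.` in a dplyr pipeline: it names the value flowing through the chain so it can be used more than once in one expression.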