Plot smoothed average of third variable by x and y
I am trying to make a 2D plot where the x and y axes are predictor variables. I would like to summarize a third variable smoothly, because the counts at any particular coordinate are very low.
For example, I might want to plot the probability of default against assets and debt. This is similar to a density plot, but rather than plotting the smoothed density of the observations, I want to plot an arbitrary smoothed value such as the default rate.
I have tried using stat_density_2d in ggplot2 but have not figured out how to make it summarize a third variable as the "density" instead of observation counts.
Sample data:
data(iris)
plt <- data.frame(iris[c(1,2)], y=as.numeric(iris$Species == "setosa"))
I want the output to look something like this:
library(ggplot2)
ggplot(plt, aes(x=Sepal.Length, y=Sepal.Width)) +
stat_density_2d(aes(fill= ..density..), geom="tile", contour=FALSE)
But instead of the color representing the density of observations, I want it to represent a summarized variable: in this case, the probability that Species == "setosa".
UPDATE2: Based on the discussion in chat, it looks like you're referring to a two-dimensional kernel smoothing function. The smoothie package might have what you need.
Regardless of how you estimate the loan default probability at a given (x,y) point (e.g., binned averages, logistic regression, kernel smoothing), you can map it to the fill color. I've called that variable p.default below, and you can create the plot with something like this:
ggplot(df, aes(assets, debt, fill=p.default)) + geom_tile()
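If kernel smoothing is the estimator you want, one way to get the fill values is a Nadaraya-Watson estimate: a kernel-weighted average of the 0/1 outcome at each grid point. The following is a minimal sketch, not the smoothie package's API. kernel_smooth_2d is a hypothetical helper written for this example, and the Gaussian kernel, the bandwidths hx and hy, and the grid limits are arbitrary assumptions you would need to tune for your own data.

```r
library(ggplot2)

# Hypothetical helper: Nadaraya-Watson kernel smoother in two dimensions.
# For each row of `grid` (columns x and y), weight every observation by a
# Gaussian product kernel and return the weighted mean of z.
kernel_smooth_2d <- function(x, y, z, grid, hx, hy) {
  vapply(seq_len(nrow(grid)), function(i) {
    w <- dnorm((grid$x[i] - x) / hx) * dnorm((grid$y[i] - y) / hy)
    sum(w * z) / sum(w)
  }, numeric(1))
}

# Sample data from the question: y is 1 for setosa, 0 otherwise
data(iris)
plt <- data.frame(iris[c(1, 2)], y = as.numeric(iris$Species == "setosa"))

# Grid of (x, y) points at which to estimate the smoothed proportion
grid <- expand.grid(x = seq(4, 8, length.out = 60),
                    y = seq(2, 4.5, length.out = 60))

# Bandwidths of 0.3 are an assumption chosen for illustration only
grid$p.setosa <- kernel_smooth_2d(plt$Sepal.Length, plt$Sepal.Width, plt$y,
                                  grid, hx = 0.3, hy = 0.3)

ggplot(grid, aes(x, y, fill = p.setosa)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "yellow", high = "red",
                       midpoint = 0.5, limits = c(0, 1)) +
  labs(x = "Sepal.Length", y = "Sepal.Width")
```

Larger bandwidths give a smoother surface at the cost of blurring real boundaries. Grid points far from any observation still get a value, so you may want to mask regions with little data.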
UPDATE: Regarding your comment, for the iris example, you'd need to average the y values over regions of Sepal.Length and Sepal.Width to get the average probability. These data are pretty sparse, so you'll need relatively large cells to get more than one observation per cell. Also, Sepal.Length and Sepal.Width fall in almost completely different regions for each species, so you'll still get all 1s or all 0s in almost all cells. In the example below, I just assign random values of 1 and 0 in order to get a mix of 1s and 0s in several cells.
library(ggplot2)
library(dplyr)
# Fake data
set.seed(5)
plt <- data.frame(iris[c(1,2)], y=sample(0:1, nrow(iris), replace=TRUE))
In the code below, we use the cut function to cut Sepal.Length and Sepal.Width into 10 ranges each. Then we average the 1s and 0s in each cell to get the average of y for each cell. This average y value is then represented by the fill color gradient.
plt %>% group_by(Sepal.Length = cut(Sepal.Length, 10),
Sepal.Width = cut(Sepal.Width, 10)) %>%
summarise(y=mean(y)) %>%
ggplot(aes(Sepal.Width, Sepal.Length, fill=y)) +
geom_tile() +
theme_classic()
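The same binned average can probably be obtained without the manual cut step, because ggplot2 provides stat_summary_2d, which bins the x-y plane and applies a summary function to a z aesthetic. This is a sketch under the same fake-data assumption as above. The choice of bins = 10 mirrors the 10 ranges used with cut, although the bin edges are not guaranteed to match exactly.

```r
library(ggplot2)

# Same fake data as above: random 0/1 outcomes on the iris sepal measurements
set.seed(5)
plt <- data.frame(iris[c(1, 2)], y = sample(0:1, nrow(iris), replace = TRUE))

# stat_summary_2d bins x and y, then applies `fun` to z within each bin;
# the result is mapped to the fill color by default
p <- ggplot(plt, aes(Sepal.Length, Sepal.Width, z = y)) +
  stat_summary_2d(fun = mean, bins = 10) +
  scale_fill_gradient2(low = "blue", mid = "yellow", high = "red",
                       midpoint = 0.5, limits = c(0, 1)) +
  theme_classic()
p
```

Because the axes stay numeric rather than becoming factor labels such as (4.3,4.66], the tick labels are easier to read than in the cut version.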
Or, we could fit a logistic regression model, which would give us predictions of y for any combination of Sepal.Length and Sepal.Width:
# Logistic regression model
m1 = glm(y ~ poly(Sepal.Length,2)*poly(Sepal.Width,2), family="binomial", data=plt)
# Get predictions on a grid of values
df = expand.grid(Sepal.Length=seq(4,8,length=100), Sepal.Width=seq(2,5,length=100))
df$y.pred = predict(m1, newdata=df, type="response")
ggplot(df, aes(Sepal.Width, Sepal.Length, fill=y.pred)) +
geom_tile() +
theme_classic() +
scale_fill_gradient2(low="blue",mid="yellow",high="red", midpoint=0.5,limits=c(0,1))
The general idea is that you need a single value (let's call it z) to associate with each (x,y) point on your graph. You can calculate those z values by averaging over regions in the (x,y) plane, by fitting a model, and so on. Once you have the z value that goes with each (x,y) point, you can generate a tile plot where z is the fill aesthetic.
Original Answer
It sounds like maybe you want a heat map. The fill color would represent the value of the third variable, in this case probability of default. Perhaps something like this:
library(ggplot2)
# Fake data
df = expand.grid(income=seq(1,1e5,length=100), debt=seq(1,5e5,length=100))
df$p.default = df$income - 0.3*df$debt
df$p.default = df$p.default - max(df$p.default)
df$p.default = abs(df$p.default)/max(abs(df$p.default))
ggplot(df, aes(income, debt, fill=p.default)) +
geom_tile() +
scale_fill_gradient2(limits=c(0,1), low="blue", mid="yellow", high="red", midpoint=0.5)