
Plot smoothed average of third variable by x and y

I am trying to make a 2D plot where the x and y axes are predictor variables. I would like to summarize a third variable smoothly, because the counts at any particular coordinate are very low.

For example, I might want to plot the probability of default against assets and debt. This is similar to a density plot, but rather than plotting the smoothed density of the observations, I want to plot an arbitrary smoothed value such as the default rate.

I have tried using stat_density_2d in ggplot2 but have not figured out how to make it summarize a third variable as the "density" instead of observation counts.

Sample data:

data(iris)
plt <- data.frame(iris[c(1,2)], y=as.numeric(iris$Species == "setosa"))

I want the output to look something like this:

library(ggplot2)

ggplot(plt, aes(x=Sepal.Length, y=Sepal.Width)) + 
  stat_density_2d(aes(fill = after_stat(density)), geom = "tile", contour = FALSE)


But instead of the color representing the density of observations, I want it to represent a summarized variable: in this case, the probability that Species == "setosa".

UPDATE2: Based on the discussion in chat, it looks like you're referring to a two-dimensional kernel smoothing function. The smoothie package might have what you need.

Regardless of how you estimate the loan default probability at a given (x,y) point (e.g., binned averages, logistic regression, kernel smoothing), you can create the plot with something like the following, where p.default is my name for the variable that gets mapped to the fill color:

ggplot(df, aes(assets, debt, fill=p.default)) + geom_tile() 
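
As a rough illustration of the kernel smoothing option, here is a minimal base R sketch of a kernel-weighted mean (a Nadaraya-Watson style estimate) evaluated on a grid. It is not the smoothie package's method. The helper name kernel_mean_2d, the Gaussian kernel, and the bandwidths bw_x and bw_y are my own assumptions and have not been tuned. It uses the iris sample data from the question rather than loan data.

```r
library(ggplot2)

# Hypothetical helper: kernel-weighted mean of z at every grid point,
# using a product of Gaussian kernels. bw_x and bw_y are the bandwidths
# (in the units of x and y) and are assumed values, not tuned ones.
kernel_mean_2d <- function(x, y, z, grid_x, grid_y, bw_x, bw_y) {
  grid <- expand.grid(x = grid_x, y = grid_y)
  grid$z <- mapply(function(gx, gy) {
    w <- dnorm((x - gx) / bw_x) * dnorm((y - gy) / bw_y)
    sum(w * z) / sum(w)  # weighted average of z near (gx, gy)
  }, grid$x, grid$y)
  grid
}

# Sample data from the question: y is 1 for setosa, 0 otherwise
data(iris)
plt <- data.frame(iris[c(1, 2)], y = as.numeric(iris$Species == "setosa"))

sm <- kernel_mean_2d(
  x = plt$Sepal.Length, y = plt$Sepal.Width, z = plt$y,
  grid_x = seq(4, 8, length.out = 60),
  grid_y = seq(2, 4.5, length.out = 60),
  bw_x = 0.3, bw_y = 0.2
)

# The smoothed value is the fill, exactly as with p.default above
p_kernel <- ggplot(sm, aes(x, y, fill = z)) +
  geom_tile() +
  scale_fill_gradient(limits = c(0, 1))
p_kernel
```

Larger bandwidths give a smoother surface at the cost of blurring real structure, so the values above are only a starting point.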

UPDATE: Regarding your comment, for the iris example, you'd need to average the y values over regions of Sepal.Length and Sepal.Width to get the average probability. These data are pretty sparse, so you'll need relatively large cells to get more than one observation per cell. Also, Sepal.Length and Sepal.Width fall in almost completely different regions for each species, so you'll still get all 1s or all 0s in almost all cells. In the example below, I just assign random values of 1 and 0 in order to get a mix of 1s and 0s in several cells.

library(ggplot2)
library(dplyr)

# Fake data
set.seed(5)
plt <- data.frame(iris[c(1,2)], y=sample(0:1, nrow(iris), replace=TRUE))

In the code below, we use the cut function to cut Sepal.Length and Sepal.Width into 10 ranges each. Then we average the 1s and 0s in each cell to get the average of y for each cell. This average y value is then represented by the fill color gradient.

plt %>% group_by(Sepal.Length = cut(Sepal.Length, 10),
                 Sepal.Width = cut(Sepal.Width, 10)) %>%
  summarise(y=mean(y)) %>%
  ggplot(aes(Sepal.Width, Sepal.Length, fill=y)) +
  geom_tile() + 
  theme_classic()

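
The same binned average can probably be obtained without the manual cut and summarise steps by using ggplot2's stat_summary_2d, which bins x and y and applies a summary function to a z aesthetic. This is a sketch of that alternative, using the same fake data. The choice of bins = 10 simply mirrors the 10 ranges above, and the cell boundaries will not match cut exactly.

```r
library(ggplot2)

# Same fake data as above
set.seed(5)
plt <- data.frame(iris[c(1, 2)], y = sample(0:1, nrow(iris), replace = TRUE))

# z is the variable to summarise; fun is applied to z within each 2D bin,
# and the result is mapped to the fill color by default
p_binned <- ggplot(plt, aes(Sepal.Length, Sepal.Width, z = y)) +
  stat_summary_2d(fun = mean, bins = 10) +
  theme_classic()
p_binned
```

Swapping fun for another summary (for example median, or a function that returns NA for cells with too few observations) changes what the fill represents without changing the rest of the plot.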

Or, we could fit a logistic regression model, which would give us predictions of y for any combination of Sepal.Length and Sepal.Width:

# Logistic regression model
m1 = glm(y ~ poly(Sepal.Length,2)*poly(Sepal.Width,2), family="binomial", data=plt)

# Get predictions on a grid of values
df = expand.grid(Sepal.Length=seq(4,8,length=100), Sepal.Width=seq(2,5,length=100))
df$y.pred = predict(m1, newdata=df, type="response")

ggplot(df, aes(Sepal.Width, Sepal.Length, fill=y.pred)) +
  geom_tile() + 
  theme_classic() +
  scale_fill_gradient2(low="blue",mid="yellow",high="red", midpoint=0.5,limits=c(0,1))

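
If the quadratic polynomial terms feel too rigid, one possible variation is a generalized additive model with a two-dimensional smooth, fitted here with mgcv::gam. This is only a sketch under the same fake data. The model name m2 and the default smooth settings are assumptions, not something the approach above depends on.

```r
library(mgcv)
library(ggplot2)

# Same fake data as above
set.seed(5)
plt <- data.frame(iris[c(1, 2)], y = sample(0:1, nrow(iris), replace = TRUE))

# Logistic GAM with a joint smooth of both predictors;
# mgcv estimates how wiggly the surface should be
m2 <- gam(y ~ s(Sepal.Length, Sepal.Width), family = binomial, data = plt)

# Predicted probabilities on the same grid as the logistic regression example
df <- expand.grid(Sepal.Length = seq(4, 8, length.out = 100),
                  Sepal.Width = seq(2, 5, length.out = 100))
df$y.pred <- as.numeric(predict(m2, newdata = df, type = "response"))

p_gam <- ggplot(df, aes(Sepal.Width, Sepal.Length, fill = y.pred)) +
  geom_tile() +
  theme_classic() +
  scale_fill_gradient2(low = "blue", mid = "yellow", high = "red",
                       midpoint = 0.5, limits = c(0, 1))
p_gam
```

Because the fake y values are random, the fitted surface should be close to flat here. With real data the smooth can follow local structure that a fixed polynomial would miss.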

The general idea is that you need a single value (let's call it z) to associate with each (x,y) point on your graph. You can calculate those z values by averaging over regions in the (x,y) plane, by fitting a model, and so on. Once you have the z value that goes with each (x,y) point, you can generate a tile plot where z is the fill aesthetic.

Original Answer

It sounds like maybe you want a heat map. The fill color would represent the value of the third variable, in this case the probability of default. Perhaps something like this:

library(ggplot2)

# Fake data
df = expand.grid(income=seq(1,1e5,length=100), debt=seq(1,5e5,length=100))
df$p.default = df$income - 0.3*df$debt
df$p.default = df$p.default - max(df$p.default)
df$p.default = abs(df$p.default)/max(abs(df$p.default))

ggplot(df, aes(income, debt, fill=p.default)) + 
  geom_tile() +
  scale_fill_gradient2(limits=c(0,1), low="blue", mid="yellow", high="red", midpoint=0.5)

