Showing data and model predictions in one plot using Seaborn and Statsmodels

Seaborn is a great package for high-level plotting with attractive output. However, I'm struggling to use Seaborn to overlay data and model predictions from an externally fit model. In this example I am fitting models in Statsmodels that are too complex for Seaborn to fit out of the box, but I think the problem is more general (i.e. I have model predictions and want to visualise both them and the data using Seaborn).

Let's start with imports and a dataset:

import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf
import patsy
import itertools
import matplotlib.pyplot as plt

np.random.seed(12345)

# make a data frame with one continuous and two categorical variables:
df = pd.DataFrame({'x1': np.random.normal(size=100),
                     'x2': np.tile(np.array(['a', 'b']), 50),
                     'x3': np.repeat(np.array(['c', 'd']), 50)})

# create a design matrix using patsy:
X = patsy.dmatrix('x1 * x2 * x3', df)

# some random beta weights:
betas = np.random.normal(size=X.shape[1])

# create the response variable as the noisy linear combination of predictors:
df['y'] = np.inner(X, betas) + np.random.normal(size=100)

We fit a model in statsmodels containing all predictor variables and their interactions:

# fit a model with all interactions
fit = smf.ols('y ~ x1 * x2 * x3', df).fit()
print(fit.summary())

Since in this case every combination of the variables is specified, and our model predictions are linear, it would suffice for plotting to add a new "predictions" column to the dataframe containing the model predictions. However, that's not very general (imagine our model is nonlinear, so we want our plots to show smooth curves), so instead I make a new dataframe with all combinations of predictors, then generate predictions:

# create a new dataframe of predictions, using pandas' expand grid:
def expand_grid(data_dict):
    """ A port of R's expand.grid function for use with Pandas dataframes.

    from http://pandas.pydata.org/pandas-docs/stable/cookbook.html?highlight=expand%20grid

    """
    rows = itertools.product(*data_dict.values())
    return pd.DataFrame.from_records(rows, columns=data_dict.keys())


# build a new matrix with expand grid:

preds = expand_grid(
                {'x1': np.linspace(df['x1'].min(), df['x1'].max(), 2),
                 'x2': ['a', 'b'],
                 'x3': ['c', 'd']})
preds['yhat'] = fit.predict(preds)

The preds dataframe looks like this:

  x3        x1 x2      yhat
0  c -2.370232  a -1.555902
1  c -2.370232  b -2.307295
2  c  3.248944  a -1.555902
3  c  3.248944  b -2.307295
4  d -2.370232  a -1.609652
5  d -2.370232  b -2.837075
6  d  3.248944  a -1.609652
7  d  3.248944  b -2.837075
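
Two x1 values per cell are enough to draw a straight line. For the nonlinear case mentioned above, the same grid could simply be made denser. A minimal sketch, assuming a hypothetical `n_points` setting and made-up x1 limits (neither is part of the code above):

```python
import itertools

import numpy as np
import pandas as pd


def expand_grid(data_dict):
    """Return a DataFrame with one row per combination of the given values."""
    rows = list(itertools.product(*data_dict.values()))
    return pd.DataFrame.from_records(rows, columns=list(data_dict.keys()))


# Hypothetical choice: 50 evenly spaced x1 values instead of 2.
n_points = 50
dense = expand_grid({"x1": np.linspace(-2.0, 3.0, n_points),
                     "x2": ["a", "b"],
                     "x3": ["c", "d"]})
print(dense.shape)  # (200, 3)
```

Passing `dense` to `fit.predict` in place of `preds` should then give enough points per cell for a smooth curve.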

Since Seaborn plot commands (unlike ggplot2 commands in R) appear to accept one and only one dataframe, we need to merge our predictions into the raw data:

# append the predictions to df (DataFrame.append was removed in pandas 2.0):
merged = pd.concat([df, preds], ignore_index=True)
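
To make clear what this stacking produces, here is a small sketch on toy frames (the names `data`, `grid` and `stacked` and all values are made up for illustration). Columns that exist in only one frame are filled with NaN in the rows coming from the other:

```python
import pandas as pd

# Toy stand-ins for the observations and for the prediction grid.
data = pd.DataFrame({"x1": [0.1, 0.5, 0.9],
                     "x2": ["a", "b", "a"],
                     "y": [1.0, 2.0, 1.5]})
grid = pd.DataFrame({"x1": [0.0, 1.0],
                     "x2": ["a", "a"],
                     "yhat": [0.8, 1.7]})

# Stack the two frames row-wise; missing columns become NaN.
stacked = pd.concat([data, grid], ignore_index=True)
print(stacked)

# Observation rows carry no prediction, prediction rows carry no observation.
n_missing_yhat = stacked["yhat"].isna().sum()  # rows that came from `data`
n_missing_y = stacked["y"].isna().sum()        # rows that came from `grid`
```

So a layer that plots `y` effectively sees only the observation rows, and a layer that plots `yhat` only the prediction rows.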

We can now plot the model predictions along with the data, with our continuous variable x1 as the x-axis:

# plot using seaborn:
sns.set_style('white')
sns.set_context('talk')
g = sns.FacetGrid(merged, hue='x2', col='x3', height=5)  # `size` was renamed to `height` in seaborn 0.9
# use the `map` method to add stuff to the facetgrid axes:
g.map(plt.plot, "x1", "yhat")
g.map(plt.scatter, "x1", "y")
g.add_legend()
g.fig.subplots_adjust(wspace=0.3)
sns.despine(offset=10);

So far so good. Now imagine that we didn't measure the continuous variable x1, and we only know about the other two (categorical) variables (i.e., we have a 2x2 factorial design). How can we plot the model predictions against data in this case?

fit = smf.ols('y ~ x2 * x3', df).fit()
print(fit.summary())

preds = expand_grid(
                {'x2': ['a', 'b'],
                 'x3': ['c', 'd']})
preds['yhat'] = fit.predict(preds)
print(preds)

# append the predictions to df (DataFrame.append was removed in pandas 2.0):
merged = pd.concat([df, preds], ignore_index=True)

Well, we can plot the model predictions using sns.pointplot or similar, like so:

# plot using seaborn:
g = sns.FacetGrid(merged, hue='x3', height=4)  # `size` was renamed to `height` in seaborn 0.9
g.map(sns.pointplot, 'x2', 'yhat')
g.add_legend();
sns.despine(offset=10);

Or the data using sns.catplot (called sns.factorplot in older Seaborn releases) like so:

g = sns.catplot(x='x2', y='y', hue='x3', kind='point', data=merged)
sns.despine(offset=10);
g.savefig('tmp3.png')

But I do not see how to produce a plot similar to the first one (i.e. lines for model predictions using plt.plot, a scatter of points for data using plt.scatter). The reason is that the x2 variable I'm trying to use as the x-axis is a string / object, so the pyplot commands don't know what to do with it.
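
One possible workaround, sketched here outside Seaborn and only as an illustration, is to map the category labels to integer positions by hand so that plt.plot and plt.scatter receive numbers. The toy data, the `pos` mapping and the use of cell means as a stand-in for `fit.predict` are all assumptions made for this example (for a model containing both factors and their interaction, the OLS predictions coincide with the cell means):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy 2x2 factorial data (made up for this sketch).
rng = np.random.default_rng(0)
data = pd.DataFrame({"x2": np.tile(["a", "b"], 20),
                     "x3": np.repeat(["c", "d"], 20)})
data["y"] = rng.normal(size=len(data))

# Stand-in for model predictions: one mean per (x2, x3) cell.
preds = (data.groupby(["x2", "x3"], as_index=False)["y"].mean()
             .rename(columns={"y": "yhat"}))

# Map each category label to an integer x position.
levels = ["a", "b"]
pos = {level: i for i, level in enumerate(levels)}

fig, ax = plt.subplots()
for x3_level, colour in zip(["c", "d"], ["C0", "C1"]):
    obs = data[data["x3"] == x3_level]
    fitted = preds[preds["x3"] == x3_level]
    # Points for the data, a line for the predictions, both on numeric positions.
    ax.scatter(obs["x2"].map(pos), obs["y"], color=colour, alpha=0.5)
    ax.plot(fitted["x2"].map(pos), fitted["yhat"], color=colour, label=x3_level)

# Put the original labels back on the axis.
ax.set_xticks(list(pos.values()))
ax.set_xticklabels(levels)
ax.set_xlabel("x2")
ax.set_ylabel("y")
ax.legend(title="x3")
```

This gives up the convenience of FacetGrid, so it is more of a fallback than an answer to the Seaborn question itself.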

As I mention in my comments, there are two ways I would think about doing this.

The first is to define a function that does the fit and then plots, and pass it to FacetGrid.map:

import pandas as pd
import seaborn as sns
tips = sns.load_dataset("tips")

def plot_good_tip(day, total_bill, **kws):

    expected_tip = (total_bill.groupby(day)
                              .mean()
                              .apply(lambda x: x * .2)
                              .reset_index(name="tip"))
    # recent seaborn versions require x and y as keyword arguments
    sns.pointplot(x=expected_tip.day, y=expected_tip.tip,
                  linestyles=["--"], markers=["D"])

g = sns.FacetGrid(tips, col="sex", height=5)  # `size` was renamed to `height` in seaborn 0.9
g.map(sns.pointplot, "day", "tip")
g.map(plot_good_tip, "day", "total_bill")
g.set_axis_labels("day", "tip")

The second is to compute the predicted values and then merge them into your DataFrame with an additional variable that identifies what is data and what is model:

tip_predict = (tips.groupby(["day", "sex"])
                   .total_bill
                   .mean()
                   .apply(lambda x: x * .2)
                   .reset_index(name="tip"))
tip_all = pd.concat(dict(data=tips[["day", "sex", "tip"]], model=tip_predict),
                    names=["kind"]).reset_index()

# factorplot was renamed to catplot in seaborn 0.9; pass variables by keyword
sns.catplot(x="day", y="tip", hue="kind", data=tip_all, col="sex",
            kind="point", linestyles=["-", "--"], markers=["o", "D"])
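
To see what the `pd.concat(dict(...), names=["kind"])` step does on its own, here is a small sketch with toy frames (`observed`, `predicted` and their values are made up). The dictionary keys become an outer index level named "kind", and `reset_index` turns that level into an ordinary column that can be used as `hue`:

```python
import pandas as pd

# Toy stand-ins for the raw data and the model predictions.
observed = pd.DataFrame({"day": ["Thur", "Fri"], "tip": [3.0, 2.5]})
predicted = pd.DataFrame({"day": ["Thur", "Fri"], "tip": [2.8, 2.9]})

# The dict keys label each block of rows; names=["kind"] names that index level.
combined = pd.concat({"data": observed, "model": predicted},
                     names=["kind"]).reset_index()
print(combined)
```

Each row now says whether it is data or model, which is the identifying variable described above.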
