
How to keep a variable in fit$model for lm() in R that I'm *not* using within the lm call itself?

I want to be able to index my model after having fit it. Say I have

df <- data.frame(a = c(1,2,3), 
                 b = c(2,3,1000), 
                 country = c("Malawi", "USA","UK"))

Then, I run:

fit<-lm(a~b,data=df)

My resulting fit$model no longer has the "country" variable, so it becomes hard to do things like

  • run a regression and then remove certain countries as robustness tests.
  • run a regression and then find out which countries were outliers.

I know there are 'hacks' around this, like using row indices, but I frequently find myself further subsetting the original dataset, and I am wary of having to keep track of row indices.

E.g., from the example above, I see that UK is an outlier.

So, I have two options:

lm(a~b,data=fit$model[-3,])
lm(a~b,data=df[df$country!="UK",])

The second option is much clearer to me, but because summary statistics and tests in R (such as Cook's distance) only give me the row index, I end up having to use the first option much more than I would like. This becomes especially tedious in large panel datasets where I'm trying to test robustness to outliers or high-leverage observations and would also like to know which countries (or other variables) those observations belong to.

Ideally, I'd like an option to do something like

lm(a~b,data=fit$model[fit$model$country!="UK",])

Please help, and thank you so much!

I am assuming the problem is to identify which rows of the original data frame were used by an lm model that was fit on a subset of those rows and without using all of the columns.

Regarding the characterization of row names in the question, I would not regard their use as negative at all. The row names are an intrinsic part of every data frame and are intended to identify the rows. If you identify the rows with case names, these case names will be shown by many functions, including case.names(fm), model.frame(fm), model.matrix(fm), cooks.distance(fm), hatvalues(fm), influence(fm), plot(fm), etc., so it is highly desirable to use them. This is how the software was intended to work, so it is highly advisable to go with the case names approach to simplify everything.

1) Thus, if the country names are unique identifiers of the cases, they can be maintained as case names by simply assigning them to the row names. We omit USA rather than UK to make the example harder: USA does not come at the end as UK does, so the result cannot be mistaken for simply returning the first two case names.

df <- data.frame(a = c(1,2,3),  b = c(2,3,1000), country = c("Malawi", "USA","UK"))

rownames(df) <- df$country

fm <- lm(a ~ b, df)
fm2 <- update(fm, subset = country != "USA")  # omit USA
# or:  update(fm, subset = case.names(fm) != "USA")

case.names(fm2)
## [1] "Malawi" "UK"    
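
With the row names set as in (1), diagnostics are labelled by country as well, which covers the "find out which countries were outliers" use case from the question. The following is a minimal sketch, not part of the original answer: the 4/n cutoff for Cook's distance is only a common rule of thumb, and the names cd, flagged and fm_robust are chosen here for illustration.

```r
df <- data.frame(a = c(1, 2, 3), b = c(2, 3, 1000),
                 country = c("Malawi", "USA", "UK"))
rownames(df) <- df$country   # case names are now the country names

fm <- lm(a ~ b, df)

# Cook's distances come back as a vector named by case name
cd <- cooks.distance(fm)

# flag influential cases by name (4/n is an assumed rule-of-thumb cutoff)
flagged <- names(cd)[cd > 4 / length(cd)]

# refit without the flagged countries, subsetting by name instead of row index
fm_robust <- update(fm, subset = !(country %in% flagged))
case.names(fm_robust)
```

Because the subset is expressed in terms of country, it stays valid even if df is reordered or filtered before refitting.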

2) Although (1) seems preferable, another possibility, which works even if we don't assign the country column to the row names, is to look up the case names in the original data frame:

df <- data.frame(a = c(1,2,3),  b = c(2,3,1000), country = c("Malawi", "USA","UK"))

fm <- lm(a ~ b, df)
fm2 <- update(fm, subset = country != "USA")  # omit USA

df[ case.names(fm2), ]
##   a    b country
## 1 1    2  Malawi
## 3 3 1000      UK

or as a function:

# first arg is lm object
# second arg is full data frame - data frame used in lm call if unspecified
# third arg is envir where full data frame stored - current envir if unspecified
extractData <- function(mod, data, envir = parent.frame()) {
  if (missing(data)) data <- eval(mod$call$data, envir)
  data[ case.names(mod), ]
}

# test

extractData(fm2)
##   a    b country
## 1 1    2  Malawi
## 3 3 1000      UK
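
The same lookup should also cover rows that lm() drops because of missing values, since case.names() only reports the cases that were actually used in the fit. A small sketch under stated assumptions: the data frame df_na and its "Kenya" row are made up for illustration, na.action = na.omit is passed explicitly so the result does not depend on the session's options, and extractData is repeated so the block runs on its own.

```r
# same helper as above
extractData <- function(mod, data, envir = parent.frame()) {
  if (missing(data)) data <- eval(mod$call$data, envir)
  data[ case.names(mod), ]
}

# hypothetical data: the response is missing for Kenya
df_na <- data.frame(a = c(1, 2, NA, 4), b = c(2, 3, 5, 1000),
                    country = c("Malawi", "USA", "Kenya", "UK"))

fm_na <- lm(a ~ b, df_na, na.action = na.omit)

used <- extractData(fm_na)               # rows that entered the fit
setdiff(df_na$country, used$country)     # countries dropped from the fit
```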
