`residuals(lm.fit)` in R seems to indicate incorrect number of rows in dataframe

I'm analyzing the Auto MPG dataset in R. The dataset is available in the ISLR package and also in the UC Irvine repository.

When I run `residuals(lm.fit)`, the output seems to indicate that there are 397 rows, but `dim(Auto)` and `str(Auto)` both tell me there are only 392 rows.

Can anyone explain to me why this is the case and what this means? Is it an error in my code?

    install.packages('ISLR')
    library(ISLR)
    dim(Auto) # 392 9
    str(Auto) # 'data.frame': 392 obs. of 9 variables: ...
    Auto$origin = as.factor(Auto$origin)

    # I use the lm() function to perform a simple 
    # linear regression with mpg as the response 
    # and horsepower as the predictor.
    lm.fit <- lm(mpg~horsepower, data=Auto)
    lm.fit # gives the coefficients as expected
    summary(lm.fit) # gives residuals, etc. as expected 

    # Here's where my question arises. I decide to 
    # explore the residuals:
    residuals(lm.fit) 

    # It outputs what looks to be a list of residuals 
    # labeled 1 to 397. But the Auto df actually has only 
    # 392 rows.
    1
    -1.41604568519558
    2
    1.10851998218221
    ...
    396
    0.533872913768169
    397
    4.00740711382913

My first guess, since there seem to be 5 extra rows, is that the first 5 values are Min 1Q Median 3Q Max. But this is not the case.

To explore, I paste the list into Excel and it turns out there are only 392 rows, even though the first is labeled 1 and the last is labeled 397.

After analyzing the output more closely I see that the results skip over 33, 127, 331, 337, and 355. That is, the 33rd observation is labeled 34, the 127th is labeled 129, and so on, thus ending at 397 instead of 392.

    > dim(Auto)
    [1] 392   9
    > length(residuals(lm.fit))
    [1] 392

There is no error: the row names just don't line up with the row positions. I suspect that those rows contained `NA`s and were dropped before the data was packaged, since the dataset now contains only complete cases:

    > sum(!complete.cases(Auto))
    [1] 0

This is more or less confirmed by this Kaggle link, where you'll see `?` in the rows that you listed.
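To see the same effect without the ISLR package, here is a minimal sketch on a made-up data frame (the `toy` values are invented for illustration and are not taken from `Auto`). Dropping incomplete rows keeps the original row names, and `residuals()` labels its output with those names rather than with positions:

```r
# Toy data, invented for illustration: rows 3 and 6 have a missing horsepower.
toy <- data.frame(
  mpg        = c(18, 15, 16, 17, 24, 22, 26, 21),
  horsepower = c(130, 165, NA, 150, 95, NA, 90, 110)
)

# Remove incomplete rows; the surviving rows keep their original row names.
clean <- na.omit(toy)

fit <- lm(mpg ~ horsepower, data = clean)
res <- residuals(fit)

length(res)  # 6 residuals, one per remaining row
names(res)   # "1" "2" "4" "5" "7" "8": labels skip the dropped rows
```

The last label is 8 even though there are only 6 residuals, which is the same pattern as labels running to 397 for 392 rows.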

Read the documentation:

Description: Gas mileage, horsepower, and other information for 392 vehicles.

The orginal data contained 408 observations but 16 observations with missing values were removed.

Source: This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition.

There's a useful lesson here. You should always study the nature of your data before doing any analysis. The link to the Kaggle page has some hilarious errors, including the number of rows as well as misspelling the name of the ISLR package. So verify, verify, verify. (Don't even trust.)
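As one way to do that kind of check, the sketch below lists which integer row labels are absent from a data frame and then renumbers the rows. `missing_row_labels` is a hypothetical helper written for this answer, not a function from base R or ISLR, and it assumes the row names are integer-like, as they are in `Auto`. The `toy` data is invented for illustration.

```r
# Hypothetical helper: which integer row labels between 1 and the largest
# label do not appear in the data frame? Assumes integer-like row names.
missing_row_labels <- function(df) {
  labels <- as.integer(rownames(df))
  setdiff(seq_len(max(labels)), labels)
}

# Toy data, invented for illustration: rows 3 and 6 are incomplete.
toy <- data.frame(
  mpg        = c(18, 15, 16, 17, 24, 22, 26, 21),
  horsepower = c(130, 165, NA, 150, 95, NA, 90, 110)
)
clean <- na.omit(toy)

gaps <- missing_row_labels(clean)
gaps  # 3 6

# Optional: renumber the rows so that labels match positions again.
rownames(clean) <- NULL
rownames(clean)  # "1" "2" "3" "4" "5" "6"
```

Calling `missing_row_labels(Auto)` should, if the question's reading of the output is right, return the labels the residuals skip. Resetting the row names is a design choice: it makes labels and positions agree, but it also discards the link back to the row numbers of the original file.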
