简体   繁体   中英

Multiple NA's for the last variables of linear regression model in R

I am trying to run a linear regression model where I have dummy variables in my data to indicate if a certain predictor variable is not present. I have a total of 15 predictor variables.

No matter the order of my predictor variables, the last five variables always result in NA .

This problem is almost exactly the same as the one asked here: linear regression "NA" estimate just for last coefficient

I tried adding -1 or +0 to the code

lm(H~id11+id21+id22+id23+id24+id31+id41+id42+id43+id52+id71+id81+id82+id90+id95, data=macro.shed)

And that resulted in only one less value being NA . So now I have 4, instead of 5, predictor variables being NA .

I am reading in my data from csv documents.

This is my code:

watershed = read.csv("nlcd_2000_watershed.csv") macro_2000 = read.csv("wapp_macro_2000.csv") temp1 = matrix(watershed$Area,ncol=15,byrow=T) nlcd_watershed = data.frame(cbind(unique(watershed$WaterID),temp1)) names(nlcd_watershed)=c("WaterID",paste("id",unique(watershed$Value),sep="")) macro.shed = merge(macro_2000,nlcd_watershed,by.x="WaterID",by.y="WaterID") data.frame(unique(watershed$Value),unique(watershed$NLCD))

This is my data for macro.shed:

dput(macro.shed)
structure(list(WaterID = c(1L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 
9L, 10L, 10L, 10L, 10L, 10L, 11L), ID = structure(c(1L, 16L, 
2L, 9L, 10L, 11L, 12L, 13L, 15L, 8L, 3L, 4L, 5L, 6L, 7L, 14L), .Label = c("L1", 
"L10", "L11", "L12", "L13", "L14", "L15", "L16", "L2", "L3", 
"L4", "L5", "L6", "L7", "L8", "L9"), class = "factor"), Date = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "8/20/2001", class = "factor"), 
    UTMX = c(607308L, 607112L, 598526L, 592235L, 603094L, 597749L, 
    605523L, 608668L, 600517L, 601806L, 597548L, 593815L, 591453L, 
    607187L, 606851L, 589528L), UTMY = c(4639040L, 4643780L, 
    4622470L, 4608350L, 4629780L, 4623340L, 4634330L, 4636950L, 
    4628160L, 4630380L, 4621720L, 4611960L, 4607960L, 4636480L, 
    4636020L, 4605120L), Watershed = structure(c(1L, 1L, 2L, 
    3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 10L, 10L, 10L, 10L, 11L), .Label = c("Cold Spring Creek", 
    "Drake Brook", "Dutchess County Airport", "East Branch Wappinger", 
    "Great Spring Creek", "Grist Mill Creek", "Hunns Lake Creek", 
    "Little Wappinger", "Upton Lake Creek", "Wappinger Creek", 
    "Wappinger Falls"), class = "factor"), richness = c(37L, 
    20L, 32L, 14L, 23L, 20L, 23L, 28L, 25L, 32L, 31L, 30L, 23L, 
    33L, 19L, 19L), H = c(0.9, 1, 0.9, 0.8, 1, 0.8, 0.7, 1, 1, 
    1, 1, 1, 1, 1, 0.9, 1), EPT = c(18L, 14L, 13L, 3L, 15L, 12L, 
    15L, 19L, 15L, 21L, 17L, 16L, 13L, 20L, 13L, 12L), DOM = c(62.1, 
    61.5, 64.1, 73.7, 53.4, 74, 80.3, 59.2, 55.6, 56.8, 57.4, 
    59.4, 54.2, 59.8, 66, 52.2), PMA = c(58.1, 51, 59.3, 39.9, 
    58.4, 45.2, 54.5, 75.3, 56.2, 64.3, 66, 53.7, 55.6, 60.4, 
    52.3, 42.4), FBI = c(3.8, 3.4, 4, 3.9, 3.6, 4.2, 5.2, 3.8, 
    3.5, 4.1, 3.7, 3.7, 4, 3.8, 3.5, 3.6), BAP = c(8.3, 6.8, 
    7.8, 3.9, 7.4, 6, 6.8, 8.4, 7.5, 8.2, 8.3, 7.8, 6.8, 8.3, 
    6.6, 6), Insects.sample = c(7123L, 516L, 2061L, 1341L, 921L, 
    961L, 580L, 1567L, 1180L, 4226L, 4133L, 1400L, 2325L, 2596L, 
    687L, 609L), id11 = c(216900L, 216900L, 4923900L, 131400L, 
    1806300L, 0L, 41945400L, 250200L, 200700L, 1908000L, 4500L, 
    4500L, 4500L, 4500L, 4500L, 25427700L), id21 = c(83700L, 
    83700L, 1163700L, 1290600L, 0L, 0L, 11841300L, 2824200L, 
    110700L, 136800L, 9000L, 9000L, 9000L, 9000L, 9000L, 9145800L
    ), id22 = c(111600L, 111600L, 596700L, 7245000L, 63900L, 
    11700L, 7293600L, 5060700L, 323100L, 179100L, 55800L, 55800L, 
    55800L, 55800L, 55800L, 3876300L), id23 = c(413100L, 413100L, 
    611100L, 1817100L, 0L, 0L, 11107800L, 208800L, 1713600L, 
    33300L, 204300L, 204300L, 204300L, 204300L, 204300L, 6268500L
    ), id24 = c(239400L, 239400L, 4547700L, 193500L, 26100L, 
    10800L, 48636900L, 88200L, 1139400L, 41400L, 16200L, 16200L, 
    16200L, 16200L, 16200L, 14818500L), id31 = c(63900L, 63900L, 
    14319000L, 526500L, 139500L, 0L, 58785300L, 398700L, 1723500L, 
    73800L, 0L, 0L, 0L, 0L, 0L, 31161600L), id41 = c(384300L, 
    384300L, 4142700L, 0L, 86400L, 0L, 9641700L, 357300L, 3166200L, 
    392400L, 0L, 0L, 0L, 0L, 0L, 963900L), id42 = c(729000L, 
    729000L, 508500L, 209700L, 13500L, 0L, 4072500L, 682200L, 
    2137500L, 31500L, 10800L, 10800L, 10800L, 10800L, 10800L, 
    3993300L), id43 = c(1224000L, 1224000L, 1266300L, 1532700L, 
    0L, 418500L, 6607800L, 695700L, 1356300L, 10800L, 78300L, 
    78300L, 78300L, 78300L, 78300L, 5419800L), id52 = c(16200L, 
    16200L, 57600L, 600300L, 17100L, 0L, 1730700L, 958500L, 120600L, 
    101700L, 20700L, 20700L, 20700L, 20700L, 20700L, 0L), id71 = c(22500L, 
    22500L, 780300L, 208800L, 5400L, 0L, 1139400L, 533700L, 7085700L, 
    582300L, 0L, 0L, 0L, 0L, 0L, 198000L), id81 = c(221400L, 
    221400L, 3398400L, 0L, 1649700L, 0L, 287100L, 155700L, 6300900L, 
    1511100L, 13500L, 13500L, 13500L, 13500L, 13500L, 264600L
    ), id82 = c(665100L, 665100L, 1513800L, 41400L, 447300L, 
    0L, 3083400L, 132300L, 616500L, 53100L, 2943900L, 2943900L, 
    2943900L, 2943900L, 2943900L, 931500L), id90 = c(2142000L, 
    2142000L, 826200L, 215100L, 0L, 17705700L, 630000L, 1156500L, 
    590400L, 15300L, 4598100L, 4598100L, 4598100L, 4598100L, 
    4598100L, 311400L), id95 = c(4628700L, 4628700L, 113400L, 
    4897800L, 0L, 10526400L, 358200L, 2281500L, 1431900L, 33300L, 
    4982400L, 4982400L, 4982400L, 4982400L, 4982400L, 0L)), .Names = c("WaterID", 
"ID", "Date", "UTMX", "UTMY", "Watershed", "richness", "H", "EPT", 
"DOM", "PMA", "FBI", "BAP", "Insects.sample", "id11", "id21", 
"id22", "id23", "id24", "id31", "id41", "id42", "id43", "id52", 
"id71", "id81", "id82", "id90", "id95"), row.names = c(NA, -16L
), class = "data.frame")

How do I make it so that the last variables are not resulting in NA s?

You're trying to fit 14 predictors (15 if you include an intercept) with only 16 observations.

That's not enough data to calculate that many parameters, which is why you're only getting estimates for some of them.

You'll need to use some sort of regularisation or model selection, but even then your estimates will be sensitive to the method you choose.

To add on to the answer provided by @Pete, many of your variables have high collinearity. To visualize this easily,

library(corrplot)

corPlot <- cor( macro.shed[, c(15:29)])

corPlot <- cor(x)
corrplot(corPlot, method = "number")

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM