R Error in names(x) <- value depending on range box plots in a loop

Question

I have a large dataset with 270 columns and 17392 rows. Of those 270 columns, I need to select 235. The rows can be grouped by 'Site', which is a numeric identifier (e.g., 1, 2, etc.; 111 different Sites in total). Each column constitutes a 'region'. Here is a small example (note that the real data has many more columns and subjects):

SubjID LLatVent RLatVent FullSurfArea Site
Subj1  1580.6    2345      180980      1
Subj2  4803.8    2232      210003      1
Subj3  14936     1456      198045      2
Subj4  14556     1200      176079      2

My goal is to calculate the number of outliers per region, grouped by Site, and print a csv file with the result. My code works if I use 1.5*IQR, but I get an error if I use 2.5*IQR, and I don't understand why. The error is:

Error in names(x) <- value: 'names' attribute [235] must be the same length as the vector [1]

My code attempt (which fails):

# start
ALL <- read.csv("ALL.csv")

# get columns of interest (235)
start <- which(colnames(ALL) == "LLatVent")
end <- which(colnames(ALL) == "FullSurfArea")

# create vector with these column numbers
regions <- start:end

# divide by site (111 sites in total)
df_list <- split(ALL, as.factor(ALL$Site))

# loop through regions and save SubjID in data frame outliers_subjID
for (j in df_list) {
  outliers_subjID_list <- list()
  count <- 0
  for (i in regions) {
    count <- count + 1
    OutVals <- boxplot(j[, i], plot = FALSE, range = 2.5)$out
    outliers_subjID_list[[count]] <- j$SubjID[which(j[, i] %in% OutVals)]
  }
  n.obs <- sapply(outliers_subjID_list, length)
  seq.max <- seq_len(max(n.obs))
  outliers_subjID <- as.data.frame(sapply(outliers_subjID_list, "[", i = seq.max))
  colnames(outliers_subjID) <- colnames(j)[regions]

  # write csv files
  write.csv(outliers_subjID, paste0(unique(j$Site), ".csv"))
}

Why do I get an error when I use range=2.5? The same happens if I use boxplot.stats(as.matrix(j[,i]), coef=2.5)$out.

Also, I want to calculate the total number of outliers per region, after they have been calculated by Site. At the moment I am binding all the csv files and then using summarise_all to calculate the number of observations per region, but I feel like there is a smarter way.

Many thanks in advance, please let me know if I can provide more information.

Answer 1

It's probably because some regions don't have any outliers at the wider 2.5 × IQR threshold, so OutVals comes back empty for them. You could avoid the error by guarding the assignment with an if statement, so that it is skipped when there is nothing to store.

for (i in regions){
  count <- count + 1 # maybe move this
  OutVals <- boxplot(j[,i], plot=FALSE, range=2.5)$out

  if(length(OutVals)>0)  # <-- add this line
    outliers_subjID_list[[count]] <- j$SubjID[which(j[,i] %in% OutVals)]
}

Since I don't have your data, I couldn't test this. You may need to modify the code slightly; for example, the count increment may need to move inside the if statement.
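One mechanism that would produce exactly this message is sketched below. It is not a confirmed diagnosis, because the original data is not available. If the largest number of outliers in any region of a site is exactly 1, every element pulled out by sapply(..., "[", i = seq.max) has length 1, so sapply simplifies the result to a plain vector and as.data.frame turns that into a single column. Assigning one name per region then fails. The toy list, subject IDs, and the second site's counts are made up for illustration; the padding step is just one possible way to always get one column per region. The last lines touch on the second part of the question: the per-site n.obs vectors can be summed directly, without going through CSV files.

```r
# Toy stand-in for outliers_subjID_list: three regions, at most one outlier each.
# The subject IDs are hypothetical.
ids <- list(character(0), "Subj3", character(0))
region_names <- c("LLatVent", "RLatVent", "FullSurfArea")

n.obs <- sapply(ids, length)        # outliers per region for this site
seq.max <- seq_len(max(n.obs))      # here just 1

# Every extracted element has length 1, so sapply returns a vector, not a matrix
collapsed <- sapply(ids, "[", i = seq.max)
bad <- as.data.frame(collapsed)
dim(bad)                            # 3 rows, 1 column
# colnames(bad) <- region_names     # would fail with the names(x) <- value error

# Pad each element with NA to a common length and build the columns explicitly
padded <- lapply(ids, function(x) {
  x <- as.character(x)
  length(x) <- max(n.obs)           # extends with NA where needed
  x
})
names(padded) <- region_names
good <- as.data.frame(padded, stringsAsFactors = FALSE)
dim(good)                           # 1 row, 3 columns

# Totals per region across sites: sum the per-site count vectors.
# The second vector is a hypothetical n.obs from another site.
site_counts <- list(n.obs, c(2, 0, 1))
totals <- Reduce(`+`, site_counts)
names(totals) <- region_names
totals
```

In the original loop this would mean collecting n.obs for each site in a list (instead of, or in addition to, writing the CSV) and reducing that list once the loop finishes.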
