
Split a data frame in R and apply function to each part

I have a large data frame with 5 columns and thousands of rows. The relevant columns of the data frame 'd' look like this:

Material  Input_Wt  Price
   1        10       13
   3         6       18
   1         9       12
   2        12       15
   3         4        8
   1        14       10

I need to perform regression on the data to predict the price of each material for different input weights. The regression technique to apply depends on the number of records for each unique material number, so I need to process all the records for a given material together.

So I split the data by material number into multiple CSV files and saved them in the working directory using this code:

# Split the data frame into a list of data frames, one per material number
SPLIT.DATA <- split(d, d$Material, drop = FALSE)

# Write each piece to its own CSV file named after the material number
lapply(names(SPLIT.DATA), function(nm)
  write.csv(SPLIT.DATA[[nm]], paste0(nm, ".csv"), row.names = FALSE, quote = FALSE))

The files look like:

Material  Input_Wt  Price
   1         10       13
   1          9       12
   1         14       10

Material  Input_Wt  Price
   2         12       15 

Material  Input_Wt  Price
   3         6        18
   3         4         8

I then collected the names of all these files in R using:

fileNames <- Sys.glob("*.csv")

and applied the appropriate function to each file separately, appending the output to a single file:

for (fileName in fileNames){
  inp <- read.csv(fileName, header = TRUE, sep = ",")
  # Pick the technique based on how many records the material has
  if (nrow(inp) == 3){
    print(RandomForest())
  } else if (nrow(inp) == 2){
    print(KNN())
  } else if (nrow(inp) == 1){
    print("Insufficient Data")
  }
}

'KNN' and 'RandomForest' are separate functions which I have defined.

I ultimately get the desired output as:

Material  Input_Wt  Price Predicted_Price
   1         10       13       14.5
   1          9       12       13.8
   1         14       10        9.2
   2         12       15       16.1
   3         6        18       17.5
   3         4         8        9.7

The problem is that this approach is not efficient: I first have to split the data frame and write it out to multiple CSV files, and then read them back into R one by one to process them.

Is there a way to do this entire process directly, without writing the data frames to CSV files and reading them back in?

Your title is essentially the definition of by (an object-oriented wrapper of tapply), which, unlike split, takes a function argument. Consider defining a function that receives a data frame as a parameter and calling it with by.

my_func <- function(inp){
  # Pick the technique based on the number of rows in this material's subset
  if (nrow(inp) == 3){
    obj <- RandomForest()
  } else if (nrow(inp) == 2){
    obj <- KNN()
  } else if (nrow(inp) == 1){
    obj <- "Insufficient Data"
  }
  print(obj)

  return(obj)
}

obj_list <- by(df, df$Material, my_func)
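
If my_func is written so that it returns each group's rows with a Predicted_Price column attached (rather than just the fitted model object), the pieces returned by by can be stitched back into the single combined output shown in the question. A minimal sketch under that assumption:

# Assumes my_func returns each group's rows with a Predicted_Price column added,
# so obj_list is a list of data frames, one per material
result <- do.call(rbind, obj_list)
rownames(result) <- NULL   # drop the composite row names created by rbind
result

This keeps everything in memory and produces one combined data frame without any intermediate CSV files.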

Don't split your data frame; just use a subsetting statement:

df[df$Material == 1,]
subset(df, df$Material == 1)

or with the dplyr package:

library(dplyr)

df %>%
  filter(Material == 1)
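
Either subsetting form keeps everything in memory, so the question's loop over CSV files can be rewritten as a loop over the unique material numbers. A rough sketch, assuming RandomForest() and KNN() are the author's own functions exactly as called in the question:

results <- list()
for (m in unique(df$Material)) {
  inp <- df[df$Material == m, ]   # in-memory subset replaces read.csv()
  if (nrow(inp) == 3) {
    results[[as.character(m)]] <- RandomForest()
  } else if (nrow(inp) == 2) {
    results[[as.character(m)]] <- KNN()
  } else if (nrow(inp) == 1) {
    results[[as.character(m)]] <- "Insufficient Data"
  }
}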

If you want to apply a function based on the number of entries per group, try something like

df %>%
  group_by(Material) %>%
  mutate(Predicted_Price = case_when(n() == 3 ~ "RandomForest()",
                                     n() == 2 ~ "KNN()",
                                     n() == 1 ~ "Insufficient Data"))
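
The case_when() above only records which technique would be used as text; to actually fit something per group, a function has to be applied to each subset. One possible sketch using dplyr's group_modify(), with a placeholder prediction standing in for the author's RandomForest()/KNN() calls:

library(dplyr)

fit_group <- function(inp, key) {
  # Placeholder prediction -- swap in RandomForest()/KNN() here,
  # choosing the technique from nrow(inp) as in the question
  inp$Predicted_Price <- if (nrow(inp) >= 2) mean(inp$Price) else NA_real_
  inp
}

df %>%
  group_by(Material) %>%
  group_modify(fit_group) %>%
  ungroup()

group_modify() passes each group's rows to fit_group and binds the returned data frames back together, so the result is one data frame with a row per original record.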
