Calculate the mean of different subsets based on the same ID in a dataframe and store the results in a new dataframe in R

Question

My dataframe (hh_dist_points) has the following structure:

hh_dist_points <- read.table(header=TRUE ,text="
  hhid VillageID hhid_1 VillageI_1    NEAR_DIST
  2739 405050508   2730  405050508 8.300739e+01
  2739 405050508   2588  405050508 9.717326e+01
  2739 405050508   2825  405050508 1.335821e+02
  2739 405050508   2823  405050508 1.631118e+02
  2739 405050508   2729  405050508 1.964680e+02
  2739 405050508   2810  405050508 2.243312e+02
  2739 405050508   2828  405050508 2.889768e+02
  2739 405050508   2725  405050502 8.808605e+02
  2739 405050508   2822  405050502 9.084585e+02
  2739 405050508   2731  405050502 9.222313e+02
  2739 405050508   2742  405050502 9.681594e+02
  2739 405050508   2741  405050502 1.026474e+03")

The original dataset contains ca. 2000 observations (1 observation = a house in a village, identified by hhid). Houses that belong to the same village have the same VillageID (ca. 10 observations per ID). NEAR_DIST is the geodetic distance between two houses. The dataframe above shows the distance from each house (hhid) to every other house in my dataset (hhid_1), over 3 million rows in total.

My objective: calculate the mean of NEAR_DIST for each group of observations (hhid) that share the same VillageID, and store the result in a new dataframe:

VillageID   dist_mean
405050508   963,257416
405050502   823,21464
.....       .........

General idea: if VillageID equals VillageI_1, calculate the mean of NEAR_DIST and store the result in a new dataframe.

My idea was to use a loop:

if(hh_dist_points$VillageID = hh_dist_points$VillageI_1) {
hh_dist_new <- mean(hh_dist$NEAR_DIST)
}
else

I know this isn't correct (and it is unfinished), but I don't know how to finish it. Any ideas on how to solve this simply, maybe without loops? I searched for answers and solutions but haven't found any.

I need the dataframe for other calculations. Many thanks for your help.

Answer 1

Although you can do it in base R, it is easy to do with data.table:

library(data.table)


hh_dist_points <- read.table(header=TRUE ,text="
      hhid VillageID hhid_1 VillageI_1    NEAR_DIST
      2739 405050508   2730  405050508 8.300739e+01
      2739 405050508   2588  405050508 9.717326e+01
      2739 405050508   2825  405050508 1.335821e+02
      2739 405050508   2823  405050508 1.631118e+02
      2739 405050508   2729  405050508 1.964680e+02
      2739 405050508   2810  405050508 2.243312e+02
      2739 405050508   2828  405050508 2.889768e+02
      2739 405050508   2725  405050502 8.808605e+02
      2739 405050508   2822  405050502 9.084585e+02
      2739 405050508   2731  405050502 9.222313e+02
      2739 405050508   2742  405050502 9.681594e+02
      2739 405050508   2741  405050502 1.026474e+03")


dt <- data.table(hh_dist_points)
dt[VillageID==VillageI_1,mean(NEAR_DIST,na.rm=TRUE),.(VillageID)]

#  VillageID       V1
# 1: 405050508 169.5215
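For comparison, the base R route mentioned above could look roughly like this. It is only a sketch on the sample data from the question; the names same_village and village_means are made up for illustration and do not appear in the original post.

```r
# Sample data copied from the question
hh_dist_points <- read.table(header = TRUE, text = "
  hhid VillageID hhid_1 VillageI_1    NEAR_DIST
  2739 405050508   2730  405050508 8.300739e+01
  2739 405050508   2588  405050508 9.717326e+01
  2739 405050508   2825  405050508 1.335821e+02
  2739 405050508   2823  405050508 1.631118e+02
  2739 405050508   2729  405050508 1.964680e+02
  2739 405050508   2810  405050508 2.243312e+02
  2739 405050508   2828  405050508 2.889768e+02
  2739 405050508   2725  405050502 8.808605e+02
  2739 405050508   2822  405050502 9.084585e+02
  2739 405050508   2731  405050502 9.222313e+02
  2739 405050508   2742  405050502 9.681594e+02
  2739 405050508   2741  405050502 1.026474e+03")

# Keep only pairs of houses that lie in the same village
same_village <- subset(hh_dist_points, VillageID == VillageI_1)

# One mean per village; aggregate() returns a data frame
village_means <- aggregate(NEAR_DIST ~ VillageID, data = same_village, FUN = mean)
names(village_means)[2] <- "dist_mean"
village_means
```

On the sample data this yields a single row for village 405050508, matching the data.table result above.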

Answer 2

If I understand you right this will work:

library(dplyr)
# One mean per (VillageID, VillageI_1) combination
newDF <- hh_dist_points %>%
    group_by(VillageID, VillageI_1) %>%
    summarize(average = mean(NEAR_DIST))

This creates a new data frame called newDF with your VillageID and VillageI_1 columns, plus a column named average holding the mean of NEAR_DIST for each VillageID and VillageI_1 combination.

Then you can use:

finalDF <- newDF[newDF$VillageID == newDF$VillageI_1, ]

This keeps only the rows where the two ID columns are equal. It avoids a loop, is fast, and the logic is easy to follow.
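The same two-step logic (one mean per ID pair, then keep the matching pairs) can also be sketched in base R, which avoids any package dependency. This is just an illustration on four rows taken from the question; pair_means and same_village_means are hypothetical names introduced here.

```r
# Four rows from the question: two same-village pairs, two cross-village pairs
hh_dist_points <- read.table(header = TRUE, text = "
  hhid VillageID hhid_1 VillageI_1    NEAR_DIST
  2739 405050508   2730  405050508 8.300739e+01
  2739 405050508   2588  405050508 9.717326e+01
  2739 405050508   2725  405050502 8.808605e+02
  2739 405050508   2822  405050502 9.084585e+02")

# Step 1: one mean per (VillageID, VillageI_1) combination
pair_means <- aggregate(NEAR_DIST ~ VillageID + VillageI_1,
                        data = hh_dist_points, FUN = mean)
names(pair_means)[3] <- "average"

# Step 2: keep only combinations where both houses are in the same village
same_village_means <- pair_means[pair_means$VillageID == pair_means$VillageI_1, ]
same_village_means
```

With around 3 million rows, filtering to same-village pairs before aggregating is probably cheaper, since it skips the means for cross-village pairs that get discarded anyway.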

If I misunderstood you and that is not what you are looking for, shoot me a comment explaining how, and I will refine the answer.

Answer 3

You could try something like this:

library(dplyr)

new_data <- hh_dist_points %>%
  # keep only pairs of houses in the same village
  # (filter() replaces the deprecated filter_())
  filter(VillageID == VillageI_1) %>%
  group_by(VillageID) %>%
  summarise(dist_mean = mean(NEAR_DIST, na.rm = TRUE))

solution1
0 2017-05-20 19:11:22

solution2
0 2017-05-20 19:13:04

solution3
0 ACCEPTED 2017-05-20 22:26:47