My dataframe (hh_dist_points) has the following structure:
hh_dist_points <- read.table(header=TRUE ,text="
hhid VillageID hhid_1 VillageI_1 NEAR_DIST
2739 405050508 2730 405050508 8.300739e+01
2739 405050508 2588 405050508 9.717326e+01
2739 405050508 2825 405050508 1.335821e+02
2739 405050508 2823 405050508 1.631118e+02
2739 405050508 2729 405050508 1.964680e+02
2739 405050508 2810 405050508 2.243312e+02
2739 405050508 2828 405050508 2.889768e+02
2739 405050508 2725 405050502 8.808605e+02
2739 405050508 2822 405050502 9.084585e+02
2739 405050508 2731 405050502 9.222313e+02
2739 405050508 2742 405050502 9.681594e+02
2739 405050508 2741 405050502 1.026474e+03")
The original dataset contains ca. 2000 observations (1 observation = a house in a village, hhid). Houses which belong to the same village have the same VillageID (ca. 10 observations with the same ID). NEAR_DIST is the geodetic distance between two houses (hhid). The dataframe above shows the distance of each house (hhid) to all other houses in my dataset (hhid_1), together over 3 million rows.
My objective: calculate the mean of NEAR_DIST for each group of observations (hhid) based on the same VillageID and store the result in a new dataframe:
VillageID dist_mean
405050508 963,257416
405050502 823,21464
..... .........
General idea: if VillageID equals VillageI_1, then calculate the mean of NEAR_DIST and store the result in a new dataframe.
My idea was to use a loop:
if(hh_dist_points$VillageID = hh_dist_points$VillageI_1) {
hh_dist_new <- mean(hh_dist$NEAR_DIST)
}
else
I know this isn't correct (and it is unfinished), but I don't know how to finish it. Any ideas how to solve this simply (maybe without using loops)? I searched for answers and solutions but haven't found any.
I need the dataframe for other calculations. Many thanks for your help.
Although you can do this in base R, it is easy to do with data.table:
library(data.table)
hh_dist_points <- read.table(header=TRUE ,text="
hhid VillageID hhid_1 VillageI_1 NEAR_DIST
2739 405050508 2730 405050508 8.300739e+01
2739 405050508 2588 405050508 9.717326e+01
2739 405050508 2825 405050508 1.335821e+02
2739 405050508 2823 405050508 1.631118e+02
2739 405050508 2729 405050508 1.964680e+02
2739 405050508 2810 405050508 2.243312e+02
2739 405050508 2828 405050508 2.889768e+02
2739 405050508 2725 405050502 8.808605e+02
2739 405050508 2822 405050502 9.084585e+02
2739 405050508 2731 405050502 9.222313e+02
2739 405050508 2742 405050502 9.681594e+02
2739 405050508 2741 405050502 1.026474e+03")
dt <- data.table(hh_dist_points)
# Keep only within-village pairs, then average NEAR_DIST per village
dt[VillageID == VillageI_1, mean(NEAR_DIST, na.rm = TRUE), by = .(VillageID)]
#    VillageID       V1
# 1: 405050508 169.5215
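For comparison, the base R route mentioned above could look roughly like the following. This is only a minimal sketch: it uses a three-row sample of the question's data so that it runs on its own, and the names same_village and dist_means are made up for illustration.

```r
# Three-row sample of the question's data (an assumption, to keep the sketch self-contained)
hh_dist_points <- read.table(header = TRUE, text = "
hhid VillageID hhid_1 VillageI_1 NEAR_DIST
2739 405050508 2730 405050508 8.300739e+01
2739 405050508 2588 405050508 9.717326e+01
2739 405050508 2725 405050502 8.808605e+02")

# Keep only the pairs where both houses lie in the same village
same_village <- hh_dist_points[hh_dist_points$VillageID == hh_dist_points$VillageI_1, ]

# Average NEAR_DIST per village; aggregate() returns a plain data frame
dist_means <- aggregate(NEAR_DIST ~ VillageID, data = same_village, FUN = mean)
names(dist_means)[2] <- "dist_mean"
dist_means
```

On the full data the same two steps apply unchanged. The formula interface of aggregate() drops rows with missing values by default, which plays the same role as na.rm = TRUE in the data.table call.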
If I understand you correctly, this will work:
library(dplyr)
# Mean distance for every (VillageID, VillageI_1) combination
newDF <- hh_dist_points %>%
  group_by(VillageID, VillageI_1) %>%
  summarize(average = mean(NEAR_DIST))
This will create a new data frame called newDF with your VillageID and VillageI_1 columns, plus a column named average holding the mean of NEAR_DIST for each VillageID and VillageI_1 combination.
Then you can use:
finalDF <- newDF[newDF$VillageID == newDF$VillageI_1, ]
This subsets newDF, keeping only the rows where the two ID columns are equal. It keeps you out of a loop, is pretty fast, and the logic is easy to follow.
If I misunderstood you and that is not what you are looking for, shoot me a comment explaining how, and I will refine the answer.
You could try something like this:
library(dplyr)
new_data <- hh_dist_points %>%
  # filter() replaces the deprecated filter_()
  filter(VillageID == VillageI_1) %>%
  group_by(VillageID) %>%
  summarise(dist_mean = mean(NEAR_DIST, na.rm = TRUE))