AIM: I want to automate (loop) the code below, without having to manually run it for each sample. I have a terrible habit of writing long-hand in base R, and need to start using loops, which I find difficult to implement.
DATA: I have two data frames: one of the sample data ( samples ), and one of reference data ( ref ). They both contain the same variables ( x, y, z ).
CODE DESCRIPTION: For each sample (samples$sample_name), I want to calculate its Euclidean distance to each case in the reference data. The results are then used to re-order the reference data, to show which points are 'closest' to the sample data point in the Euclidean (3-dimensional) space.
My current code allows me to simply substitute the sample name (e.g. "s1") and then re-run the code, making one final change for the filename of the .csv file. The output is a list of the reference data in order of closest proximity to the sample (in the Euclidean space).
I would like to automate the process (into a loop?), so that I can simply run it on the two data frames using the list of sample names (samples$sample_name), and hopefully also automate the exporting to a .csv file.
Any help would be greatly appreciated!
# Reference data
country<-c("Austria","Austria","Italy","Italy","Turkey","Romania","France")
x<-c(18.881,18.881,18.929,19.139,19.008,19.083,18.883)
y<-c(15.627,15.627,15.654,15.772,15.699,15.741,15.629)
z<-c(38.597,38.597,38.842,39.409,39.048,39.224,38.740)
pb_age<-c(-106,-106,-87,-6,-55,-26,-104)
ref<-data.frame(country,x,y,z,pb_age) # Reference data
# Sample data (for euclidean measurements against Reference data)
sample_name<-c("s1","s2","s3")
x2<-c(18.694,18.729,18.731)
y2<-c(15.682,15.683,15.677)
z2<-c(38.883,38.989,38.891)
pb_age2<-c(120,97,82)
samples<-data.frame(sample_name,x2,y2,z2,pb_age2) # Sample data
colnames(samples)<-c("sample_name","x","y","z","pb_age") # To match Reference data headings
# Euclidean distance measurements
library(fields) # Need package for Euclidean distances
# THIS IS WHAT I WANT TO AUTOMATE/LOOP (BELOW)...
# Currently, I have to update the 'id' for each sample to get a result (for each sample)
id<-"s1" # Sample ID - this is simply changed so the following code can be re-run for each sample
# The code
x1<-samples[which(samples$sample_name==id),c("x","y","z")]
x2<-ref[,c("x","y","z")]
result_distance<-rdist(x1,x2) # Computing the Euclidean distance
result_distance<-as.vector(result_distance) # Saving the results as a vector
euclid_ref<-data.frame(result_distance,ref) # Creating a new data.frame adding the Euclidean distances to the original Reference data
colnames(euclid_ref)[1]<-"euclid_distance" # Updating the column name for the result
# Saving and exporting the results
results<-euclid_ref[order(euclid_ref$euclid_distance),] # Re-ordering the data.frame by the Euclidean distances, smallest to largest
write.csv(results, file="s1.csv") # Ideally, I want the file name to be the same as the SAMPLE id, i.e. s1, s2, s3...
A loop would be simple enough, but a more R-like solution would be to take advantage of vectorization and the apply-family of functions:
result_distances <- data.frame(t(rdist(samples[, 2:4], ref[, 2:4])), ref)
colnames(result_distances)[1:3] <- rep("euclid_distance", 3)
# str(result_distances)
# 'data.frame': 7 obs. of 8 variables:
# $ euclid_distance: num 0.346 0.346 0.24 0.695 0.355 ...
# $ euclid_distance: num 0.424 0.424 0.25 0.594 0.286 ...
# $ euclid_distance: num 0.334 0.334 0.205 0.666 0.319 ...
# $ country : chr "Austria" "Austria" "Italy" "Italy" ...
# $ x : num 18.9 18.9 18.9 19.1 19 ...
# $ y : num 15.6 15.6 15.7 15.8 15.7 ...
# $ z : num 38.6 38.6 38.8 39.4 39 ...
# $ pb_age : num -106 -106 -87 -6 -55 -26 -104
Normally we would not give multiple columns the same name, but we are planning to pull them out next:
results <- lapply(1:3, function(i) data.frame(result_distances[order(result_distances[, i]), c(i, 4:8)]))
names(results) <- samples$sample_name
Now we have a list called results with three data frames named "s1", "s2", and "s3". Lists make it easy to apply a function to many sets of similarly organized data. For example, results[["s1"]] or results[[1]] prints the data frame for the first sample. Now we write out the results:
sapply(names(results), function(x) write.csv(results[[x]], file=paste0(x, ".csv")))
This will create 3 files, "s1.csv", "s2.csv", "s3.csv".
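The per-sample step can also be wrapped in a small function, which keeps the lapply call short and gives the loop body a name. The following is only a minimal sketch under some assumptions: rank_ref_by_distance is a hypothetical helper (not part of fields or of the answers here), the distance is computed in base R with sweep() and rowSums() instead of rdist() so that no extra package is needed, and the files are written to tempdir() purely for illustration.

```r
# Reference and sample data, as defined in the question
ref <- data.frame(
  country = c("Austria", "Austria", "Italy", "Italy", "Turkey", "Romania", "France"),
  x = c(18.881, 18.881, 18.929, 19.139, 19.008, 19.083, 18.883),
  y = c(15.627, 15.627, 15.654, 15.772, 15.699, 15.741, 15.629),
  z = c(38.597, 38.597, 38.842, 39.409, 39.048, 39.224, 38.740),
  pb_age = c(-106, -106, -87, -6, -55, -26, -104)
)
samples <- data.frame(
  sample_name = c("s1", "s2", "s3"),
  x = c(18.694, 18.729, 18.731),
  y = c(15.682, 15.683, 15.677),
  z = c(38.883, 38.989, 38.891),
  pb_age = c(120, 97, 82)
)

# Hypothetical helper: reference rows ordered by distance to one sample
rank_ref_by_distance <- function(id, samples, ref, vars = c("x", "y", "z")) {
  # coordinates of the chosen sample as a plain numeric vector
  point <- unlist(samples[samples$sample_name == id, vars])
  # subtract the sample point from every reference row (column-wise)
  diffs <- sweep(as.matrix(ref[, vars]), 2, point)
  # Euclidean distance = square root of the summed squared differences
  out <- data.frame(euclid_distance = sqrt(rowSums(diffs^2)), ref)
  out[order(out$euclid_distance), ]
}

# One ordered data frame per sample, in a list named by sample
ids <- as.character(samples$sample_name)
results <- lapply(setNames(ids, ids), rank_ref_by_distance,
                  samples = samples, ref = ref)

# Write one CSV per sample; tempdir() is an assumption, use any folder you like
out_dir <- tempdir()
for (id in names(results)) {
  write.csv(results[[id]], file = file.path(out_dir, paste0(id, ".csv")),
            row.names = FALSE)
}
```

Passing row.names = FALSE is a design choice: it drops the original reference row numbers from the file, so leave it out if you want to keep them.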
Here's a loop that computes Euclidean distances for all samples from the reference data locations, using your original input data and the key parts of your code. It's a little more verbose than the vectorised-apply solution, but perhaps is a bit easier to read because it is less terse and nested. The final output is a single data frame.
# prepare an empty list object to store the results
output <- vector("list", length = nrow(samples))
# this is the start of the loop
for (i in seq_len(nrow(samples))) {
  # we can read this as 'for row i of the samples dataframe, do this...'
  # get coords for sample i
  sample_coords <- samples[i, c("x", "y", "z")]
  # get coords for all reference locations
  # this line would be fine above the loop
  # since it gives the same result for each
  # iteration. I place it here to echo your
  # original workflow
  ref_coords <- ref[, c("x", "y", "z")]
  # compute Euclidean distance and coerce to vector
  e_dist_vec <- as.vector(rdist(sample_coords, ref_coords))
  # store in data frame
  e_dist_ref_df <- data.frame(e_dist_vec, ref)
  # update colname
  colnames(e_dist_ref_df)[1] <- "euclid_distance"
  # order df by euclid_distance values
  results <- e_dist_ref_df[order(e_dist_ref_df$euclid_distance), ]
  # store results for sample i in the list
  output[[i]] <- results
} # this is the end of the loop
# assign sample names to list items
names(output) <- samples$sample_name
At this point we have a list of data frames (one per sample), which you could write to individual CSVs, one file per data frame (as in @dcarlson's answer), or we can continue and put them all in one data frame for downstream analysis, etc. Here's how the list output from the loop looks:
> output
$s1
euclid_distance country x y z pb_age
3 0.2401874 Italy 18.929 15.654 38.842 -87
7 0.2428559 France 18.883 15.629 38.740 -104
1 0.3461069 Austria 18.881 15.627 38.597 -106
2 0.3461069 Austria 18.881 15.627 38.597 -106
5 0.3551197 Turkey 19.008 15.699 39.048 -55
6 0.5206563 Romania 19.083 15.741 39.224 -26
4 0.6948388 Italy 19.139 15.772 39.409 -6
$s2
euclid_distance country x y z pb_age
3 0.2499000 Italy 18.929 15.654 38.842 -87
5 0.2856186 Turkey 19.008 15.699 39.048 -55
7 0.2977129 France 18.883 15.629 38.740 -104
1 0.4241509 Austria 18.881 15.627 38.597 -106
2 0.4241509 Austria 18.881 15.627 38.597 -106
6 0.4288415 Romania 19.083 15.741 39.224 -26
4 0.5936506 Italy 19.139 15.772 39.409 -6
$s3
euclid_distance country x y z pb_age
3 0.2052657 Italy 18.929 15.654 38.842 -87
7 0.2195655 France 18.883 15.629 38.740 -104
5 0.3191583 Turkey 19.008 15.699 39.048 -55
1 0.3338203 Austria 18.881 15.627 38.597 -106
2 0.3338203 Austria 18.881 15.627 38.597 -106
6 0.4887627 Romania 19.083 15.741 39.224 -26
4 0.6661929 Italy 19.139 15.772 39.409 -6
Often it's convenient to have everything in a single data frame for further analysis. Here's one way to do that:
# bind list dfs into one big data frame, not sure what the one-line equivalent in base R is
output_df <- dplyr::bind_rows(output, .id = "sample_id")
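On the base R question raised in the code comment: a rough equivalent can be put together with Map() and do.call(rbind, ...), although it takes more than one line. This is a minimal sketch, assuming a small toy stand-in for the output list (the values are illustrative only, not the real distances):

```r
# Toy stand-in for the `output` list built by the loop (illustrative values)
output <- list(
  s1 = data.frame(euclid_distance = c(0.1, 0.2), country = c("A", "B")),
  s2 = data.frame(euclid_distance = c(0.3, 0.4), country = c("B", "A"))
)

# Prepend each list name as a sample_id column, then stack the data frames
with_id <- Map(function(df, id) cbind(sample_id = id, df), output, names(output))
output_df <- do.call(rbind, with_id)

# rbind builds row names such as "s1.1", "s1.2"; reset them to 1, 2, 3, ...
rownames(output_df) <- NULL
```

The result has the same shape as the bind_rows() version: a sample_id column first, followed by the original columns.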
Here's how the final product looks:
> output_df
sample_id euclid_distance country x y z pb_age
1 s1 0.2401874 Italy 18.929 15.654 38.842 -87
2 s1 0.2428559 France 18.883 15.629 38.740 -104
3 s1 0.3461069 Austria 18.881 15.627 38.597 -106
4 s1 0.3461069 Austria 18.881 15.627 38.597 -106
5 s1 0.3551197 Turkey 19.008 15.699 39.048 -55
6 s1 0.5206563 Romania 19.083 15.741 39.224 -26
7 s1 0.6948388 Italy 19.139 15.772 39.409 -6
8 s2 0.2499000 Italy 18.929 15.654 38.842 -87
9 s2 0.2856186 Turkey 19.008 15.699 39.048 -55
10 s2 0.2977129 France 18.883 15.629 38.740 -104
11 s2 0.4241509 Austria 18.881 15.627 38.597 -106
12 s2 0.4241509 Austria 18.881 15.627 38.597 -106
13 s2 0.4288415 Romania 19.083 15.741 39.224 -26
14 s2 0.5936506 Italy 19.139 15.772 39.409 -6
15 s3 0.2052657 Italy 18.929 15.654 38.842 -87
16 s3 0.2195655 France 18.883 15.629 38.740 -104
17 s3 0.3191583 Turkey 19.008 15.699 39.048 -55
18 s3 0.3338203 Austria 18.881 15.627 38.597 -106
19 s3 0.3338203 Austria 18.881 15.627 38.597 -106
20 s3 0.4887627 Romania 19.083 15.741 39.224 -26
21 s3 0.6661929 Italy 19.139 15.772 39.409 -6
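Because the rows for each sample are already sorted by distance, the single closest reference point per sample is just the first row of each group. A minimal sketch in base R, using an abbreviated copy of the printed output (only the first two rows per sample and three of the columns, for illustration):

```r
# Abbreviated copy of output_df: first two rows per sample from the output above
output_df <- data.frame(
  sample_id = c("s1", "s1", "s2", "s2", "s3", "s3"),
  euclid_distance = c(0.2401874, 0.2428559, 0.2499000,
                      0.2856186, 0.2052657, 0.2195655),
  country = c("Italy", "France", "Italy", "Turkey", "Italy", "France")
)

# duplicated() flags every repeat of a sample_id, so negating it keeps
# only the first (that is, the nearest) row for each sample
nearest <- output_df[!duplicated(output_df$sample_id), ]
```

This relies on the ordering done inside the loop; if the rows were shuffled afterwards, they would need to be re-sorted by sample_id and euclid_distance first.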