简体   繁体   中英

How to sum values in a column associated with duplicates in another column in R?

Working in R. I have a data frame Taxi rides that I've sorted by vehicle id and then time stamp (earliest to latest) so that it looks as below:

Taxi = arrange(Taxi, Taxi$vehicle_id, Taxi$timestamp)

Image of Data frame example in Excel

在此输入图像描述

You will notice the "distance" column covers distances between each ride for every taxi vehicle. I would like to utilize a formula that sums the distances for each Taxi. Likewise I would repeat this process on the time stamp subtracting the latest date from the earliest date to calculate the time on the road for each vehicle. My hope is to use both the sum distance and time difference to calculate speed for each taxi. So the final product would be a matrix that looks like the following (pending conversion to meters and seconds):

Image of final product Example in Excel

在此输入图像描述

I believe a for loop in R would be useful here, identifying each vehicle id, running through the associated values for distance and then summing them, but I'm unsure where to start. I started with a data set of unique Taxi vehicle Ids to help R identify them. Below is what i have so far for summing the distances; I'm not sure how to tell R to sum each vehicle ID range:

Taxi.uniques = unique(Taxi$vehicle_id)
for (i in Taxi) {
    look.at = Taxi$vehicle_id == i 
    sum(all distances for a vehicle ID)
} 

And here is what i have for subtracting the latest timestamp for each vehicle by it's earliest timestamp:

Taxi.uniques = unique(Taxi$vehicle_id)
for (i in Taxi) {
    look.at = Taxi$vehicle_id == i 
    max(Taxi$timestamp[look.at]) - min(Taxi$timestamp[look.at]) 
}

Not sure If I'm on the right track and would appreciate any help!

If you aren't opposed to using data.table, you could do this in a single step.

require(data.table)
taxi<-data.table(vehicleID=c('Taxi1','Taxi2','Taxi3','Taxi1','Taxi2','Taxi3'),
                 timestamp=Sys.time()+sample(60:600,6),
                 distance=sample(1:50,6))

taxi[vehicleID %in% c('Taxi1','Taxi3'),list(TimeonRoad=max(timestamp)-min(timestamp),
           Distance=sum(distance)),
     by='vehicleID']

data.table structure is DT[i, j, by] . i is used for subsetting by condition, j is used for running different operations on the variables or selecting columns, by is for grouping. Adding the subset to i will get you what you need.

Here's a solution using dplyr (much more readable, IMHO)

library(dplyr)

taxi %>%
  group_by(vehicleID) %>%
  summarize(
    time = max(timestamp) - min(timestamp),
    distance = sum(distance)
  )

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM