Summing/Counting equal values/rows of a data frame

I have a dataframe that contains a bunch of individual trip data, with a start and end station ID for each trip.

I'm trying to make a second dataframe that has all the info rearranged by station. For example, if there are 50 trips with start_station_id == 12 in the first dataframe, then the row for station_id 12 in the second dataframe would have its "starts" column equal to 50.

Currently I figured a for loop would be the best method for this, but I can't seem to crack it:

for(i in range(station_ids)){
   stationData$starts[i] <- sum(data$start_station_id[i] == station_ids[i])
}

This produces the following error:

Error in `$<-.data.frame`(`*tmp*`, starts, value = c(0, 0, 0, 0, 0, 0,  : 
  replacement has 370 rows, data has 369

station_ids is a variable that contains each unique station ID, stationData$starts is where I want the number of starts stored, and data is the original data I am trying to run the for loop over.

Is there an easier way to complete this operation, or am I just writing the for loop wrong? Any tips would be super helpful.

From what I understand of your question, you're trying to count the occurrences of each station_id. This can be easily achieved with the table function, which returns a table object, i.e. a named vector with the counts as values and the station_ids as names.

base R

table(data$start_station_id)
data.frame(table(data$start_station_id)) #if you prefer the data.frame look
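
As a minimal sketch of what table() returns, assuming a small made-up data frame (the name trips and its values are illustrative, not from the question):

```r
# Toy data standing in for the question's trip data (values are invented)
trips <- data.frame(
  start_station_id = c(12, 12, 7, 3, 12, 7),
  end_station_id   = c(7, 3, 12, 12, 7, 3)
)

# Count how many trips start at each station
starts <- table(trips$start_station_id)

# Look up one station by name; the names are the station IDs as strings
starts[["12"]]

# Convert to a data frame with explicit column names
starts_df <- as.data.frame(starts, stringsAsFactors = FALSE)
names(starts_df) <- c("station_id", "starts")
```

Note that after the conversion station_id is a character column; wrap it in as.numeric() if numeric IDs are needed downstream.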

If you want the number of occurrences merged into your old data.frame, you can inner join the two data.frames using the merge function:

tbl.df <- data.frame(table(data$start_station_id))
colnames(tbl.df)[1] <- "start_station_id"
data <- merge(data, tbl.df)
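
The question ultimately asks for one row per station rather than a count attached to every trip. A sketch of that, on invented toy data and with assumed result columns starts and ends: fixing the factor levels to the full set of station IDs makes table() report zero for stations that never appear as a start (or as an end).

```r
# Toy data (invented); station 5 only ever appears as an end station
trips <- data.frame(
  start_station_id = c(12, 12, 7, 3, 12, 7),
  end_station_id   = c(7, 3, 12, 12, 7, 5)
)

# Every station that appears as a start or an end
station_ids <- sort(unique(c(trips$start_station_id, trips$end_station_id)))

# factor(..., levels = station_ids) keeps zero counts for unused stations
stationData <- data.frame(
  station_id = station_ids,
  starts = as.vector(table(factor(trips$start_station_id, levels = station_ids))),
  ends   = as.vector(table(factor(trips$end_station_id,   levels = station_ids)))
)
```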

data.table

data.table::setDT(data)
data[, `Number of rows` := .N, by = start_station_id]

:= is a data.table operator that creates new columns, .N gives the number of rows in the current group, and by specifies which column to group by. This automatically adds the column Number of rows to the data.table. For an introduction to the data.table package, check the vignette. This is typically the fastest of these approaches.
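
If one row per station is wanted instead of a new column on every trip, the same .N and by can be used in an aggregating call. A sketch on invented toy data (the names trips_dt, starts_dt, and starts are illustrative):

```r
library(data.table)

# Toy data standing in for the question's trip data (values are invented)
trips_dt <- data.table(
  start_station_id = c(12, 12, 7, 3, 12, 7),
  end_station_id   = c(7, 3, 12, 12, 7, 3)
)

# One row per start station, with the number of trips in column `starts`
starts_dt <- trips_dt[, .(starts = .N), by = start_station_id]

# Sort by station ID for readability (modifies starts_dt in place)
setorder(starts_dt, start_station_id)
```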

The approach is feasible, but you should not put your fresh data into a data.frame, as you don't know the row count yet. Just use a list and convert it to a data.frame at the end:

stationData <- list(
  station = unique(c(data$start_station_id, data$end_station_id)),  # all stations
  starts  = integer(0)                                              # filled in below
)

for (i in seq_along(stationData$station)) {
  s <- stationData$station[i]
  stationData$starts[i] <- sum(data$start_station_id == s)  # trips starting at s
}
stationData <- as.data.frame(stationData)

At the end of the for loop, both columns will have the same length, so converting the list to a data.frame will not be a problem.
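
As background on the original error, a sketch on made-up data: in R, range() returns only the minimum and maximum of its argument, not a sequence of indices as in Python, so the question's loop was most likely indexing by station ID values rather than by position. seq_along() gives the positions, and vapply() does the same counting without growing a vector inside a loop. The vectors station_ids and start_ids below are invented stand-ins.

```r
station_ids <- c(3, 7, 12)
start_ids   <- c(12, 12, 7, 3, 12, 7)  # stands in for data$start_station_id

range(station_ids)       # minimum and maximum only
seq_along(station_ids)   # positions 1, 2, 3

# Count starts per station; vapply checks that each result is a single integer
starts <- vapply(station_ids, function(s) sum(start_ids == s), integer(1))
```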

A much easier way, though, would be to use the table() function, which computes the number of starts per station automatically and was already proposed by Abdessabour Mtk.
