Summing/Counting equal values/rows of a data frame
I have a data frame that contains a bunch of individual trip data, with a start and end station ID for each trip.
I'm trying to make a second data frame that has all the info rearranged per station. So for example, if there are 50 trips with start_station_id == 12 in the first data frame, then the row for station_id 12 in the second data frame would have its "starts" column equal to 50.
Currently I figured a for loop would be the best method for this, but I can't seem to crack it:
for(i in range(station_ids)){
stationData$starts[i] <- sum(data$start_station_id[i] == station_ids[i])
}
This produces the following error:
Error in `$<-.data.frame`(`*tmp*`, starts, value = c(0, 0, 0, 0, 0, 0, :
replacement has 370 rows, data has 369
station_ids is a variable that contains each unique station ID, stationData$starts is where I want the number of starts stored, and data is the original data I am trying to run the for loop over.
Is there an easier way to complete this operation, or am I just writing the for loop wrong? Any tips would be super helpful.
From what I understand of your question, you're trying to count the occurrences of each station_id. This can be easily achieved with the table function, which returns a table object, i.e. a named vector containing the counts, with the station_id values as the names.
R
table(data$start_station_id)
data.frame(table(data$start_station_id)) #if you prefer the data.frame look
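To make the shape of the result concrete, here is a minimal sketch on made-up data. The toy values and the column names station_id and starts are assumptions for illustration, not part of the question's real data.

```r
# Hypothetical toy data standing in for the asker's `data`
data <- data.frame(
  start_station_id = c(12, 12, 7, 3, 12),
  end_station_id   = c(7, 3, 12, 12, 7)
)

# Named counts: the names are the station IDs (as character strings)
counts <- table(data$start_station_id)
counts[["12"]]   # number of trips that start at station 12

# Same information as a data.frame with friendlier column names
# (the defaults would be Var1 and Freq)
starts <- as.data.frame(counts, stringsAsFactors = FALSE)
names(starts) <- c("station_id", "starts")
starts
```

Note that the IDs come back as character (or factor) values, so convert them with as.numeric() if you need numeric IDs again.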
If you want the number of occurrences merged into your original data.frame, you can inner join the two data.frames using the merge function:
tbl.df <- data.frame(table(data$start_station_id))
colnames(tbl.df)[1] <- "start_station_id"
data <- merge(data, tbl.df)
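The merge above attaches the count to every trip row. If what you want is one row per station, as described in the question, a sketch along these lines may be closer. It assumes toy data, and the ends column is a hypothetical addition to show how stations with zero starts can still be kept.

```r
# Hypothetical toy data; station 9 only ever appears as an end station
data <- data.frame(
  start_station_id = c(12, 12, 7, 3, 12),
  end_station_id   = c(7, 3, 12, 12, 9)
)

# Every station that appears as a start or an end
station_ids <- sort(unique(c(data$start_station_id, data$end_station_id)))

# factor(..., levels = station_ids) makes table() report a 0 for
# stations that never occur, so both columns have the same length
stationData <- data.frame(
  station_id = station_ids,
  starts = as.vector(table(factor(data$start_station_id, levels = station_ids))),
  ends   = as.vector(table(factor(data$end_station_id,   levels = station_ids)))
)
stationData
```

Fixing the factor levels up front is what avoids the length mismatch: both counts are computed over the same set of stations.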
data.table
data.table::setDT(data)
data[, `Number of rows` := .N, by = start_station_id]
:= is a data.table operator that creates new columns, .N gives the number of rows in the current group, and by specifies which column to group by. This automatically adds the column Number of rows to the data.table. For an introduction to the data.table package, check the vignette. Of the approaches shown here, this one is the fastest.
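As a small sketch of the difference between adding the count to every row and collapsing to one row per station (toy data; the names trips, n_starts and per_station are made up for this example, and the data.table package is assumed to be installed):

```r
library(data.table)

# Hypothetical toy data
trips <- data.table(start_station_id = c(12, 12, 7, 3, 12))

# := adds the group size to every row, modifying trips in place
trips[, n_starts := .N, by = start_station_id]

# Without :=, the same grouping returns one row per station instead
per_station <- trips[, .(starts = .N), by = start_station_id]
per_station
```

The second form is probably the one that matches the per-station table asked for in the question.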
The approach is feasible, but you should not put your new data into a data.frame while you don't yet know the row count. Just use a list and convert it to a data.frame at the end:
stationData = list("station" = unique(c(data$start_station_id, data$end_station_id)), ## Create a list of all stations
                   "starts" = c())
## Loop over the positions 1..n of the station vector
for(i in seq_along(stationData$station)){
  s = stationData$station[i]
  ## Count the trips whose start station is s
  stationData$starts[i] <- sum(data$start_station_id == s)
}
stationData = as.data.frame(stationData)
At the end of the for loop both columns will have the same length, so converting the list to a data.frame will work without problems.
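As for why the loop in the question fails: in R, range() returns only the minimum and maximum of a vector, not a sequence of indices, so the loop most likely ran with i set to the smallest and then the largest station ID. If the largest ID is 370, assigning to starts[370] stretches the column past the 369 rows of the data frame, which would match the error shown. That reading is an assumption based on the error message. A sketch of the loop with that fixed, on toy data:

```r
# Hypothetical toy data
data <- data.frame(start_station_id = c(12, 12, 7, 3, 12))
station_ids <- unique(data$start_station_id)

range(station_ids)       # min and max only: 3 12
seq_along(station_ids)   # the positions to loop over: 1 2 3

# Preallocate one row per station, then fill in the counts
stationData <- data.frame(station_id = station_ids, starts = 0)
for (i in seq_along(station_ids)) {
  # Compare the whole column against one station ID, then count the TRUEs
  stationData$starts[i] <- sum(data$start_station_id == station_ids[i])
}
stationData
```

The other change from the question's loop is comparing the whole data$start_station_id column to station_ids[i], rather than only its i-th element.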
A much easier way, though, would be to use the table() function, which counts the trips per starting station automatically and was already proposed by Abdessabour Mtk.