Calculate similarity within a dataframe across specific rows (R)
I have a dataframe that looks something like this:
df <- data.frame("index" = 1:10, "title" = c("Sherlock","Peaky Blinders","Eastenders","BBC News", "Antiques Roadshow","Eastenders","BBC News","Casualty", "Dragons Den","Peaky Blinders"), "date" = c("01/01/20","01/01/20","01/01/20","01/01/20","01/01/20","02/01/20","02/01/20","02/01/20","02/01/20","02/01/20"))
The output looks like this:
Index Title Date
1 Sherlock 01/01/20
2 Peaky Blinders 01/01/20
3 Eastenders 01/01/20
4 BBC News 01/01/20
5 Antiques Roadshow 01/01/20
6 Eastenders 02/01/20
7 BBC News 02/01/20
8 Casualty 02/01/20
9 Dragons Den 02/01/20
10 Peaky Blinders 02/01/20
I want to be able to determine the number of times that a title appears on different dates. In the example above, "BBC News", "Peaky Blinders" and "Eastenders" all appear on both 01/01/20 and 02/01/20. The similarity between the two dates is therefore 60% (3 out of 5 titles are identical across both dates).
It's probably also worth mentioning that the actual dataframe is much larger: it has 120 titles per day and spans some 700 days. I need to compare the titles of each date with those of the previous date and then calculate their similarity. So to be clear, I need to determine the similarity of 01/01/20 with 02/01/20, 02/01/20 with 03/01/20, 03/01/20 with 04/01/20, and so on.
Does anyone have any idea how I might go about doing this? My eventual aim is to use Tableau to visualise similarity/difference over time, but I fear that such a calculation would be too complicated for that particular software and that I'll have to somehow add it to the actual data itself.
I have come up with this solution. However, I'm unsure how it will behave when the number of records per day differs (for example, 8 titles on day n and 15 titles on day n+1). I guess you would want to normalize with respect to the day with more records, so that is what the code does. Anyway, here it is:
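# Split the data frame into one piece per date (pieces follow the sorted levels of df$date)
divide <- split.data.frame(df, as.factor(df$date))
similarity <- numeric(0)
# seq_len() gives an empty sequence when there is only one date, unlike 1:(n - 1)
for (i in seq_len(length(divide) - 1)) {
  # Titles of day i that reappear on day i + 1, normalised by the larger of the two days
  shared <- sum(divide[[i]]$title %in% divide[[i + 1]]$title)
  larger <- max(length(divide[[i]]$title), length(divide[[i + 1]]$title))
  similarity <- c(similarity, shared / larger)
}
similarity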
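One caveat when splitting on the raw date column: the strings are in dd/mm/yy form, so once the data spans more than one month the groups come out in alphabetical rather than chronological order ("01/02/20" sorts before "02/01/20"). Below is a minimal sketch of one way around that, assuming the dates really are day/month/two-digit-year. The helper name `day_similarity` is made up for illustration, and counting each title once per day via `unique()` is an assumption, not something the question specifies. It returns a data frame, which may be easier to export for Tableau than a bare vector.

```r
# Example data from the question
df <- data.frame(
  index = 1:10,
  title = c("Sherlock", "Peaky Blinders", "Eastenders", "BBC News",
            "Antiques Roadshow", "Eastenders", "BBC News", "Casualty",
            "Dragons Den", "Peaky Blinders"),
  date = rep(c("01/01/20", "02/01/20"), each = 5)
)

# Assumption: dates are day/month/two-digit-year strings
df$date <- as.Date(df$date, format = "%d/%m/%y")

# day_similarity is a hypothetical helper, not part of the original answer
day_similarity <- function(data) {
  # Splitting on a Date column orders the groups chronologically
  titles <- split(data$title, data$date)
  n <- length(titles)
  if (n < 2) {
    return(data.frame(date = as.Date(character(0)), similarity = numeric(0)))
  }
  sim <- vapply(seq_len(n - 1), function(i) {
    a <- unique(titles[[i]])       # assumption: a title counts once per day
    b <- unique(titles[[i + 1]])
    # Shared titles, normalised by the larger of the two days
    length(intersect(a, b)) / max(length(a), length(b))
  }, numeric(1))
  # Each row is labelled with the later date of the pair being compared
  data.frame(date = as.Date(names(titles)[-1]), similarity = sim)
}

result <- day_similarity(df)
result
```

Note that this compares each date with the previous date present in the data, so a gap in the calendar (a missing day) would be compared across the gap.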
Here is another possibility. You can create a simple function to calculate the similarity, or some other index, between groups. Then split the titles by date into a list and use lapply to apply the custom function to each element of the list (the final result will be a list).
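# Number of date i's titles that also appear on the previous date,
# divided by the previous date's title count (s is created below)
calc_similar <- function(i) {
  sum(s[[i]] %in% s[[i - 1]]) / length(s[[i - 1]])
}
# One character vector of titles per date
s <- split(df$title, df$date)
# Apply to every date except the first and name each result after its date
setNames(lapply(seq_along(s)[-1], calc_similar), names(s)[-1])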
Output
$`2020-01-02`
[1] 0.6
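Since the question mentions adding the result to the data itself for Tableau, here is a minimal sketch of one way to attach the per-date values as a column. It assumes the date column has been converted with `as.Date()` using a day/month/year format (which would also be consistent with the ISO-style name in the output above, though that is a guess). It uses `vapply()` instead of `lapply()` so the result is a named numeric vector, and the column name `similarity_to_previous` is made up for illustration.

```r
# Example data from the question
df <- data.frame(
  index = 1:10,
  title = c("Sherlock", "Peaky Blinders", "Eastenders", "BBC News",
            "Antiques Roadshow", "Eastenders", "BBC News", "Casualty",
            "Dragons Den", "Peaky Blinders"),
  date = rep(c("01/01/20", "02/01/20"), each = 5)
)

# Assumption: dates are day/month/two-digit-year strings
df$date <- as.Date(df$date, format = "%d/%m/%y")

# One character vector of titles per date, in chronological order
s <- split(df$title, df$date)

# Same calculation as calc_similar, but collected into a numeric vector
sim <- vapply(seq_along(s)[-1], function(i) {
  sum(s[[i]] %in% s[[i - 1]]) / length(s[[i - 1]])
}, numeric(1))
names(sim) <- names(s)[-1]

# Look up each row's date by name; the first date has no previous day and gets NA
# (similarity_to_previous is a hypothetical column name)
df$similarity_to_previous <- unname(sim[as.character(df$date)])
df
```

Every row of a given date carries the same value, so in Tableau the column can be aggregated per date with an average or a minimum without changing the number.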