[英]Function that subsets a dataframe to the “best square” in R
My objective is to write a function in R that takes a dataframe as input and returns the "best square subset" of it. 我的目标是在R中编写一个将数据框作为输入并返回其“最佳平方子集”的函数。
By best square subset I mean that the output needs to confirm the below: 最好的平方子集,我的意思是输出需要确认以下内容:
Let's take the three following examples: 让我们以以下三个示例为例:
example1 <- structure(list(Afternoon = c(20800L, 15254L, 17426L, 4391L, 39194L
), Evening = c(21679L, 0L, 2973L, 37L, 435L), Morning = c(0L,
3726L, 0L, 0L, 0L)), .Names = c("Afternoon", "Evening", "Morning"
), row.names = c("Friday", "Monday", "Thursday", "Tuesday", "Wednesday"
), class = "data.frame")
example2 <- structure(list(Afternoon = c(1227364L, 219402L, 3L, 0L, 530891L,
153124L, 281788L), Evening = c(570618L, 167216L, 31L, 10L, 88702L,
161006L, 42L), Morning = c(0L, 121775L, 0L, 0L, 0L, 25133L, 270162L
)), .Names = c("Afternoon", "Evening", "Morning"), row.names = c("Friday",
"Monday", "Saturday", "Sunday", "Thursday", "Tuesday", "Wednesday"
), class = "data.frame")
example3 <- structure(list(Afternoon = c(20800L, 258L, 300L, 563L, 2000L
), Evening = c(21679L, 0L, 2973L, 37L, 435L), Morning = c(0L,
3726L, 0L, 0L, 0L)), .Names = c("Afternoon", "Evening", "Morning"
), row.names = c("Friday", "Monday", "Thursday", "Tuesday", "Wednesday"
), class = "data.frame")
That look like this: 看起来像这样:
> example1
Afternoon Evening Morning
Friday 20800 21679 0
Monday 15254 0 3726
Thursday 17426 2973 0
Tuesday 4391 37 0
Wednesday 39194 435 0
> example2
Afternoon Evening Morning
Friday 1227364 570618 0
Monday 219402 167216 121775
Saturday 3 31 0
Sunday 0 10 0
Thursday 530891 88702 0
Tuesday 153124 161006 25133
Wednesday 281788 42 270162
> example3
Afternoon Evening Morning
Friday 20800 21679 0
Monday 258 0 3726
Thursday 300 2973 0
Tuesday 563 37 0
Wednesday 2000 435 0
The function I'm looking for should subset the above 3 examples to the following 3 respectively: 我正在寻找的功能应将上述3个示例分别分为以下3个:
> output1
Afternoon
Friday 20800
Monday 15254
Thursday 17426
Tuesday 4391
Wednesday 39194
The score/area of the square is 5. Anything else would be less. 正方形的得分/面积是5。 For instance, selecting Friday,Thursday Afternoon evening would yield a score of 4
例如,选择“星期五”,“星期四”,“下午”,晚上的得分为4
> output2
Afternoon Evening
Friday 1227364 570618
Monday 219402 167216
Thursday 530891 88702
Tuesday 153124 161006
Here, someone's first thought would be to select all Monday, Tuesday & wednesday times and get a score of 9. However, Wendesday evenings's 42 validates the first criterion. 在这里,某人的第一个想法是选择星期一,星期二和星期三的所有时间,并获得9分。但是,Wendesday晚间的42分验证了第一个标准。 Monday & Tuesday all days and times would yield a score/area of 6
星期一和星期二的所有天和时间总分/区域为6
> output3
Afternoon Evening
Friday 20800 21679
Here we have two possible squares that validate the first 2 criteria: Friday afternoon and evening or Friday and Wednesday afternoon. 在这里,我们有两个可能的正方形可以验证前两个条件:星期五下午和晚上或星期五和星期三下午。 We have to go with the first choice as the sum inside the cells is higher than in the second case.
我们必须选择第一种方法,因为单元格内的总和要高于第二种情况。 This rule is applied only in the case of ties.
此规则仅在平局的情况下适用。
The most intuitive solution would be to go through all possible combinations of the rows and columns and check whether the selected rows and columns form a full square or not, if yes, then check if that form the largest number possible. 最直观的解决方案是遍历行和列的所有可能组合,并检查选定的行和列是否形成一个完整的正方形,如果是,则检查是否形成最大的正方形。 A potential problem with this approach would be that if you have many columns and rows, it will take ages to finish which is not optimal.
这种方法的潜在问题是,如果您有许多列和行,则需要花费一定的时间才能完成,这不是最佳选择。 My answer to the question which comes here work reasonably well (on my PC with 16GB RAM, 2.7 GHz CPU, and Windows 10pro 64 bit, R version 3.5.1) when number of columns and rows are not very larger than 12.
当列和行的数量不大于12时,我对这里出现的问题的回答相当有效(在具有16GB RAM,2.7 GHz CPU和Windows 10pro 64位,R版本3.5.1的PC上)
#library(gtools)
find_best_square <- function(x, thresh = 2000){
# x <- example1
x[x<thresh] <- 0
# for larger datasets only: removing lonely cells
if (ncol(x) > 7 | nrow(x)> 7){
for (i in 1:nrow(x)){
for (j in 1:ncol(x)){
if((colSums(x[,j,drop=F]) == x[i,j]) & (rowSums(x[i,,drop=F])==x[i,j])) x[i, j] <- 0L
}
}
}
# remove columns with no data
is_colZero <- colSums(x==0)== nrow(x)
if(any(is_colZero)) print(paste('this column is empty and removed: ', which(is_colZero)))
x <- x[,!is_colZero]
# remove rows with no data
is_rowZero <- rowSums(x==0)==ncol(x)
if(any(is_rowZero)) print(paste('this row is empty and removed: ', which(is_rowZero)))
x <- x[!is_rowZero,]
n <- ncol(x)
m <- nrow(x)
max_size <- 0L
max_sum <- 0L
jump_i <- 0L
jump_j <- 0L
for (i in n:1){ # cols
# all possible combination
next_max <- m * (i-1)
if(max_size!=0 & next_max < max_size & i * m < max_size) {
jump_i <- jump_i + 1
next()
}
comb_col <- combinations(n,i)
for (k in 1:nrow(comb_col)){
col <- as.integer(comb_col[k,])
for(j in m:1){ # rows
if (i*j < max_size ) {
jump_j <- jump_j +1
next()
}
comb_row <- combinations(m,j)
for (l in 1:nrow(comb_row)){
row <- as.integer(comb_row[l,])
y <- x[row, col, drop=F]
if(all(y > 0) & max_size <= length(row)*length(col)){
if(max_size == length(row)*length(col)){
if(sum(y) > max_sum){
max_size <- length(row) * length(col)
max_cols <- col
max_rows <- row
max_sum <- sum(y)}
} else {
max_size <- length(row) * length(col)
max_cols <- col
max_rows <- row
max_sum <- sum(y)
}
}
}
}
}
}
return(x[max_rows,max_cols, drop=F])
}
hope this will work for you, any question please email me. 希望这对您有用,如有任何问题,请给我发送电子邮件。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.