简体   繁体   English

将数据框子集为R中“最佳平方”的函数

[英]Function that subsets a dataframe to the “best square” in R

My objective is to write a function in R that takes a dataframe as input and returns the "best square subset" of it. 我的目标是在R中编写一个将数据框作为输入并返回其“最佳平方子集”的函数。

By best square subset I mean that the output needs to confirm the below: 最好的平方子集,我的意思是输出需要确认以下内容:

  • All cells/elements are above 2000 所有单元格/元素均高于2000
  • It has as many cells/elements as possible 它具有尽可能多的单元格/元素
  • In case of ties (2 data frames that fit the above criteria and contain the same count of cells) return the one with the highest sum of the cells/elements 如果是平局(2个符合上述条件并包含相同单元格数的数据帧),则返回单元格/元素总和最高的一个

Let's take the three following examples: 让我们以以下三个示例为例:

example1 <- structure(list(Afternoon = c(20800L, 15254L, 17426L, 4391L, 39194L
), Evening = c(21679L, 0L, 2973L, 37L, 435L), Morning = c(0L, 
                                                          3726L, 0L, 0L, 0L)), .Names = c("Afternoon", "Evening", "Morning"
                                                          ), row.names = c("Friday", "Monday", "Thursday", "Tuesday", "Wednesday"
                                                          ), class = "data.frame")

example2 <- structure(list(Afternoon = c(1227364L, 219402L, 3L, 0L, 530891L, 
                                         153124L, 281788L), Evening = c(570618L, 167216L, 31L, 10L, 88702L, 
                                                                        161006L, 42L), Morning = c(0L, 121775L, 0L, 0L, 0L, 25133L, 270162L
                                                                        )), .Names = c("Afternoon", "Evening", "Morning"), row.names = c("Friday", 
                                                                                                                                         "Monday", "Saturday", "Sunday", "Thursday", "Tuesday", "Wednesday"
                                                                        ), class = "data.frame")

example3 <- structure(list(Afternoon = c(20800L, 258L, 300L, 563L, 2000L
), Evening = c(21679L, 0L, 2973L, 37L, 435L), Morning = c(0L, 
                                                          3726L, 0L, 0L, 0L)), .Names = c("Afternoon", "Evening", "Morning"
                                                          ), row.names = c("Friday", "Monday", "Thursday", "Tuesday", "Wednesday"
                                                          ), class = "data.frame")

That look like this: 看起来像这样:

> example1
          Afternoon Evening Morning
Friday        20800   21679       0
Monday        15254       0    3726
Thursday      17426    2973       0
Tuesday        4391      37       0
Wednesday     39194     435       0

> example2
          Afternoon Evening Morning
Friday      1227364  570618       0
Monday       219402  167216  121775
Saturday          3      31       0
Sunday            0      10       0
Thursday     530891   88702       0
Tuesday      153124  161006   25133
Wednesday    281788      42  270162

> example3
          Afternoon Evening Morning
Friday        20800   21679       0
Monday          258       0    3726
Thursday        300    2973       0
Tuesday         563      37       0
Wednesday      2000     435       0

The function I'm looking for should subset the above 3 examples to the following 3 respectively: 我正在寻找的功能应将上述3个示例分别分为以下3个:

> output1
          Afternoon
Friday        20800
Monday        15254
Thursday      17426
Tuesday        4391
Wednesday     39194

The score/area of the square is 5. Anything else would be less. 正方形的得分/面积是5。 For instance, selecting Friday,Thursday Afternoon evening would yield a score of 4 例如,选择“星期五”,“星期四”,“下午”,晚上的得分为4

> output2
         Afternoon Evening
Friday     1227364  570618
Monday      219402  167216
Thursday    530891   88702
Tuesday     153124  161006

Here, someone's first thought would be to select all Monday, Tuesday & wednesday times and get a score of 9. However, Wendesday evenings's 42 validates the first criterion. 在这里,某人的第一个想法是选择星期一,星期二和星期三的所有时间,并获得9分。但是,Wendesday晚间的42分验证了第一个标准。 Monday & Tuesday all days and times would yield a score/area of 6 星期一和星期二的所有天和时间总分/区域为6

> output3
       Afternoon Evening
Friday     20800   21679

Here we have two possible squares that validate the first 2 criteria: Friday afternoon and evening or Friday and Wednesday afternoon. 在这里,我们有两个可能的正方形可以验证前两个条件:星期五下午和晚上或星期五和星期三下午。 We have to go with the first choice as the sum inside the cells is higher than in the second case. 我们必须选择第一种方法,因为单元格内的总和要高于第二种情况。 This rule is applied only in the case of ties. 此规则仅在平局的情况下适用。

The most intuitive solution would be to go through all possible combinations of the rows and columns and check whether the selected rows and columns form a full square or not, if yes, then check if that form the largest number possible. 最直观的解决方案是遍历行和列的所有可能组合,并检查选定的行和列是否形成一个完整的正方形,如果是,则检查是否形成最大的正方形。 A potential problem with this approach would be that if you have many columns and rows, it will take ages to finish which is not optimal. 这种方法的潜在问题是,如果您有许多列和行,则需要花费一定的时间才能完成,这不是最佳选择。 My answer to the question which comes here work reasonably well (on my PC with 16GB RAM, 2.7 GHz CPU, and Windows 10pro 64 bit, R version 3.5.1) when number of columns and rows are not very larger than 12. 当列和行的数量不大于12时,我对这里出现的问题的回答相当有效(在具有16GB RAM,2.7 GHz CPU和Windows 10pro 64位,R版本3.5.1的PC上)

#library(gtools)

find_best_square <- function(x, thresh = 2000){
    # x <- example1
    x[x<thresh] <- 0

    # for larger datasets only: removing lonely cells
    if (ncol(x) > 7 | nrow(x)> 7){
        for (i in 1:nrow(x)){
            for (j in 1:ncol(x)){
                if((colSums(x[,j,drop=F]) == x[i,j]) & (rowSums(x[i,,drop=F])==x[i,j])) x[i, j] <- 0L 
            }
        }
    }

    # remove columns with no data
    is_colZero <- colSums(x==0)== nrow(x)
    if(any(is_colZero)) print(paste('this column is empty and removed: ', which(is_colZero)))
    x <- x[,!is_colZero]

    # remove rows with no data
    is_rowZero <- rowSums(x==0)==ncol(x)
    if(any(is_rowZero)) print(paste('this row is empty and removed: ', which(is_rowZero)))
    x <- x[!is_rowZero,]

    n <- ncol(x)
    m <- nrow(x)
    max_size <- 0L
    max_sum <- 0L
    jump_i <- 0L
    jump_j <- 0L

    for (i in n:1){ # cols
        # all possible combination
        next_max <- m  * (i-1)

        if(max_size!=0 & next_max < max_size &  i * m < max_size) {
            jump_i <- jump_i + 1
            next()
        }
        comb_col <- combinations(n,i)
        for (k in 1:nrow(comb_col)){
            col <- as.integer(comb_col[k,])
            for(j in m:1){ # rows
                if (i*j < max_size ) {
                    jump_j <- jump_j +1
                    next()
                }
                comb_row <- combinations(m,j)
                for (l in 1:nrow(comb_row)){
                    row <- as.integer(comb_row[l,])
                    y <- x[row, col, drop=F]
                    if(all(y > 0) & max_size <= length(row)*length(col)){
                        if(max_size == length(row)*length(col)){
                            if(sum(y) > max_sum){ 
                                max_size <- length(row) * length(col)
                                max_cols <- col
                                max_rows <- row
                                max_sum <- sum(y)}
                        } else {
                            max_size <- length(row) * length(col)
                            max_cols <- col
                            max_rows <- row
                            max_sum <- sum(y) 
                        }

                    }

                }
            }

        }


    }   
    return(x[max_rows,max_cols, drop=F])
}

hope this will work for you, any question please email me. 希望这对您有用,如有任何问题,请给我发送电子邮件。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM