R: Create a data frame summarising per-column unique values from another data frame and sort

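This question follows on from a discussion started here. I felt the further detail warranted a new question, since it differs from the question in the original post. Apologies if I have gone about this the wrong way.

Given a data frame similar to the following: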

mydf <- data.frame(X1=c("0_times","3-10_times", "11-20_times", "1-2_times","3-10_times",
                        "0_times","3-10_times", "11-20_times", "1-2_times","3-10_times" ),
                   X2=c('ab','bb','cb','db','eb','ab','bb','cb','db','eb'),
                   X3=c("11-20_times", "3-10_times","1-2_times","21-30_times","more_than_30_times",
                        "11-20_times", "3-10_times","1-2_times","21-30_times","more_than_30_times"),
                   X4=c("foo", "bar","fizz","buzz","weee","foo", "bar","fizz","buzz","weee"),
                   X5=c("3-10_times","1-2_times","0_times","more_than_30_times","11-20_times",
                        "21-30_times","1-2_times","0_times","3-10_times","11-20_times")
                   )

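I want to create a second data frame that stores the column names and a list/vector of the unique values from the first data frame.

Resulting in something like: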

  names vals
1    X1 0_times, 1-2_times, 3-10_times, 11-20_times
2    X2 ab, bb, cb, db, eb
3    X3 1-2_times, 3-10_times, 11-20_times, 21-30_times, more_than_30_times
4    X4 foo, bar, fizz, buzz, weee
5    X5 0_times, 1-2_times, 3-10_times, 11-20_times, 21-30_times, more_than_30_times

I create the second data frame as follows:

mydf2 <- data.frame(names = colnames(mydf))
mydf2$vals <- lapply(mydf, unique)

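I think that is fine so far. My challenge, however, is that I need the vectors that contain numbers (in this case only mydf2$X1) sorted in ascending numeric order, not merely by the first digit of each item.

With great help from a Stack user, the following was suggested as a way to sort a vector containing numbers, and it works perfectly on a single vector: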

mylist <- c('0_times','3-10_times','11_20_times','1-2_times','more_than_20_times')

o <- sapply(strsplit(mylist, '\\D+'), function(x) min(as.numeric(x[nzchar(x)])))
mylist[order(o)]
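To make the mechanics of that snippet visible, here is a minimal sketch that prints the intermediate pieces. The variable `parts` is a name introduced here for illustration only; everything else comes from the snippet above.

```r
mylist <- c('0_times','3-10_times','11_20_times','1-2_times','more_than_20_times')

# Split each string on runs of non-digit characters, keeping only the digit runs
parts <- strsplit(mylist, '\\D+')
parts[[2]]  # "3"  "10"
parts[[5]]  # ""   "20"  (a match at the start of the string yields a leading "")

# Drop the empty pieces and keep the smallest number per string as the sort key
o <- sapply(parts, function(x) min(as.numeric(x[nzchar(x)])))
o           # 0 3 11 1 20

mylist[order(o)]
```

The key is therefore the lower bound of each range, which is why "11_20_times" lands after "3-10_times" even though "1" sorts before "3" as text.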

When I try to apply it to the whole mydf2$vals column by swapping in the column name:

o <- sapply(strsplit(mydf2$vals, '\\D+'), function(x) min(as.numeric(x[nzchar(x)])))
mydf2$vals[order(o)]

error in evaluating the argument 'X' in selecting a method for function 'sapply': non-character argument

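So I have two questions:

  1. Is there a simpler way to achieve my goal?
  2. How can I modify the suggested sorting function so that the error does not occur?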
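On the second question, the error arises because mydf2$vals is a list (one character vector per column), while strsplit() only accepts a character vector. A minimal sketch of one way around it is to apply the sort to each list element in turn. The helper name `sort_by_min_number` and the shortened example data are assumptions made for illustration, not part of the original post.

```r
# Hypothetical helper: sorts one character vector by the smallest number in each string
sort_by_min_number <- function(v) {
  v <- as.character(v)
  key <- sapply(strsplit(v, "\\D+"), function(p) {
    nums <- as.numeric(p[nzchar(p)])
    # Strings without any digits get Inf, so they keep their relative order at the end
    if (length(nums)) min(nums) else Inf
  })
  v[order(key)]
}

# Shortened stand-in for the question's data
mydf <- data.frame(X1 = c("0_times", "3-10_times", "11-20_times", "1-2_times"),
                   X2 = c("ab", "bb", "cb", "db"))

mydf2 <- data.frame(names = colnames(mydf))
mydf2$vals <- lapply(mydf, unique)                    # a list column, not a character vector
mydf2$vals <- lapply(mydf2$vals, sort_by_min_number)  # sort each element separately

mydf2$vals[[1]]
# "0_times" "1-2_times" "3-10_times" "11-20_times"
```

Columns with no digits at all, such as X2 here, are returned in their original order, because every element receives the same key.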

Perhaps just reset the order of mydf (i.e. paste a leading 0 onto the values of the X1 column before ordering):

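# Prefix "0" to values that do not already start with "11", so that string order matches numeric order
mydf <- mydf[order(ifelse(substr(mydf$X1, 1, 2) != "11", paste0("0", mydf$X1), mydf$X1)), ]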

Then run the same code as above for mydf2, and you get this:

  names                                        vals
1    X1 0_times, 1-2_times, 3-10_times, 11-20_times
2    X2                          ab, db, bb, eb, cb

Is this the expected output?
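The padding idea can be seen in isolation on a plain vector. The sketch below is a slight generalisation, not the answer's exact code: it pads any value that does not already start with two digits (an assumption on my part), and it uses `method = "radix"` so that the ordering does not depend on the locale.

```r
x <- c("0_times", "3-10_times", "11-20_times", "1-2_times")

# Pad with "0" unless the value already starts with two digits
padded <- ifelse(grepl("^\\d{2}", x), x, paste0("0", x))
padded
# "00_times" "03-10_times" "11-20_times" "01-2_times"

# Order by the padded strings, but return the original values
x[order(padded, method = "radix")]
# "0_times" "1-2_times" "3-10_times" "11-20_times"
```

This only works while every value starts with a number of at most two digits; a value such as "more_than_30_times" would still need a numeric key.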

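You can wrap @Allan Cameron's logic in a function, extending it to sort the unique values of a column. grepl tells us whether a column contains digits; if it does, we apply that logic, otherwise we fall back to a plain sort. I call it makeLevels because it may also be useful for creating factor levels.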

makeLevels <- \(x) {
  if (any(grepl('\\d', x))) {
    unique(x[order(sapply(strsplit(x, '\\D+'), function(x) min(as.numeric(x[nzchar(x)]))))])
  } else {
    sort(unique(x))
  }
}

lapply(names(mydf), \(x) data.frame(names=x, vals=toString(makeLevels(mydf[[x]])))) |>
  do.call(what=rbind)
#   names                                                                         vals
# 1    X1                                  0_times, 1-2_times, 3-10_times, 11-20_times
# 2    X2                                                           ab, bb, cb, db, eb
# 3    X3          1-2_times, 3-10_times, 11-20_times, 21-30_times, more_than_30_times
# 4    X4                                                   bar, buzz, fizz, foo, weee
# 5    X5 0_times, 1-2_times, 3-10_times, 11-20_times, 21-30_times, more_than_30_times

Or do something like this:

lapply(names(mydf), \(x) sprintf('$ %s <%s>: %s', x, class(mydf[[x]]), toString(makeLevels(mydf[[x]])))) |>
  do.call(what=rbind)
#      [,1]                                                                                            
# [1,] "$ X1 <character>: 0_times, 1-2_times, 3-10_times, 11-20_times"                                 
# [2,] "$ X2 <character>: ab, bb, cb, db, eb"                                                          
# [3,] "$ X3 <character>: 1-2_times, 3-10_times, 11-20_times, 21-30_times, more_than_30_times"         
# [4,] "$ X4 <character>: bar, buzz, fizz, foo, weee"                                                  
# [5,] "$ X5 <character>: 0_times, 1-2_times, 3-10_times, 11-20_times, 21-30_times, more_than_30_times"

This is similar to what str(mydf) does; if you only want to read the result, you can simply lapply makeLevels over the columns and call str on that.

str(lapply(mydf, makeLevels), vec.len=10L)
# List of 5
#  $ X1: chr [1:4] "0_times" "1-2_times" "3-10_times" "11-20_times"
#  $ X2: chr [1:5] "ab" "bb" "cb" "db" "eb"
#  $ X3: chr [1:5] "1-2_times" "3-10_times" "11-20_times" "21-30_times" "more_than_30_times"
#  $ X4: chr [1:5] "bar" "buzz" "fizz" "foo" "weee"
#  $ X5: chr [1:6] "0_times" "1-2_times" "3-10_times" "11-20_times" "21-30_times" "more_than_30_times"
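Since the answer notes that makeLevels may be useful for creating factor levels, here is a minimal sketch of that use. makeLevels is copied from above (written with `function` rather than `\(x)` so that it also runs on R versions before 4.1); the vector `x1` and the choice of an ordered factor are assumptions made for illustration.

```r
makeLevels <- function(x) {
  if (any(grepl('\\d', x))) {
    unique(x[order(sapply(strsplit(x, '\\D+'), function(x) min(as.numeric(x[nzchar(x)]))))])
  } else {
    sort(unique(x))
  }
}

x1 <- c("0_times", "3-10_times", "11-20_times", "1-2_times", "3-10_times")

# Use the numerically sorted unique values as the factor levels
f <- factor(x1, levels = makeLevels(x1), ordered = TRUE)
levels(f)
# "0_times" "1-2_times" "3-10_times" "11-20_times"

# Comparisons now follow the numeric order of the ranges
f[2] > f[4]
# TRUE
```

With the levels set this way, functions such as table() or plotting code that respects factor levels will list the categories in ascending numeric order.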

...
