R: Create a data frame summarising per-column unique values from another data frame and sort them
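This question follows on from a discussion that started here. I felt the extra detail warranted a new question, because it differs from the question in the original post. Apologies if that was the wrong thing to do.
Given a data frame like the following: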
mydf <- data.frame(X1=c("0_times","3-10_times", "11-20_times", "1-2_times","3-10_times",
"0_times","3-10_times", "11-20_times", "1-2_times","3-10_times" ),
X2=c('ab','bb','cb','db','eb','ab','bb','cb','db','eb'),
X3=c("11-20_times", "3-10_times","1-2_times","21-30_times","more_than_30_times",
"11-20_times", "3-10_times","1-2_times","21-30_times","more_than_30_times"),
X4=c("foo", "bar","fizz","buzz","weee","foo", "bar","fizz","buzz","weee"),
X5=c("3-10_times","1-2_times","0_times","more_than_30_times","11-20_times",
"21-30_times","1-2_times","0_times","3-10_times","11-20_times")
)
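I would like to create a second data frame that stores the column names together with a list/vector of the unique values of each column in the first data frame.
The result should look something like: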
names vals
1 X1 0_times, 1-2_times, 3-10_times, 11-20_times
2 X2 ab, bb, cb, db, eb
3 X3 1-2_times, 3-10_times, 11-20_times, 21-30_times, more_than_30_times
4 X4 foo, bar, fizz, buzz, weee
5 X5 0_times, 1-2_times, 3-10_times, 11-20_times, 21-30_times, more_than_30_times
I created the second data frame as follows:
mydf2 <- data.frame(names = colnames(mydf))
mydf2$vals <- lapply(mydf, unique)
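I think this is fine so far. The challenge is that I need the vectors that contain numbers (in this case only mydf2$X1) sorted in ascending numeric order, not merely by the first digit of each item.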
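With the kind help of a Stack user, the following was suggested as a way to sort a vector whose items contain numbers, and it works perfectly on a single vector: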
mylist <- c('0_times','3-10_times','11_20_times','1-2_times','more_than_20_times')
o <- sapply(strsplit(mylist, '\\D+'), function(x) min(as.numeric(x[nzchar(x)])))
mylist[order(o)]
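To make the intermediate values visible, the same idea can be split into steps. This is only a sketch of the snippet above; the helper name `min_number` and the variables `parts` and `sorted` are introduced here for illustration.

```r
mylist <- c('0_times', '3-10_times', '11_20_times', '1-2_times', 'more_than_20_times')

# Split each item on runs of non-digit characters.
# A non-digit run at the start of an item leaves an empty first piece.
parts <- strsplit(mylist, '\\D+')
print(parts[[5]])   # "" "20"

# Smallest number found in each item, ignoring empty pieces
# (min_number is a hypothetical helper name).
min_number <- function(p) min(as.numeric(p[nzchar(p)]))
o <- vapply(parts, min_number, numeric(1))
print(o)            # 0 3 11 1 20

# Reorder the original items by that numeric key.
sorted <- mylist[order(o)]
print(sorted)
# "0_times" "1-2_times" "3-10_times" "11_20_times" "more_than_20_times"
```

The sort key is the smallest number in each item, so items are compared as numbers rather than as text.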
When I try to apply it to the whole mydf2$vals column by substituting the column name:
o <- sapply(strsplit(mydf2$vals, '\\D+'), function(x) min(as.numeric(x[nzchar(x)])))
mydf2$vals[order(o)]
error in evaluating the argument 'X' in selecting a method for function 'sapply': non-character argument
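The error most likely comes from the fact that mydf2$vals is a list column (one character vector per row), while strsplit expects a character vector. A minimal sketch, assuming a cut-down two-column version of mydf and a hypothetical helper named `sort_by_min_number`, applies the ordering to each list element separately:

```r
# Cut-down stand-in for mydf (assumption: two columns are enough to illustrate).
mydf <- data.frame(
  X1 = c("0_times", "3-10_times", "11-20_times", "1-2_times", "3-10_times"),
  X2 = c("ab", "bb", "cb", "db", "eb")
)

mydf2 <- data.frame(names = colnames(mydf))
# as.character() guards against factor columns in older R versions.
mydf2$vals <- lapply(mydf, function(col) unique(as.character(col)))

# vals is a list, so passing it to strsplit() directly raises an error.
err <- tryCatch(strsplit(mydf2$vals, '\\D+'), error = function(e) e)
print(inherits(err, "error"))   # TRUE

# Hypothetical helper: sort one character vector by the smallest number
# in each item; vectors without any digits are returned unchanged.
sort_by_min_number <- function(v) {
  if (!any(grepl('\\d', v))) return(v)
  keys <- vapply(strsplit(v, '\\D+'),
                 function(p) min(as.numeric(p[nzchar(p)])),
                 numeric(1))
  v[order(keys)]
}

# Apply the helper to each element of the list column.
mydf2$vals <- lapply(mydf2$vals, sort_by_min_number)
print(mydf2$vals[[1]])
# "0_times" "1-2_times" "3-10_times" "11-20_times"
```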
So I have two questions:
Perhaps just reset the order of mydf (i.e. paste a leading 0 onto the values in the X1 column):
mydf <- mydf[order(ifelse(substr(mydf$X1,1,2)!=11, paste0("0",mydf$X1), mydf$X1)),]
Then run the same code as above for mydf2, and you get this:
names vals
1 X1 0_times, 1-2_times, 3-10_times, 11-20_times
2 X2 ab, db, bb, eb, cb
Is this the expected output?
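The leading-zero idea can be illustrated on a plain vector. This is a sketch only, assuming every label starts with a one- or two-digit number; the names `plain`, `padded` and `fixed` are made up for the example, and method = "radix" is used just to keep the text ordering independent of the locale.

```r
v <- c("0_times", "3-10_times", "11-20_times", "1-2_times")

# Text ordering compares character by character,
# so "11-20_times" ends up before "3-10_times".
plain <- v[order(v, method = "radix")]
print(plain)
# "0_times" "1-2_times" "11-20_times" "3-10_times"

# Pad labels that start with a single digit so that text order
# matches numeric order (assumption: at most two leading digits).
padded <- ifelse(grepl("^\\d(\\D|$)", v), paste0("0", v), v)

# Order by the padded key, but keep the original labels.
fixed <- v[order(padded, method = "radix")]
print(fixed)
# "0_times" "1-2_times" "3-10_times" "11-20_times"
```

Padding only fixes the ordering when the number of leading digits is bounded, which is why extracting the number itself is usually the more general approach.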
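You can wrap @Allan Cameron's logic in a function and extend it to sort the unique values of a column. grepl tells us whether the column contains digits; if it does, we apply that logic, otherwise we fall back to a plain sort. I call it makeLevels because it may also be useful for creating factor levels.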
makeLevels <- \(x) {
if (any(grepl('\\d', x))) {
unique(x[order(sapply(strsplit(x, '\\D+'), function(x) min(as.numeric(x[nzchar(x)]))))])
} else {
sort(unique(x))
}
}
lapply(names(mydf), \(x) data.frame(names=x, vals=toString(makeLevels(mydf[[x]])))) |>
do.call(what=rbind)
# names vals
# 1 X1 0_times, 1-2_times, 3-10_times, 11-20_times
# 2 X2 ab, bb, cb, db, eb
# 3 X3 1-2_times, 3-10_times, 11-20_times, 21-30_times, more_than_30_times
# 4 X4 bar, buzz, fizz, foo, weee
# 5 X5 0_times, 1-2_times, 3-10_times, 11-20_times, 21-30_times, more_than_30_times
Or do something like this:
lapply(names(mydf), \(x) sprintf('$ %s <%s>: %s', x, class(mydf[[x]]), toString(makeLevels(mydf[[x]])))) |>
do.call(what=rbind)
# [,1]
# [1,] "$ X1 <character>: 0_times, 1-2_times, 3-10_times, 11-20_times"
# [2,] "$ X2 <character>: ab, bb, cb, db, eb"
# [3,] "$ X3 <character>: 1-2_times, 3-10_times, 11-20_times, 21-30_times, more_than_30_times"
# [4,] "$ X4 <character>: bar, buzz, fizz, foo, weee"
# [5,] "$ X5 <character>: 0_times, 1-2_times, 3-10_times, 11-20_times, 21-30_times, more_than_30_times"
This resembles what str(mydf) does; if you only want to read the result, you can simply lapply makeLevels over the data frame and call str on it.
str(lapply(mydf, makeLevels), vec.len=10L)
# List of 5
# $ X1: chr [1:4] "0_times" "1-2_times" "3-10_times" "11-20_times"
# $ X2: chr [1:5] "ab" "bb" "cb" "db" "eb"
# $ X3: chr [1:5] "1-2_times" "3-10_times" "11-20_times" "21-30_times" "more_than_30_times"
# $ X4: chr [1:5] "bar" "buzz" "fizz" "foo" "weee"
# $ X5: chr [1:6] "0_times" "1-2_times" "3-10_times" "11-20_times" "21-30_times" "more_than_30_times"
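Since makeLevels is named for its possible use in creating factor levels, here is a minimal sketch of that use. The function is repeated (written with function(x) instead of the \(x) shorthand) so the block runs on its own; the variables `x1` and `f` are made up for the example.

```r
# Copy of makeLevels from above, so this block is self-contained.
makeLevels <- function(x) {
  if (any(grepl('\\d', x))) {
    unique(x[order(sapply(strsplit(x, '\\D+'),
                          function(p) min(as.numeric(p[nzchar(p)]))))])
  } else {
    sort(unique(x))
  }
}

x1 <- c("0_times", "3-10_times", "11-20_times", "1-2_times", "3-10_times")

# Use the numerically ordered unique values as factor levels.
f <- factor(x1, levels = makeLevels(x1))
print(levels(f))
# "0_times" "1-2_times" "3-10_times" "11-20_times"

# Sorting the factor now follows the level order, not the text order.
print(as.character(sort(f)))
# "0_times" "1-2_times" "3-10_times" "3-10_times" "11-20_times"
```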
...