
R: order of a data frame and selection

I would appreciate it if someone could give me some direction on how to solve a complex ordering of a matrix and select the top 2 elements in each subcategory.

Code:

index<-1:14
metric<-c(0.037777,0.041143,0.041043,0.042056,0.043701,0.042169,0.042134,
          0.046565,0.044638,0.036653,0.046221,0.04033,0.045385,0.043873)
cat_1<-c("California Munis","California Munis","California Munis","California Munis",
         "California Munis","California Munis","California Munis","Corporate Bonds",
         "Corporate Bonds","Corporate Bonds","Government Bonds","Government Bonds",
         "High Yield Bonds","High Yield Bonds")
cat_2<-c("California Munis","Corporate Bonds","Corporate Bonds","Government Bonds",
         "High Yield Bonds","High Yield Bonds","High Yield Bonds","High Yield Bonds",
         "High Yield Bonds","High Yield Bonds","California Munis","California Munis",
         "Corporate Bonds","Corporate Bonds")

data<-data.frame(cbind(index,metric,cat_1,cat_2))

which produces the matrix below:

Ind Metric     Cat_1                Cat_2
1   0.037777    California Munis    California Munis
2   0.041143    California Munis    Corporate Bonds
3   0.041043    California Munis    Corporate Bonds
4   0.042056    California Munis    Government Bonds
5   0.043701    California Munis    High Yield Bonds
6   0.042169    California Munis    High Yield Bonds
7   0.042134    California Munis    High Yield Bonds
8   0.046565    Corporate Bonds     High Yield Bonds
9   0.044638    Corporate Bonds     High Yield Bonds
10  0.036653    Corporate Bonds     High Yield Bonds
11  0.046221    Government Bonds    California Munis
12  0.04033     Government Bonds    California Munis
13  0.045385    High Yield Bonds    Corporate Bonds
14  0.043873    High Yield Bonds    Corporate Bonds

Given the matrix above, I would like to order it by Cat_1, Cat_2 and Metric. I have tried this:

data[order(data[,3],data[,4],data[,2]),]

However, the order of Cat_1 and Cat_2 should not matter when the pair of entries is the same. As an example, "California Munis" & "Corporate Bonds" = "Corporate Bonds" & "California Munis". The outcome I am looking for should look like the following matrix:

Ind Metric      Cat_1               Cat_2               Selection
1   0.037777    California Munis    California Munis    1
2   0.041143    California Munis    Corporate Bonds     1
3   0.041043    California Munis    Corporate Bonds     2
11  0.046221    Government Bonds    California Munis    1
4   0.042056    California Munis    Government Bonds    2
12  0.04033     Government Bonds    California Munis    
5   0.043701    California Munis    High Yield Bonds    1
6   0.042169    California Munis    High Yield Bonds    2
7   0.042134    California Munis    High Yield Bonds    
8   0.046565    Corporate Bonds     High Yield Bonds    1
13  0.045385    High Yield Bonds    Corporate Bonds     2
9   0.044638    Corporate Bonds     High Yield Bonds    
14  0.043873    High Yield Bonds    Corporate Bonds 
10  0.036653    Corporate Bonds     High Yield Bonds    

The last column marks the top 2 lines in each subcategory, which are the ones I need to extract.

Any ideas or code would be highly appreciated.

Thanks

Please abandon the use of data.frame(cbind(...)). It will only cause you grief.
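To see why, here is a small sketch (the names `bad` and `good` are illustrative, not from the question). `cbind()` on mixed numeric and character vectors first builds a character matrix, so the numeric column is no longer numeric by the time `data.frame()` sees it:

```r
index  <- 1:3
metric <- c(0.037777, 0.041143, 0.041043)
cat_1  <- c("California Munis", "Corporate Bonds", "Government Bonds")

# cbind() coerces everything to one common type (character) before data.frame() runs
bad <- data.frame(cbind(index, metric, cat_1))

# passing the vectors directly keeps each column's own type
good <- data.frame(index, metric, cat_1)

is.numeric(bad$metric)   # FALSE: metric was converted to text (a factor in R < 4.0)
is.numeric(good$metric)  # TRUE
```

With `bad`, an expression such as `-metric` inside `order()` cannot work as a numeric sort key.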

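 # Build the data frame directly (no cbind) so metric stays numeric;
 # stringsAsFactors = TRUE makes the categories factors in R >= 4.0 as well.
 data <- data.frame(index, metric, cat_1, cat_2, stringsAsFactors = TRUE)

 # Sort by the unordered pair (larger level code, smaller level code),
 # then by descending metric.
 newdat <- data[ with( data, 
                 order( pmax( as.numeric(cat_1), as.numeric(cat_2) ), 
                        pmin( as.numeric(cat_1), as.numeric(cat_2) ) ,
                      - metric) ) , ]
 # Number the rows within each unordered pair.
 newdat$selection <- ave(newdat$index, 
                         first=pmax( as.numeric(newdat$cat_1), 
                                     as.numeric(newdat$cat_2) ), 
                        second= pmin( as.numeric(newdat$cat_1), 
                                      as.numeric(newdat$cat_2) ) ,
                         FUN=seq_along)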
#-----------------------------------------
> newdat
   index   metric            cat_1            cat_2 selection
1      1 0.037777 California Munis California Munis         1
2      2 0.041143 California Munis  Corporate Bonds         1
3      3 0.041043 California Munis  Corporate Bonds         2
11    11 0.046221 Government Bonds California Munis         1
4      4 0.042056 California Munis Government Bonds         2
12    12 0.040330 Government Bonds California Munis         3
5      5 0.043701 California Munis High Yield Bonds         1
6      6 0.042169 California Munis High Yield Bonds         2
7      7 0.042134 California Munis High Yield Bonds         3
8      8 0.046565  Corporate Bonds High Yield Bonds         1
13    13 0.045385 High Yield Bonds  Corporate Bonds         2
9      9 0.044638  Corporate Bonds High Yield Bonds         3
14    14 0.043873 High Yield Bonds  Corporate Bonds         4
10    10 0.036653  Corporate Bonds High Yield Bonds         5

The requirement for success here is that the levels in the two cat variables are the same. If not, then make them the same by rebuilding each factor on the combined level set, e.g. factor(cat_1, levels = union(levels(cat_1), levels(cat_2))), and likewise for cat_2.
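A minimal sketch of that harmonisation step, using two hypothetical factors `f1` and `f2` (not from the question) whose level sets differ. Rebuilding with `factor(x, levels = ...)` re-codes the values, so the integer codes become comparable across the two factors:

```r
# Hypothetical factors with different level sets
f1 <- factor(c("California Munis", "Corporate Bonds"))
f2 <- factor(c("High Yield Bonds", "California Munis"))

# Rebuild both factors on the shared level set
all_levels <- union(levels(f1), levels(f2))
f1 <- factor(f1, levels = all_levels)
f2 <- factor(f2, levels = all_levels)

as.numeric(f1)  # 1 2
as.numeric(f2)  # 3 1
```

After this, the same category name maps to the same integer code in both factors, which is what the pmax/pmin comparison relies on.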

I expand on my comment:

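# introduce a combined category that ignores the order of cat_1 and cat_2
# (the words of both names are sorted, so either order gives the same key)
cat3 <- sapply(paste(data$cat_1,data$cat_2,sep=" "),
               function(x){paste(sort(strsplit(x," ")[[1]]), collapse=" ")},
               USE.NAMES=FALSE)
data$cat_3 <- cat3
# order as desired; metric is converted back to numeric because
# data.frame(cbind(...)) stored it as text
data1 <- data[order( data$cat_3 , -as.numeric(as.character(data$metric))), ]
# label and select top 2 in each cat (lapply always returns a list, so unlist is safe)
data1$rankByCat <- unlist(lapply(unique(data1$cat_3), function(mycat, mydf)  {seq_len(sum(mydf$cat_3==mycat))}, mydf=data1))
data1[data1$rankByCat < 3, !names(data1)%in%c("cat_3")]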

@andrei

I have got the sorting part with the following code:

#concatenate the 2 strings
cat_3<-paste(data[,3],data[,4],sep="  ")

#break the string to 2 (creates a list)
temp_split<-strsplit(cat_3,"  ")

#sort by row
sort_split<-sapply(temp_split,sort)

#bind split
out<-cbind(data,t(sort_split))

Is that the best way to write it?

How would I proceed from here to select the top 2 of each category?

Thanks for the help!
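One possible continuation, as a sketch rather than a definitive answer: it assumes the data frame is built without cbind() so that metric is numeric, and the column names `pair` and `rank` as well as the objects `sorted` and `top2` are made up for illustration. The pair key puts the alphabetically smaller name first, so both orderings of a pair land in the same group:

```r
index  <- 1:14
metric <- c(0.037777, 0.041143, 0.041043, 0.042056, 0.043701, 0.042169, 0.042134,
            0.046565, 0.044638, 0.036653, 0.046221, 0.04033, 0.045385, 0.043873)
cat_1 <- c(rep("California Munis", 7), rep("Corporate Bonds", 3),
           rep("Government Bonds", 2), rep("High Yield Bonds", 2))
cat_2 <- c("California Munis", "Corporate Bonds", "Corporate Bonds", "Government Bonds",
           rep("High Yield Bonds", 6), rep("California Munis", 2),
           rep("Corporate Bonds", 2))
data <- data.frame(index, metric, cat_1, cat_2, stringsAsFactors = FALSE)

# order-independent key: smaller name first, larger name second
data$pair <- paste(pmin(data$cat_1, data$cat_2),
                   pmax(data$cat_1, data$cat_2), sep = " / ")

# sort by pair, then by descending metric
sorted <- data[order(data$pair, -data$metric), ]

# number the rows within each pair and keep the two largest metrics
sorted$rank <- ave(sorted$index, sorted$pair, FUN = seq_along)
top2 <- sorted[sorted$rank <= 2, ]
print(top2)
```

On the sample data this keeps rows 1, 2, 3, 11, 4, 5, 6, 8 and 13, which matches the Selection column in the question.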
