How to find the rows with the maximum value within each class in an unsorted data file using R?

Suppose the content of file.dat is the following:

#a1 a2  b1  b2  c
2.0 1.0 1.0 2.0 0.3
1.0 1.0 1.0 2.0 0.9
1.0 2.0 1.0 2.0 0.6
3.0 3.0 3.0 2.0 0.6
1.0 3.0 1.0 2.0 0.87
2.0 1.0 3.0 2.0 0.9
3.0 1.0 3.0 2.0 0.85
1.0 3.0 1.0 2.0 0.89
1.0 3.0 3.0 2.0 0.7
2.0 1.0 3.0 2.0 0.5
3.0 1.0 2.0 2.0 0.7
1.0 1.0 3.0 2.0 0.88
3.0 2.0 1.0 2.0 0.3
2.0 2.0 1.0 2.0 0.5
2.0 2.0 3.0 2.0 0.8
2.0 3.0 1.0 2.0 0.3
3.0 1.0 3.0 2.0 0.83
1.0 2.0 1.0 2.0 0.3
2.0 3.0 2.0 2.0 0.3
3.0 3.0 3.0 2.0 0.6
1.0 1.0 2.0 2.0 0.8
2.0 3.0 3.0 2.0 0.7
2.0 2.0 3.0 2.0 0.85
1.0 2.0 3.0 2.0 0.81
3.0 2.0 1.0 2.0 0.9
3.0 2.0 3.0 2.0 0.82
3.0 3.0 3.0 2.0 0.84

I want to create a reduced data file with the same columns (a1, a2, b1, b2, c) such that, for each class of (a1, a2) (for example, all three rows with a1=1.0, a2=1.0), only the rows in which c reaches its maximum for that class are kept, and the rows with non-maximum c values are deleted. For example, for the class (a1, a2)=(1.0, 1.0) the maximum value of c is 0.9, so the row with b1=1.0 and b2=2.0 is returned. I want the output of this simple example to be the following.

#a1 a2  b1  b2  c
1.0 1.0 1.0 2.0 0.9
1.0 2.0 3.0 2.0 0.8
1.0 3.0 1.0 2.0 0.9
2.0 1.0 3.0 2.0 0.9
2.0 2.0 3.0 2.0 0.85
2.0 3.0 3.0 2.0 0.8
3.0 1.0 3.0 2.0 0.85
3.0 2.0 1.0 2.0 0.9
3.0 3.0 3.0 2.0 0.8

I want to learn how to do this in R (or, preferably, in NumPy with Python). Any help is greatly appreciated. I know which.max() might help, but honestly I don't know how to apply it. I am very new to R programming as well as NumPy.

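You can use a combination of the by and do.call functions: by splits data into one piece per (a1, a2) pair and keeps the rows whose c equals the maximum of that piece, and do.call(rbind, ...) stacks the pieces back into a single data.frame: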

res <- do.call(rbind, by(data,INDICES=list(data$a1,data$a2),FUN=function(x){x[x$c == max(x$c),]}))
res

# > res
#    a1 a2 b1 b2    c
# 2   1  1  1  2 0.90
# 6   2  1  3  2 0.90
# 7   3  1  3  2 0.85
# 24  1  2  3  2 0.81
# 23  2  2  3  2 0.85
# 25  3  2  1  2 0.90
# 8   1  3  1  2 0.89
# 22  2  3  3  2 0.70
# 27  3  3  3  2 0.84

where data is your input data.frame.

In this example, data is:

data <- 
read.csv(sep=',',text=
"a1,a2,b1,b2,c
2.0,1.0,1.0,2.0,0.3
1.0,1.0,1.0,2.0,0.9
1.0,2.0,1.0,2.0,0.6
3.0,3.0,3.0,2.0,0.6
1.0,3.0,1.0,2.0,0.87
2.0,1.0,3.0,2.0,0.9
3.0,1.0,3.0,2.0,0.85
1.0,3.0,1.0,2.0,0.89
1.0,3.0,3.0,2.0,0.7
2.0,1.0,3.0,2.0,0.5
3.0,1.0,2.0,2.0,0.7
1.0,1.0,3.0,2.0,0.88
3.0,2.0,1.0,2.0,0.3
2.0,2.0,1.0,2.0,0.5
2.0,2.0,3.0,2.0,0.8
2.0,3.0,1.0,2.0,0.3
3.0,1.0,3.0,2.0,0.83
1.0,2.0,1.0,2.0,0.3
2.0,3.0,2.0,2.0,0.3
3.0,3.0,3.0,2.0,0.6
1.0,1.0,2.0,2.0,0.8
2.0,3.0,3.0,2.0,0.7
2.0,2.0,3.0,2.0,0.85
1.0,2.0,3.0,2.0,0.81
3.0,2.0,1.0,2.0,0.9
3.0,2.0,3.0,2.0,0.82
3.0,3.0,3.0,2.0,0.84")
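
Since the question also asks about Python, the same split, filter, and recombine idea can be sketched with the standard library only. This is a minimal illustration, not a translation of the R code: the function name rows_with_group_max is made up for this example, and the rows are a small subset of the question's file.dat.

```python
from collections import defaultdict

# A few rows taken from the question's file.dat: (a1, a2, b1, b2, c)
rows = [
    (2.0, 1.0, 1.0, 2.0, 0.3),
    (1.0, 1.0, 1.0, 2.0, 0.9),
    (1.0, 3.0, 1.0, 2.0, 0.87),
    (2.0, 1.0, 3.0, 2.0, 0.9),
    (1.0, 3.0, 1.0, 2.0, 0.89),
    (2.0, 1.0, 3.0, 2.0, 0.5),
    (1.0, 1.0, 3.0, 2.0, 0.88),
    (1.0, 1.0, 2.0, 2.0, 0.8),
]


def rows_with_group_max(rows):
    """Hypothetical helper: keep the rows whose c is the maximum of their (a1, a2) class."""
    # Split: collect the rows of each (a1, a2) class (the role of `by` in R).
    groups = defaultdict(list)
    for row in rows:
        groups[row[:2]].append(row)

    # Apply and combine: filter each class, then stack the survivors
    # (the role of do.call(rbind, ...) in R).
    result = []
    for key in sorted(groups):
        best = max(r[4] for r in groups[key])
        result.extend(r for r in groups[key] if r[4] == best)
    return result


reduced = rows_with_group_max(rows)
for row in reduced:
    print(*row)
```

Unlike the R result, the classes here come out sorted by (a1, a2) because of the explicit sorted call; drop it if the order does not matter.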

In R, you could use dplyr for this task. For each group of (a1, a2) pairs, filter keeps only the rows where c equals the maximum c of that group. Note that this may return several rows per group when there are ties. A second example follows for the case where you only need one row per group.

library(dplyr)           # load dplyr (install it first with install.packages("dplyr"))

dat %>%                  # if `dat` is your input data.frame
   group_by(a1, a2) %>% 
   filter(c == max(c))  

#   a1 a2 b1 b2    c
#1   1  1  1  2 0.90
#2   1  3  1  2 0.80
#3   2  1  3  2 0.90
#4   3  1  3  2 0.85
#5   1  3  1  2 0.80
#6   2  3  3  2 0.70
#7   2  2  3  2 0.85
#8   1  2  3  2 0.80
#9   3  2  1  2 0.90
#10  3  3  3  2 0.80

dat %>% 
  group_by(a1, a2) %>% 
  filter(c == max(c)) %>%  
  filter(row_number() == 1)   # keep only the first row of each group
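
The difference between keeping every tied row and keeping only the first one can also be sketched in plain Python. The helper max_rows and its keep_ties flag are hypothetical names for this illustration; the rows are a few of the rounded values used elsewhere on this page, where class (1, 3) has two rows with c = 0.8.

```python
# Rounded sample rows (a1, a2, b1, b2, c); class (1.0, 3.0) has a tie at c = 0.8
rows = [
    (1.0, 3.0, 1.0, 2.0, 0.8),
    (1.0, 3.0, 3.0, 2.0, 0.7),
    (1.0, 3.0, 1.0, 2.0, 0.8),
    (2.0, 3.0, 1.0, 2.0, 0.3),
    (2.0, 3.0, 3.0, 2.0, 0.7),
]


def max_rows(rows, keep_ties=True):
    """Hypothetical helper: rows holding the maximum c of each (a1, a2) class."""
    best = {}  # (a1, a2) -> list of rows sharing the current maximum c
    for row in rows:
        key, c = row[:2], row[4]
        if key not in best or c > best[key][0][4]:
            best[key] = [row]          # new maximum: start over
        elif c == best[key][0][4] and keep_ties:
            best[key].append(row)      # tie: keep it only if requested
    return [r for key in sorted(best) for r in best[key]]


all_ties = max_rows(rows)                     # like filter(c == max(c))
first_only = max_rows(rows, keep_ties=False)  # like the extra "first row" filter
print(len(all_ties), len(first_only))
```

With keep_ties=False the first row that reaches the maximum wins, which mirrors keeping the first row of each group.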

Assuming your data is in a data.frame named dd (something like this)

#sample data
dd <- structure(list(a1 = c(2, 1, 1, 3, 1, 2, 3, 1, 1, 2, 3, 1, 3, 
2, 2, 2, 3, 1, 2, 3, 1, 2, 2, 1, 3, 3, 3), a2 = c(1, 1, 2, 3, 
3, 1, 1, 3, 3, 1, 1, 1, 2, 2, 2, 3, 1, 2, 3, 3, 1, 3, 2, 2, 2, 
2, 3), b1 = c(1, 1, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 1, 3, 1, 
3, 1, 2, 3, 2, 3, 3, 3, 1, 3, 3), b2 = c(2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), 
    c = c(0.3, 0.9, 0.6, 0.6, 0.8, 0.9, 0.85, 0.8, 0.7, 0.5, 
    0.7, 0.8, 0.3, 0.5, 0.8, 0.3, 0.8, 0.3, 0.3, 0.6, 0.8, 0.7, 
    0.85, 0.8, 0.9, 0.8, 0.8)), .Names = c("a1", "a2", "b1", 
"b2", "c"), class = "data.frame", row.names = c(NA, -27L))

you can use ave

dd[with(dd, ave(c,a1,a2,FUN=function(x) x==max(x)))==1, ]

to keep only the rows that hold the maximum c of each a1/a2 group, which gives

   a1 a2 b1 b2    c
2   1  1  1  2 0.90
5   1  3  1  2 0.80
6   2  1  3  2 0.90
7   3  1  3  2 0.85
8   1  3  1  2 0.80
22  2  3  3  2 0.70
23  2  2  3  2 0.85
24  1  2  3  2 0.80
25  3  2  1  2 0.90
27  3  3  3  2 0.80
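
The ave approach works by giving every row the statistic of its own group, so the comparison yields one flag per row. A rough NumPy analogue of that idea is sketched below, assuming NumPy is installed; the inline text stands in for file.dat and contains only a few of the question's rows.

```python
import io

import numpy as np

# Stand-in for file.dat; np.loadtxt skips lines starting with "#" by default.
text = """#a1 a2 b1 b2 c
2.0 1.0 1.0 2.0 0.3
1.0 1.0 1.0 2.0 0.9
1.0 3.0 1.0 2.0 0.87
2.0 1.0 3.0 2.0 0.9
1.0 3.0 1.0 2.0 0.89
2.0 1.0 3.0 2.0 0.5
1.0 1.0 3.0 2.0 0.88
1.0 1.0 2.0 2.0 0.8
"""
data = np.loadtxt(io.StringIO(text))

# Label each row with the index of its (a1, a2) class.
keys, inverse = np.unique(data[:, :2], axis=0, return_inverse=True)
inverse = inverse.ravel()  # guard against shape differences between NumPy versions

# Maximum of c per class, then broadcast back to one value per row (what ave does).
group_max = np.full(len(keys), -np.inf)
np.maximum.at(group_max, inverse, data[:, 4])
mask = data[:, 4] == group_max[inverse]

# Keep the flagged rows and sort them by a1, then a2.
reduced = data[mask]
reduced = reduced[np.lexsort((reduced[:, 1], reduced[:, 0]))]
print(reduced)
```

Tied rows are all kept, as with the ave version. To write the result to disk, np.savetxt could be used on reduced.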

If you actually want to use Python, try the following:

dictionary = {}
with open("input.dat", "r") as F:
    for line in F:
        # skip blank lines and the "#a1 a2 ..." header line
        if not line.strip() or line.startswith("#"):
            continue
        line = line.split()  # split on any run of whitespace
        key = str(line[:2])  # the (a1, a2) class of this row
        a_values = line[:2]
        value = float(line[4])
        b_values = line[2:4]
        if key not in dictionary:
            dictionary[key] = {"b_values":[b_values], "a_values":a_values}
            dictionary[key]["max_value"] = value
        else:
            if value < dictionary[key]["max_value"]:
                continue
            elif value > dictionary[key]["max_value"]:
                # new maximum: discard the rows stored so far
                dictionary[key]["max_value"] = value
                dictionary[key]["b_values"] = [b_values]
                dictionary[key]["a_values"] = a_values
            else: # value == max_value: keep this tied row as well
                dictionary[key]["b_values"].append(b_values)

for key in dictionary:
    for entry in dictionary[key]["b_values"]:
        print(dictionary[key]["a_values"][0], dictionary[key]["a_values"][1], entry[0], entry[1], dictionary[key]["max_value"])
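
Because the question asks for a reduced data file rather than printed output, here is a hedged sketch of the same dictionary idea that writes its result to disk, sorted by (a1, a2) and with the header line preserved. The function name reduce_file and the temporary file names are assumptions made for this example.

```python
import os
import tempfile


def reduce_file(src, dst):
    """Hypothetical helper: copy to dst only the rows of src with the maximum c per (a1, a2)."""
    best = {}      # (a1, a2) -> (max c, list of rows as string fields)
    header = None
    with open(src) as f:
        for line in f:
            if line.startswith("#"):
                header = line.rstrip("\n")  # keep the header for the output file
                continue
            fields = line.split()
            if not fields:
                continue                    # skip blank lines
            key = (float(fields[0]), float(fields[1]))
            c = float(fields[4])
            if key not in best or c > best[key][0]:
                best[key] = (c, [fields])
            elif c == best[key][0]:
                best[key][1].append(fields)  # tie: keep this row as well
    with open(dst, "w") as out:
        if header is not None:
            out.write(header + "\n")
        for key in sorted(best):
            for fields in best[key][1]:
                out.write(" ".join(fields) + "\n")


# Small demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "file.dat")
    dst = os.path.join(tmp, "reduced.dat")
    with open(src, "w") as f:
        f.write("#a1 a2  b1  b2  c\n"
                "2.0 1.0 1.0 2.0 0.3\n"
                "1.0 1.0 1.0 2.0 0.9\n"
                "2.0 1.0 3.0 2.0 0.9\n"
                "1.0 1.0 3.0 2.0 0.88\n")
    reduce_file(src, dst)
    with open(dst) as f:
        output_text = f.read()
print(output_text)
```

The rows are written back exactly as they were read, so the number formatting of the input file is preserved.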
