Suppose the content of file.dat is the following:
#a1 a2 b1 b2 c
2.0 1.0 1.0 2.0 0.3
1.0 1.0 1.0 2.0 0.9
1.0 2.0 1.0 2.0 0.6
3.0 3.0 3.0 2.0 0.6
1.0 3.0 1.0 2.0 0.87
2.0 1.0 3.0 2.0 0.9
3.0 1.0 3.0 2.0 0.85
1.0 3.0 1.0 2.0 0.89
1.0 3.0 3.0 2.0 0.7
2.0 1.0 3.0 2.0 0.5
3.0 1.0 2.0 2.0 0.7
1.0 1.0 3.0 2.0 0.88
3.0 2.0 1.0 2.0 0.3
2.0 2.0 1.0 2.0 0.5
2.0 2.0 3.0 2.0 0.8
2.0 3.0 1.0 2.0 0.3
3.0 1.0 3.0 2.0 0.83
1.0 2.0 1.0 2.0 0.3
2.0 3.0 2.0 2.0 0.3
3.0 3.0 3.0 2.0 0.6
1.0 1.0 2.0 2.0 0.8
2.0 3.0 3.0 2.0 0.7
2.0 2.0 3.0 2.0 0.85
1.0 2.0 3.0 2.0 0.81
3.0 2.0 1.0 2.0 0.9
3.0 2.0 3.0 2.0 0.82
3.0 3.0 3.0 2.0 0.84
I want to create a reduced data file with the same columns (a1, a2, b1, b2, c) such that, for each class of (a1, a2) (for example, all three rows with a1=1, a2=1), only the rows whose c is the maximum within that class are kept, and the rows with non-maximum c values are deleted. For example, for the class (a1, a2)=(1.0, 1.0) the maximum c is 0.9, so b1=1.0 and b2=2.0 are returned. I want the output of this simple example to be the following:
#a1 a2 b1 b2 c
1.0 1.0 1.0 2.0 0.9
1.0 2.0 3.0 2.0 0.8
1.0 3.0 1.0 2.0 0.9
2.0 1.0 3.0 2.0 0.9
2.0 2.0 3.0 2.0 0.85
2.0 3.0 3.0 2.0 0.8
3.0 1.0 3.0 2.0 0.85
3.0 2.0 1.0 2.0 0.9
3.0 3.0 3.0 2.0 0.8
I want to learn how to do this in R (or preferably in NumPy (Python)). Any help is greatly appreciated. I know which.max() might help, but honestly I don't know how to apply it. I am very new to R programming as well as NumPy.
You can use a combination of the by and do.call functions:
res <- do.call(rbind, by(data,INDICES=list(data$a1,data$a2),FUN=function(x){x[x$c == max(x$c),]}))
res
# > res
# a1 a2 b1 b2 c
# 2 1 1 1 2 0.90
# 6 2 1 3 2 0.90
# 7 3 1 3 2 0.85
# 24 1 2 3 2 0.81
# 23 2 2 3 2 0.85
# 25 3 2 1 2 0.90
# 8 1 3 1 2 0.89
# 22 2 3 3 2 0.70
# 27 3 3 3 2 0.84
with data being your input data.frame. In this example, data is equal to this:
data <-
read.csv(sep=',',text=
"a1,a2,b1,b2,c
2.0,1.0,1.0,2.0,0.3
1.0,1.0,1.0,2.0,0.9
1.0,2.0,1.0,2.0,0.6
3.0,3.0,3.0,2.0,0.6
1.0,3.0,1.0,2.0,0.87
2.0,1.0,3.0,2.0,0.9
3.0,1.0,3.0,2.0,0.85
1.0,3.0,1.0,2.0,0.89
1.0,3.0,3.0,2.0,0.7
2.0,1.0,3.0,2.0,0.5
3.0,1.0,2.0,2.0,0.7
1.0,1.0,3.0,2.0,0.88
3.0,2.0,1.0,2.0,0.3
2.0,2.0,1.0,2.0,0.5
2.0,2.0,3.0,2.0,0.8
2.0,3.0,1.0,2.0,0.3
3.0,1.0,3.0,2.0,0.83
1.0,2.0,1.0,2.0,0.3
2.0,3.0,2.0,2.0,0.3
3.0,3.0,3.0,2.0,0.6
1.0,1.0,2.0,2.0,0.8
2.0,3.0,3.0,2.0,0.7
2.0,2.0,3.0,2.0,0.85
1.0,2.0,3.0,2.0,0.81
3.0,2.0,1.0,2.0,0.9
3.0,2.0,3.0,2.0,0.82
3.0,3.0,3.0,2.0,0.84")
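For readers who would rather stay in Python, the same split-apply-combine idea (split by (a1, a2), keep the rows at the group maximum, stack the pieces back together) can be sketched with the standard library alone. This is only an illustration, not a translation of by/do.call: the names TEXT and max_rows_per_group are made up here, and the rows are the same 27 rows as data above.

```python
from itertools import groupby

# Same 27 rows as the `data` frame above, as whitespace-separated text.
TEXT = """\
2.0 1.0 1.0 2.0 0.3
1.0 1.0 1.0 2.0 0.9
1.0 2.0 1.0 2.0 0.6
3.0 3.0 3.0 2.0 0.6
1.0 3.0 1.0 2.0 0.87
2.0 1.0 3.0 2.0 0.9
3.0 1.0 3.0 2.0 0.85
1.0 3.0 1.0 2.0 0.89
1.0 3.0 3.0 2.0 0.7
2.0 1.0 3.0 2.0 0.5
3.0 1.0 2.0 2.0 0.7
1.0 1.0 3.0 2.0 0.88
3.0 2.0 1.0 2.0 0.3
2.0 2.0 1.0 2.0 0.5
2.0 2.0 3.0 2.0 0.8
2.0 3.0 1.0 2.0 0.3
3.0 1.0 3.0 2.0 0.83
1.0 2.0 1.0 2.0 0.3
2.0 3.0 2.0 2.0 0.3
3.0 3.0 3.0 2.0 0.6
1.0 1.0 2.0 2.0 0.8
2.0 3.0 3.0 2.0 0.7
2.0 2.0 3.0 2.0 0.85
1.0 2.0 3.0 2.0 0.81
3.0 2.0 1.0 2.0 0.9
3.0 2.0 3.0 2.0 0.82
3.0 3.0 3.0 2.0 0.84
"""

def max_rows_per_group(rows):
    """Return every row whose c (index 4) equals the max c of its (a1, a2) group."""
    key = lambda r: (r[0], r[1])
    out = []
    # groupby only merges adjacent rows, so sort by the group key first.
    for _, grp in groupby(sorted(rows, key=key), key=key):
        grp = list(grp)
        best = max(r[4] for r in grp)
        out.extend(r for r in grp if r[4] == best)  # keeps ties
    return out

rows = [tuple(float(x) for x in line.split()) for line in TEXT.splitlines()]
result = max_rows_per_group(rows)
for r in result:
    print(*r)
```

The rows come out sorted by (a1, a2) rather than in the order by() uses, but they should be the same nine rows as res.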
In R, you could use dplyr for that task. For each group of a1 and a2 pairs, it will filter (return) only those rows where c equals the maximum c for that group. Note that this may result in several rows per group. I include another example in case you only need one row per group.
library(dplyr) # install the package first if needed, then load it
dat %>% # if `dat` is your input data.frame
  group_by(a1, a2) %>%
  filter(c == max(c))
# a1 a2 b1 b2 c
#1 1 1 1 2 0.90
#2 1 3 1 2 0.80
#3 2 1 3 2 0.90
#4 3 1 3 2 0.85
#5 1 3 1 2 0.80
#6 2 3 3 2 0.70
#7 2 2 3 2 0.85
#8 1 2 3 2 0.80
#9 3 2 1 2 0.90
#10 3 3 3 2 0.80
dat %>%
  group_by(a1, a2) %>%
  filter(c == max(c)) %>%
  filter(1:n() == 1) # this will make sure you only get the first row of each group
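To make the one-row-per-group variant concrete, here is a small Python sketch of the same idea. The rows are made up for illustration, in (a1, a2, b1, b2, c) order with a tie in group (1, 3), and first_max_per_group is a hypothetical helper, not part of any library.

```python
# Made-up rows in (a1, a2, b1, b2, c) order; group (1, 3) has a tie at c = 0.8.
rows = [
    (1, 1, 1, 2, 0.9),
    (1, 3, 1, 2, 0.8),
    (1, 3, 3, 2, 0.8),
    (1, 1, 3, 2, 0.8),
    (1, 3, 2, 2, 0.7),
]

def first_max_per_group(rows):
    """Keep one row per (a1, a2) group: the first row that reaches the max c."""
    best = {}
    for r in rows:
        k = (r[0], r[1])
        # Strict ">" keeps the earliest row when several share the maximum.
        if k not in best or r[4] > best[k][4]:
            best[k] = r
    return list(best.values())

picked = first_max_per_group(rows)
for r in picked:
    print(*r)
```

Changing the strict ">" to ">=" would keep the last tied row instead of the first.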
Assuming your data is in a data.frame named dd (something like this):
#sample data
dd <- structure(list(a1 = c(2, 1, 1, 3, 1, 2, 3, 1, 1, 2, 3, 1, 3,
2, 2, 2, 3, 1, 2, 3, 1, 2, 2, 1, 3, 3, 3), a2 = c(1, 1, 2, 3,
3, 1, 1, 3, 3, 1, 1, 1, 2, 2, 2, 3, 1, 2, 3, 3, 1, 3, 2, 2, 2,
2, 3), b1 = c(1, 1, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 1, 3, 1,
3, 1, 2, 3, 2, 3, 3, 3, 1, 3, 3), b2 = c(2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
c = c(0.3, 0.9, 0.6, 0.6, 0.8, 0.9, 0.85, 0.8, 0.7, 0.5,
0.7, 0.8, 0.3, 0.5, 0.8, 0.3, 0.8, 0.3, 0.3, 0.6, 0.8, 0.7,
0.85, 0.8, 0.9, 0.8, 0.8)), .Names = c("a1", "a2", "b1",
"b2", "c"), class = "data.frame", row.names = c(NA, -27L))
then you can use ave
dd[with(dd, ave(c,a1,a2,FUN=function(x) x==max(x)))==1, ]
to subset to the max value from each a1/a2 group to get
a1 a2 b1 b2 c
2 1 1 1 2 0.90
5 1 3 1 2 0.80
6 2 1 3 2 0.90
7 3 1 3 2 0.85
8 1 3 1 2 0.80
22 2 3 3 2 0.70
23 2 2 3 2 0.85
24 1 2 3 2 0.80
25 3 2 1 2 0.90
27 3 3 3 2 0.80
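Since the question also asked about NumPy, here is a rough sketch of the same masking idea as ave: compute each row's group maximum, then keep the rows where c equals it. It assumes NumPy is installed; the variable names are made up for illustration, and the arrays are copied from dd above.

```python
import numpy as np

# Columns copied from the dd sample data above.
a1 = np.array([2, 1, 1, 3, 1, 2, 3, 1, 1, 2, 3, 1, 3, 2,
               2, 2, 3, 1, 2, 3, 1, 2, 2, 1, 3, 3, 3], dtype=float)
a2 = np.array([1, 1, 2, 3, 3, 1, 1, 3, 3, 1, 1, 1, 2, 2,
               2, 3, 1, 2, 3, 3, 1, 3, 2, 2, 2, 2, 3], dtype=float)
b1 = np.array([1, 1, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 1,
               3, 1, 3, 1, 2, 3, 2, 3, 3, 3, 1, 3, 3], dtype=float)
b2 = np.full(27, 2.0)
c = np.array([0.3, 0.9, 0.6, 0.6, 0.8, 0.9, 0.85, 0.8, 0.7, 0.5,
              0.7, 0.8, 0.3, 0.5, 0.8, 0.3, 0.8, 0.3, 0.3, 0.6,
              0.8, 0.7, 0.85, 0.8, 0.9, 0.8, 0.8])
data = np.column_stack([a1, a2, b1, b2, c])

# Label each row with the index of its (a1, a2) group.
groups, inverse = np.unique(data[:, :2], axis=0, return_inverse=True)
inverse = inverse.ravel()  # some NumPy versions return a 2-D inverse here

# Per-group maximum of c, accumulated with an unbuffered in-place maximum.
group_max = np.full(len(groups), -np.inf)
np.maximum.at(group_max, inverse, c)

# Like ave(): spread each group's maximum back onto its rows and compare.
mask = c == group_max[inverse]
result = data[mask]
print(result)
```

Rows keep their original order, and tied maxima (the two 0.8 rows in group (1, 3)) are all kept, as in the ave output.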
If you actually want to use Python, try the following:
# Map each (a1, a2) pair to its maximum c and the (b1, b2) pairs that reach it.
dictionary = {}
with open("file.dat", "r") as F:
    for line in F:
        line = line.strip()
        # Skip blank lines and the "#a1 a2 b1 b2 c" header.
        if not line or line.startswith("#"):
            continue
        line = line.split()
        key = tuple(line[:2])
        a_values = line[:2]
        value = float(line[4])
        b_values = line[2:4]
        if key not in dictionary:
            dictionary[key] = {"b_values": [b_values], "a_values": a_values}
            dictionary[key]["max_value"] = value
        else:
            if value < dictionary[key]["max_value"]:
                continue
            elif value > dictionary[key]["max_value"]:
                dictionary[key]["max_value"] = value
                dictionary[key]["b_values"] = [b_values]
                dictionary[key]["a_values"] = a_values
            else:  # value == max_value: keep every tied row
                dictionary[key]["b_values"].append(b_values)
# Print the groups sorted by their (a1, a2) key.
for key in sorted(dictionary):
    for entry in dictionary[key]["b_values"]:
        print(dictionary[key]["a_values"][0], dictionary[key]["a_values"][1], entry[0], entry[1], dictionary[key]["max_value"])