
R: How to create a conditional column indirectly based on a non-static amount of other columns, in a data.table

I have the following data.table:

     Name    x    y   h 120Hz 800Hz 1000Hz 1200Hz
1: Tower1 1354  829 245     0     8      7      0
2: Tower2 2654  234 285     7     0      3      0
3: Tower3  822 3040 256     0     4      0      9
4: Tower4  987 2747 250     0     6      5      3
5: Tower5 1953 1739 301     0     0      8      2

You can create it with:

DT <- data.table(Name = c("Tower1", "Tower2", "Tower3", "Tower4", "Tower5"),
                 x = c(1354,2654,822,987,1953),
                 y = c(829,234,3040,2747,1739),
                 h = c(245,285,256,250,301),
                 `120Hz` = c(0,7,0,0,0),
                 `800Hz` = c(8,0,4,6,0),
                 `1000Hz` = c(7,3,0,5,8),
                 `1200Hz` = c(0,0,9,3,2))

In reality, it came from a previous, larger data.table. The last four columns were auto-generated from that other data.table using dcast, so there is no way to know beforehand the number or the names of the columns after column h. This is important.

The final goal is to create another column named "Range", whose value for each row depends on the values in the columns after column "h", as follows:

Consider the following associations between frequencies and ranges. These are the only established associations and they are static, so this information can be stored in a pre-defined data.table:

assoc <- data.table(Frq = c("800Hz", "1000Hz", "1200Hz"),
                    Rng = c(750,850,950))

For each of the four columns after column "h", the code should check whether the column name exists in assoc. If so, AND if the value in that column for the row in question in DT is NOT zero, then the code considers the respective Rng value (from assoc). After checking all four columns, the code should return the MAXIMUM of the ranges considered and store it in the new column "Range".


My approach:

Create one auxiliary column for each frequency column:

DT <- DT[, paste0(colnames(DT)[5:ncol(DT)],'_f') := 0]

Then I could use a conditional structure that implements the algorithm described above. Take column 800Hz_f as an example: it checks the value in column 800Hz and, if that value is not zero for the row in question, returns 750. At the end, the column Range simply takes the maximum of the previous four columns, the ones ending in _f. That's where I'm stuck: I can't find a useful command to do so, and everything I've tried throws an error.

Finally, the auxiliary _f columns should be deleted. If anyone knows a way to do it without creating auxiliary columns, that would be much better.

This is the expected result (prior to deletion of auxiliary columns):

     Name    x    y   h 120Hz 800Hz 1000Hz 1200Hz 120Hz_f 800Hz_f 1000Hz_f 1200Hz_f Range
1: Tower1 1354  829 245     0     8      7      0       0     750      850        0    850
2: Tower2 2654  234 285     7     0      3      0       0       0      850        0    850
3: Tower3  822 3040 256     0     4      0      9       0     750        0      950    950
4: Tower4  987 2747 250     0     6      5      3       0     750      850      950    950
5: Tower5 1953 1739 301     0     0      8      2       0       0      850      950    950

NOTE: There could be frequency columns that don't appear in assoc because the original data could have typos. In this example, the column 120Hz would always generate only zeros in column 120Hz_f, so it can never be considered for the maximum Range. That's OK.

A back and forth to long format can make this work:

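# Long format: one row per (tower, frequency column)
long <- melt(DT, measure.vars = patterns("Hz$"))
# Update join: Rng is the associated range, or 0 where the value is zero;
# frequencies absent from assoc (e.g. 120Hz) keep Rng = NA
long[assoc, on = c(variable = 'Frq'), Rng := i.Rng * (value != 0)]
# Back to wide format, one Rng column per frequency
wide <- dcast(long, Name + x + y + h ~ variable, max, value.var = 'Rng')
# Row-wise maximum across the frequency columns, ignoring NA
wide[, do.call(function(...) pmax(..., na.rm = TRUE), .SD), .SDcols = `120Hz`:`1200Hz`]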
#[1] 850 850 950 950 950

Or you can avoid creating the intermediate columns if you loop over assoc:

DT[, Range := -Inf]

assoc[, {DT[, Range := pmax(Range, (get(Frq) != 0) * Rng)]; NULL}, by = Frq]

DT
#     Name    x    y   h 120Hz 800Hz 1000Hz 1200Hz Range
#1: Tower1 1354  829 245     0     8      7      0   850
#2: Tower2 2654  234 285     7     0      3      0   850
#3: Tower3  822 3040 256     0     4      0      9   950
#4: Tower4  987 2747 250     0     6      5      3   950
#5: Tower5 1953 1739 301     0     0      8      2   950
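The loop above calls get(Frq) for every row of assoc, which would fail if assoc listed a frequency that has no column in DT. A minimal sketch of a guarded variant follows, assuming all Rng values are non-negative so that Range can start at 0; the names known, f and r are only illustrative:

```r
library(data.table)

DT <- data.table(Name = c("Tower1", "Tower2", "Tower3", "Tower4", "Tower5"),
                 x = c(1354, 2654, 822, 987, 1953),
                 y = c(829, 234, 3040, 2747, 1739),
                 h = c(245, 285, 256, 250, 301),
                 `120Hz` = c(0, 7, 0, 0, 0),
                 `800Hz` = c(8, 0, 4, 6, 0),
                 `1000Hz` = c(7, 3, 0, 5, 8),
                 `1200Hz` = c(0, 0, 9, 3, 2))
assoc <- data.table(Frq = c("800Hz", "1000Hz", "1200Hz"),
                    Rng = c(750, 850, 950))

# Keep only the associations whose frequency actually exists as a column in DT
# ("known" is an illustrative name, not part of the original answer)
known <- assoc[Frq %in% names(DT)]

# Assumption: every Rng is >= 0, so 0 is a safe starting value
DT[, Range := 0]

for (i in seq_len(nrow(known))) {
  f <- known$Frq[i]   # column name, e.g. "800Hz"
  r <- known$Rng[i]   # associated range, e.g. 750
  # Contribution is r where the column is non-zero, 0 otherwise
  set(DT, j = "Range", value = pmax(DT$Range, (DT[[f]] != 0) * r))
}

DT$Range
```

Because the column names are read from assoc and checked against names(DT), nothing here depends on how many frequency columns dcast produced.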

It is not exactly what you intended, but my motto is: when the algorithm does not fit the data, format the data to fit the algorithm.

A bit long but simple to implement.

I melt DT with the following code, keep only the non-zero values, and convert the frequency labels to numeric by removing the "Hz" suffix:

a <- melt(DT,id.vars=1:4)[value>0][,crit:=as.numeric(gsub("Hz","",variable))]

to get something like:

##> a
##      Name    x    y   h variable value crit
## 1: Tower1 1354  829 245    800Hz     8  800
## 2: Tower1 1354  829 245   1000Hz     7 1000
## 3: Tower2 2654  234 285    120Hz     7  120
## 4: Tower2 2654  234 285   1000Hz     3 1000
## 5: Tower3  822 3040 256    800Hz     4  800
## 6: Tower3  822 3040 256   1200Hz     9 1200
## 7: Tower4  987 2747 250    800Hz     6  800
## 8: Tower4  987 2747 250   1000Hz     5 1000
## 9: Tower4  987 2747 250   1200Hz     3 1200
## 10: Tower5 1953 1739 301   1000Hz     8 1000
## 11: Tower5 1953 1739 301   1200Hz     2 1200

Then find the maximum frequency for each tower:

## > a[,.(crit=max(crit)),by=Name]
##      Name crit
## 1: Tower1 1000
## 2: Tower2 1000
## 3: Tower3 1200
## 4: Tower4 1200
## 5: Tower5 1200

Then merge it back with a, so that only the row with the highest frequency remains for each tower:

b <- merge(setkey(a,Name,crit),setkey(a[,.(crit=max(crit)),by=Name],Name,crit))

To get something like

## > b
##      Name crit    x    y   h variable value
## 1: Tower1 1000 1354  829 245   1000Hz     7
## 2: Tower2 1000 2654  234 285   1000Hz     3
## 3: Tower3 1200  822 3040 256   1200Hz     9
## 4: Tower4 1200  987 2747 250   1200Hz     3
## 5: Tower5 1200 1953 1739 301   1200Hz     2

Then merge b with assoc:

## > merge(b,assoc,by.x="variable",by.y="Frq")
##    variable   Name crit    x    y   h value Rng
## 1:   1000Hz Tower1 1000 1354  829 245     7 850
## 2:   1000Hz Tower2 1000 2654  234 285     3 850
## 3:   1200Hz Tower3 1200  822 3040 256     9 950
## 4:   1200Hz Tower4 1200  987 2747 250     3 950
## 5:   1200Hz Tower5 1200 1953 1739 301     2 950
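Note that the steps above pick the highest frequency per tower and only then look up its range. That gives the expected result here because, in assoc, Rng increases with frequency and each tower's highest non-zero frequency happens to be listed. A hedged sketch of a variant that joins assoc first and takes the maximum of Rng directly might look like this (the names m and rng are illustrative):

```r
library(data.table)

DT <- data.table(Name = c("Tower1", "Tower2", "Tower3", "Tower4", "Tower5"),
                 x = c(1354, 2654, 822, 987, 1953),
                 y = c(829, 234, 3040, 2747, 1739),
                 h = c(245, 285, 256, 250, 301),
                 `120Hz` = c(0, 7, 0, 0, 0),
                 `800Hz` = c(8, 0, 4, 6, 0),
                 `1000Hz` = c(7, 3, 0, 5, 8),
                 `1200Hz` = c(0, 0, 9, 3, 2))
assoc <- data.table(Frq = c("800Hz", "1000Hz", "1200Hz"),
                    Rng = c(750, 850, 950))

# Long format, keeping only the non-zero readings
a <- melt(DT, id.vars = c("Name", "x", "y", "h"),
          variable.name = "Frq", variable.factor = FALSE)[value != 0]

# Inner join with assoc: frequencies that are not listed (e.g. 120Hz) drop out
m <- merge(a, assoc, by = "Frq")

# Maximum range per tower
rng <- m[, .(Range = max(Rng)), by = Name]

# Attach the result to DT by Name; DT keeps its original row order
DT[rng, on = "Name", Range := i.Range]

DT[, .(Name, Range)]
```

In this sketch a tower with no non-zero reading in any listed frequency would end up with Range = NA rather than 0; whether that is acceptable depends on how such towers should be treated.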
