R - binned frequency for all data table columns

I am working with a data table that contains a distance matrix of meteorological stations (in metres). Each column holds the distances from one station to all the others.

Here's what the data looks like:

library(data.table)

df <- structure(list(`1` = c(NA, 183861.946148085, 455184.652723047,
84776.9412266853, 471037.870714541), `10` = c(183861.946148085,
NA, 323843.574006837, 268617.304324823, 456775.683807133), `100` = c(455184.652723047,
323843.574006837, NA, 527585.966416289, 757973.787598604), `1000` = c(84776.9412266853,
268617.304324823, 527585.966416289, NA, 498646.318627207), `1002` = c(471037.870714541,
456775.683807133, 757973.787598604, 498646.318627207, NA)), class = "data.frame", row.names = c(NA,
-5L))

dt <- setDT(df)

           1       10      100      1000     1002
1:        NA 183861.9 455184.7  84776.94 471037.9
2: 183861.95       NA 323843.6 268617.30 456775.7
3: 455184.65 323843.6       NA 527585.97 757973.8
4:  84776.94 268617.3 527586.0        NA 498646.3
5: 471037.87 456775.7 757973.8 498646.32       NA

where the first column shows the distances from station 1 to itself (NA) and to stations 10, 100, 1000 and 1002, and likewise for the other columns.

I am interested in finding out the number of stations that lie within 100, 150, 250 and 500 km from each station.

Hence, for each column I would like to determine the count (and proportion) of values lying within breaks <- c(100000, 150000, 250000, 500000).
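
For a single column, this is the binning I have in mind; a minimal sketch with cut() and table() (the bin labels are my own):

breaks <- c(100000, 150000, 250000, 500000)

# bin the distances from station 1; the NA diagonal entry is dropped by table()
table(cut(dt[["1"]], breaks = c(-Inf, breaks, Inf),
          labels = c("0-100000", "100000-150000", "150000-250000",
                     "250000-500000", "500000+")))

#      0-100000 100000-150000 150000-250000 250000-500000       500000+
#             1             0             1             2             0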

The expected output would be:

ID      distance count prop
1       0-100000 1       25
1  100000-150000 0        0
1  150000-250000 1       25
1  250000-500000 2       50
    ...

My real dataset has dimensions 1500 x 1500, so what is the fastest way to achieve the expected result?

Since your dataset is symmetric, we need only consider the upper or the lower triangle.

library(dplyr)

breaks <- c(100000, 150000, 250000, 500000)
df[upper.tri(df)] <- NA  # keep only the lower triangle; the diagonal is already NA
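
For the sample data this blanks everything above the diagonal, so each station keeps a different number of distances, and column 1002 becomes all NA and drops out of the long format below. A quick check of the remaining non-NA counts per column:

colSums(!is.na(df))
#    1   10  100 1000 1002
#    4    3    2    1    0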

We can reshape the data to long format, cut the values based on breaks, and calculate the count and proportion for each ID.

df %>%
  # long format, dropping the NA cells
  tidyr::pivot_longer(cols = everything(),
                      values_drop_na = TRUE) %>%
  # bin each distance and count occurrences per station
  count(name, value = cut(value, c(-Inf, breaks, Inf), c('0-100000',
         '100000-150000', '150000-250000', '250000-500000', '500000+'))) %>%
  group_by(name) %>%
  mutate(prop = n/sum(n) * 100)

#  name  value             n  prop
#  <chr> <fct>         <int> <dbl>
#1 1     0-100000          1    25
#2 1     150000-250000     1    25
#3 1     250000-500000     2    50
#4 10    250000-500000     3   100
#5 100   500000+           2   100
#6 1000  250000-500000     1   100
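
Note that the empty bins from your expected output (e.g. 100000-150000 for station 1) are dropped here, because cut() returns a factor and count() drops unused factor levels by default; passing .drop = FALSE to count() should keep them with n = 0.

Since the real matrix is 1500 x 1500, a data.table version of the same steps may be faster. This is a sketch under the same assumptions (the masked df/dt and breaks from above), not a benchmark:

labs <- c('0-100000', '100000-150000', '150000-250000',
          '250000-500000', '500000+')

# melt to long format, dropping NA cells, then bin and count per station
long <- melt(dt, measure.vars = names(dt), variable.name = 'ID', na.rm = TRUE)
long[, distance := cut(value, c(-Inf, breaks, Inf), labs)]
res <- long[, .(count = .N), by = .(ID, distance)]
res[, prop := count / sum(count) * 100, by = ID]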
