How to find the percentile for each observation in a data frame in R?

Suppose we have a simple data set (here a two-column matrix):

structure(c(2, 4, 5, 6, 8, 1, 2, 4, 6, 67, 8, 11), dim = c(6L, 
2L), dimnames = list(NULL, c("lo", "li")))

How can I find the percentile of each observation for both variables?

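The most R-friendly way is to (i) convert it to a data frame (or tibble), (ii) reshape the data to long format, (iii) group by the original column (lo and li), and (iv) compute the percent rank.

Here is the code: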

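library(dplyr)
library(tidyr)

df %>%                                           # df holds the matrix from the question
  as_tibble() %>%                                # convert the matrix to a tibble
  gather(key = variable, value = val) %>%        # reshape to long format
  group_by(variable) %>%                         # one group per original column (lo, li)
  mutate(percentile = percent_rank(val) * 100)   # percent rank within each group, 0 to 100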

   variable   val percentile
   <chr>    <dbl>      <dbl>
 1 lo           2         20
 2 lo           4         40
 3 lo           5         60
 4 lo           6         80
 5 lo           8        100
 6 lo           1          0
 7 li           2          0
 8 li           4         20
 9 li           6         40
10 li          67        100
11 li           8         60
12 li          11         80
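
For reference, `percent_rank()` rescales the minimum rank to the range 0 to 1, that is (rank - 1) / (n - 1). A minimal base R sketch of the same calculation follows; the helper name `pct_rank` is made up for illustration, and it assumes no missing values and at least two observations per column.

```r
# The matrix from the question
m <- structure(c(2, 4, 5, 6, 8, 1, 2, 4, 6, 67, 8, 11),
               dim = c(6L, 2L), dimnames = list(NULL, c("lo", "li")))

# Hypothetical helper: (minimum rank - 1) / (n - 1), scaled to 0-100.
# Assumes x has no NA values and more than one element.
pct_rank <- function(x) {
  (rank(x, ties.method = "min") - 1) / (length(x) - 1) * 100
}

# Apply the helper to every column; the result is a matrix
# with the same shape and column names as m
pct <- apply(m, 2, pct_rank)
pct
```

Under this definition the smallest value in a column always gets 0 and the largest gets 100, which matches the output above.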

If you don't want to make the data frame long, just handle the two columns separately:

df%>%
  as_tibble()%>%
  mutate(lo_pr=percent_rank(lo)*100)%>%
  mutate(li_percentile=percent_rank(li)*100)


     lo    li lo_pr li_percentile
  <dbl> <dbl> <dbl>         <dbl>
1     2     2    20             0
2     4     4    40            20
3     5     6    60            40
4     6    67    80           100
5     8     8   100            60
6     1    11     0            80
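
With more than a couple of columns, writing one `mutate()` per column gets repetitive. A possible sketch using `across()` (this assumes dplyr 1.0.0 or later; the `_percentile` suffix is an arbitrary naming choice, not something from the answer above):

```r
library(dplyr)

# The matrix from the question
m <- structure(c(2, 4, 5, 6, 8, 1, 2, 4, 6, 67, 8, 11),
               dim = c(6L, 2L), dimnames = list(NULL, c("lo", "li")))

# Add one percentile column per existing column in a single mutate() call.
# .names controls how the new columns are named.
res <- m %>%
  as_tibble() %>%
  mutate(across(everything(),
                ~ percent_rank(.x) * 100,
                .names = "{.col}_percentile"))
res
```

This keeps the wide layout and should give the same values as the two separate `mutate()` calls, only with uniform column names.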

Here is a dplyr approach to get the median and the 5% and 95% quantiles.

library(tidyverse)
data = structure(c(2, 4, 5, 6, 8, 1, 2, 4, 6, 67, 8, 11),
                 dim = c(6L, 2L),
                 dimnames = list(NULL, c("lo", "li")))

data %>% 
  as.data.frame() %>% # Coerce to dataframe
  pivot_longer(cols = everything()) %>%  # Pivot to long format
  group_by(name) %>% # For each unique group..
  summarise(perc5 = quantile(value, 0.05), # Calculate 5% quantile
            median = median(value), # Calculate median
            perc95 = quantile(value, 0.95)) # Calculate 95% quantile

#> # A tibble: 2 × 4
#>   name  perc5 median perc95
#>   <chr> <dbl>  <dbl>  <dbl>
#> 1 li     2.5     7     53  
#> 2 lo     1.25    4.5    7.5

Created on 2023-01-27 with reprex v2.0.2
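
The fractional results (for example 2.5 for the 5% quantile of li) come from the interpolation rule that `quantile()` uses by default (`type = 7`). A rough sketch of that rule, assuming no missing values; `quantile_type7` is a hypothetical helper written only to show the arithmetic:

```r
# Hypothetical helper reproducing R's default (type 7) quantile rule:
# linear interpolation between neighbouring order statistics.
quantile_type7 <- function(x, p) {
  x <- sort(x)
  h <- (length(x) - 1) * p + 1   # fractional position in the sorted vector
  lo <- floor(h)                 # index of the order statistic just below
  hi <- ceiling(h)               # index of the order statistic just above
  x[lo] + (h - lo) * (x[hi] - x[lo])
}

li <- c(2, 4, 6, 67, 8, 11)

# Compare the hand-rolled version with the built-in function
quantile_type7(li, c(0.05, 0.5, 0.95))
unname(quantile(li, c(0.05, 0.5, 0.95)))
```

For li the sorted values are 2, 4, 6, 8, 11, 67, so the 5% quantile sits a quarter of the way between 2 and 4. Other `type` values in `quantile()` use different rules and can give different numbers on a sample this small.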

A data.table solution

library(data.table)

data <- data.table(data)

q <- c(0.05, 0.95)

melt(data, measure.vars = names(data))[, setNames(as.list(quantile(value, q)), paste("q", q * 100, sep = "_")), variable]

Result

   variable  q_5 q_95
1:       lo 1.25  7.5
2:       li 2.50 53.0

Data

data = structure(
  c(2, 4, 5, 6, 8, 1, 2, 4, 6, 67, 8, 11),
  dim = c(6L, 2L), 
  dimnames = list(NULL, c("lo", "li"))
)
