How to add a new column with the median value of several existing columns in R (dplyr)?

I have a dataset with hundreds of columns and thousands of rows. A sample of this dataset is:

df <- read.table(header=TRUE, 
                 text="
    income_2007 occupation_2007 income_2008 occupation_2008 income_2009 occupation_2009 income_2010 occupation_2010 income_2011 occupation_2011 income_2012 occupation_2012 income_2013 occupation_2013 income_2014 occupation_2014
1   6117    XYZ 6339    XYZ 6240    XYZ 6274    XYZ 7834    XYZ 8722    XYZ 7770    XYZ 6849    XYZ
2   6774    XYZ 8168    XYZ 8225    XYZ 7944    XYZ 8442    XYZ 9048    XYZ 9031    XYZ 8630    XYZ
3   NA  NA  NA  NA  1390    XYZ NA  NA  2758    XYZ 4054    XYZ 5666    XYZ 5398    XYZ
4   2237    XYZ 3134    XYZ 5118    XYZ 4484    XYZ 4441    XYZ 5125    XYZ 5012    XYZ 4406    XYZ
5   NA  NA  NA  NA  2521    XYZ 2025    XYZ 3218    XYZ 4422    XYZ 5684    XYZ 6448    XYZ
6   9636    XYZ 8738    XYZ 8944    XYZ 10206   XYZ 11427   XYZ 13310   XYZ 12397   XYZ 13758   XYZ
7   5406    XYZ 6761    XYZ 8603    XYZ 8914    XYZ 9706    XYZ 9017    XYZ 9980    XYZ 10926   XYZ
8   1227    XYZ NA  NA  NA  NA  2302    XYZ 4246    XYZ 4798    XYZ 6175    XYZ 5978    XYZ
9   NA  NA  NA  NA  NA  NA  NA  NA  1547    XYZ 2631    XYZ 1931    XYZ 2488    XYZ
10  4876    XYZ 4438    XYZ 5844    XYZ 5884    XYZ 7532    XYZ 7739    XYZ 8630    XYZ 9573    XYZ
") 

In this example, there is one column for the average income and one for the main occupation in each of the 8 years from 2007 to 2014.

I need to create two new columns: the first with the natural logarithm of the median income for the first four years (2007-2010), and the second with the natural logarithm of the median income for the last four years (2011-2014). The expected result is:

df1 <- read.table(header=TRUE, dec=",",
                 text="
income_2007 occupation_2007 income_2008 occupation_2008 income_2009 occupation_2009 income_2010 occupation_2010 income_2011 occupation_2011 income_2012 occupation_2012 income_2013 occupation_2013 income_2014 occupation_2014 ln_inc_07_10    ln_inc_11_14
1   6117    XYZ 6339    XYZ 6240    XYZ 6274    XYZ 7834    XYZ 8722    XYZ 7770    XYZ 6849    XYZ 8,741456116 8,96213539
2   6774    XYZ 8168    XYZ 8225    XYZ 7944    XYZ 8442    XYZ 9048    XYZ 9031    XYZ 8630    XYZ 8,994172434 9,085966917
3   NA  NA  NA  NA  1390    XYZ NA  NA  2758    XYZ 4054    XYZ 5666    XYZ 5398    XYZ 7,237059026 8,460834458
4   2237    XYZ 3134    XYZ 5118    XYZ 4484    XYZ 4441    XYZ 5125    XYZ 5012    XYZ 4406    XYZ 8,245121966 8,46094025
5   NA  NA  NA  NA  2521    XYZ 2025    XYZ 3218    XYZ 4422    XYZ 5684    XYZ 6448    XYZ 7,728855824 8,527737405
6   9636    XYZ 8738    XYZ 8944    XYZ 10206   XYZ 11427   XYZ 13310   XYZ 12397   XYZ 13758   XYZ 9,136693832 9,461371427
7   5406    XYZ 6761    XYZ 8603    XYZ 8914    XYZ 9706    XYZ 9017    XYZ 9980    XYZ 10926   XYZ 8,946635209 9,194515822
8   1227    XYZ NA  NA  NA  NA  2302    XYZ 4246    XYZ 4798    XYZ 6175    XYZ 5978    XYZ 7,475622643 8,591929538
9   NA  NA  NA  NA  NA  NA  NA  NA  1547    XYZ 2631    XYZ 1931    XYZ 2488    XYZ NA  7,700521525
10  4876    XYZ 4438    XYZ 5844    XYZ 5884    XYZ 7532    XYZ 7739    XYZ 8630    XYZ 9573    XYZ 8,586719254 9,009997401
") 

I would rather compute both new columns with dplyr, as this is the approach I am already using. What I have tried:

library(dplyr)
df1 <- df %>% mutate(ln_inc_07_10 = log(median(income_2007, income_2008, income_2009, income_2010)),
                     ln_inc_11_14 = log(median(income_2011, income_2012, income_2013, income_2014)))

However, this computes the same value for every row. I need the median to be computed across columns, separately for each row.

Please note that there are some missing values. Those should not be counted. When a row has an even number of valid income values, the median should be the average of the two middle observations.
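Base R's median already behaves this way when a row's values are passed as a single vector with na.rm = TRUE. A minimal check, using the 2007-2010 incomes of rows 1, 3 and 9 from the sample above (the variable names m_even, m_single and m_none are only illustrative):

```r
# Row 1: four valid values, so the median is the mean of the two middle ones
m_even <- median(c(6117, 6339, 6240, 6274), na.rm = TRUE)  # (6240 + 6274) / 2

# Row 3: only one valid value is left after the NAs are dropped
m_single <- median(c(NA, NA, 1390, NA), na.rm = TRUE)

# Row 9: no valid values at all, so the median (and its log) is NA
m_none <- median(c(NA_real_, NA_real_, NA_real_, NA_real_), na.rm = TRUE)

# Natural logarithm of each median
log(c(m_even, m_single, m_none))
```

The first logarithm agrees with the expected ln_inc_07_10 for row 1, and the all-NA row yields NA as in the expected result for row 9.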

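We can use rowwise with the current code. median expects a single vector as its first argument, so the incomes of each row are combined with c(), and na.rm = TRUE drops the missing values: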

library(dplyr)
df %>%
    rowwise() %>%    # evaluate the following mutate() one row at a time
    mutate(ln_inc_07_10 = log(median(c(income_2007, income_2008,
                                       income_2009, income_2010), na.rm = TRUE)),
           ln_inc_11_14 = log(median(c(income_2011, income_2012,
                                       income_2013, income_2014), na.rm = TRUE))) %>%
    ungroup()        # drop the row-wise grouping again
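With hundreds of columns, typing every name becomes tedious. A possible variant, assuming dplyr 1.0.0 or later, builds the column names with paste0 and selects them with c_across(all_of(...)). This is a sketch rather than part of the answer above; df_demo is a hypothetical three-row stand-in for df, and early_cols is an illustrative name:

```r
library(dplyr)

# Hypothetical stand-in for the question's df: income and occupation
# columns are interleaved, as in the original data
df_demo <- data.frame(
  income_2007     = c(6117, NA, NA),
  occupation_2007 = c("XYZ", NA, NA),
  income_2008     = c(6339, NA, NA),
  occupation_2008 = c("XYZ", NA, NA),
  income_2009     = c(6240, 1390, NA),
  occupation_2009 = c("XYZ", "XYZ", NA),
  income_2010     = c(6274, NA, NA),
  occupation_2010 = c("XYZ", NA, NA)
)

# Build the income column names instead of typing each one
early_cols <- paste0("income_", 2007:2010)

result <- df_demo %>%
  rowwise() %>%
  # c_across() gathers the selected columns of the current row into one vector
  mutate(ln_inc_07_10 = log(median(c_across(all_of(early_cols)), na.rm = TRUE))) %>%
  ungroup()
```

Selecting by name with all_of skips the occupation columns, which a range such as income_2007:income_2010 would include because the two kinds of columns alternate.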
