What does the trim argument stand for in the mean() function?

I just can't understand the concept of trim. At first I thought it was rounding the numbers, but that doesn't make sense. Can anyone clarify what trim is doing here?

# The linkedin and facebook vectors have already been created for you
linkedin <- c(16, 9, 13, 5, 2, 17, 14)
facebook <- c(17, 7, 5, 16, 8, 13, 14)

# Calculate the mean of the sum
avg_sum <- mean(c(linkedin+facebook))

# Calculate the trimmed mean of the sum
avg_sum_trimmed <- mean(c(linkedin+facebook), trim = 0.2)

# Inspect both new variables
avg_sum
[1] 22.28571
avg_sum_trimmed
[1] 22.6

I'm comparing two mean() calls, one with and one without the trim argument. Any comments that help clarify this concept are welcome.

According to ?mean

trim: The fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed. Values of trim outside that range are taken as the nearest endpoint.
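
The last sentence means that, for example, any trim greater than 0.5 behaves like trim = 0.5, at which point the trimmed mean reduces to the median (a quick check of my own, not part of the original post):

mean(1:10, trim = 2)   # clamped to 0.5
#[1] 5.5
median(1:10)
#[1] 5.5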

If we use the vector 'v1'

v1 <- c(linkedin + facebook)

with length 7, the sorted values would be

v2 <- sort(v1)

Removing 20% of the observations from each end of the sorted vector amounts to removing the first and last observations (0.2 * 7 = 1.4, which R floors to 1 per end):

mean(v2[-c(1, 7)])
#[1] 22.6

which is equal to

mean(v1, trim = 0.2)
#[1] 22.6

Checking with trim = 0.4 (0.4 * 7 = 2.8, floored to 2 per end):

mean(v2[-c(1:2, 6:7)])
#[1] 22.33333
mean(v1, trim = 0.4)
#[1] 22.33333
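
Putting it together, here is a minimal sketch of the trimming logic for 0 <= trim < 0.5 (the helper name trimmed_mean is my own, not part of base R, but it reproduces the results above):

trimmed_mean <- function(x, p) {
  x <- sort(x)
  k <- floor(length(x) * p)  # observations dropped from EACH end
  mean(x[(k + 1):(length(x) - k)])
}

trimmed_mean(v1, 0.2)
#[1] 22.6
trimmed_mean(v1, 0.4)
#[1] 22.33333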

The code you show looks like an example from Intermediate R on DataCamp. Unfortunately, the course offers no further explanation of what a trimmed mean does or when you should actually use it. I also found myself quite lost about why we should use it. Here's what I found:

First of all, a trimmed mean is a robust estimator of central tendency. Its computation is quite simple: you only have to 1) remove a predetermined fraction of observations from each side of the distribution and then 2) average the remaining observations. By getting rid of some observations at each end of an asymmetric distribution, the trimmed mean estimates the bulk of the observations much better, and its standard error is less affected by outliers than that of the 'traditional' mean.
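
To see the robustness in action, here is a made-up vector of my own (not from the exercise) with one extreme value:

x <- c(2, 3, 4, 5, 6, 7, 100)  # 100 is an outlier
mean(x)                        # dragged upward by the single outlier
#[1] 18.14286
mean(x, trim = 0.2)            # drops 2 and 100
#[1] 5
median(x)
#[1] 5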

Let's see the DataCamp example you provided:

linkedin <- c(16, 9, 13, 5, 2, 17, 14)
facebook <- c(17, 7, 5, 16, 8, 13, 14)

If you add them

link_and_fb <- linkedin+facebook

#You get
> link_and_fb
[1] 33 16 18 21 10 30 28

Now remember that you wanted a 0.2 trimmed mean. Before computing it, R sorts your vector:

sorted <- sort(link_and_fb)
> sorted
[1] 10 16 18 21 28 30 33

Given that you have 7 observations, 0.2 * 7 = 1.4, which R floors to 1, so one observation is removed from each end of the distribution. Thus, you'll get rid of 10 and 33, and then divide the sum of the remaining five observations by 5:

(16+18+21+28+30)/5 = 22.6

#Which is what you get with
mean(c(linkedin+facebook), trim = 0.2)
[1] 22.6
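
If you want to confirm how many observations R drops per end for a given trim, the floor step can be checked directly (my own quick check, not from the course):

floor(0.2 * length(link_and_fb))
#[1] 1
floor(0.4 * length(link_and_fb))
#[1] 2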
