R: Fill in 0 in values that are above and below a benchmark

I'm currently writing my master's thesis, and when I ran a regression I found some outliers that I would like to either delete or replace with zero. I have a data frame with company names and their daily returns from 2010 to 2021.

The data frame is called `xsr`. I want to find the outliers, i.e. values above 0.5 or below -0.5. I managed to create a data frame according to this condition: `xsr_short <- xsr[,c(2:214)] <0.5`. Then I tried to pick the false values: `outliers <- subset(xsr_short, xsr_short = FALSE)`, which just gives me back the initial `xsr_short`.

I also tried it with the `select` command: `xsr_short <- select(xsr, c('ABBN SW Equity':'ZWM SW Equity') < 0.5)`. The output is:

Error in `select()`:
! NA/NaN argument
Backtrace:
  1. dplyr::select(xsr, c("ABBN SW Equity":"ZWM SW Equity") < 0.5)
 22. base::.handleSimpleError(`<fn>`, "NA/NaN argument", base::quote("ABBN SW Equity":"ZWM SW Equity"))
 23. rlang (local) h(simpleError(msg, call))
 24. handlers[[1L]](cnd)
Warning messages:
1: In eval_tidy(expr, context_mask) : NAs introduced by coercion
2: In eval_tidy(expr, context_mask) : NAs introduced by coercion

I still need to add the second condition, `> -0.5`, and then delete the values that fall outside this range.

Thank you very much in advance for your help and your time!

It seems like you are less concerned with an actual subset and more with switching out unwanted values in your data while preserving the rest for the regression. In that case, the tidyverse package may be helpful. First, load the package and create this fake dataset:

#### Load Tidyverse ####
library(tidyverse)

#### Make Data Frame ####
data <- data.frame(IV = c("Control","Treatment",
                          "Control","Treatment"),
                   DV = c(-9999,2,4,5555))
data

Which gives you this:

         IV    DV
1   Control -9999
2 Treatment     2
3   Control     4
4 Treatment  5555

From there you can simply use `mutate` and `ifelse` to replace the unwanted values with `NA` (missing) values, saving the result as a new version of the data:

#### Swap Outliers with NA Values ####
clean.data <- data %>% 
  mutate(DV = ifelse(DV < 0,
                     NA,
                     ifelse(DV > 100,
                            NA,
                            DV)))
clean.data

Which gives you this:

         IV DV
1   Control NA
2 Treatment  2
3   Control  4
4 Treatment NA
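
The same idea can probably be carried over to a wide table like `xsr`, where every column except the date holds returns. The following is only a sketch under assumptions: `xsr_demo` is a made-up stand-in for the real data, the `Date` column name is a guess, and the cutoff of 0.5 comes from the question. It uses `across()` from dplyr (version 1.0 or later) to apply one rule to all return columns at once:

```r
library(dplyr)

# Hypothetical stand-in for xsr: a date column plus two return columns.
xsr_demo <- data.frame(
  Date = as.Date(c("2010-01-04", "2010-01-05", "2010-01-06")),
  `ABBN SW Equity` = c(0.01, 0.75, -0.02),
  `ZWM SW Equity`  = c(-0.60, 0.03, 0.04),
  check.names = FALSE  # keep the spaces in the column names
)

# Replace every return whose absolute value exceeds 0.5 with NA,
# leaving the (assumed) Date column untouched.
xsr_clean <- xsr_demo %>%
  mutate(across(-Date, ~ ifelse(abs(.x) > 0.5, NA, .x)))

xsr_clean
```

Testing `abs(.x) > 0.5` covers both conditions (above 0.5 and below -0.5) in a single comparison. Swapping `NA` for `0` in the `ifelse()` call would fill in zeros instead.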

As some others have noted, it's generally bad practice to delete outliers from your data unless you have a defensible reason to do so. So if you do remove them, make sure your thesis includes a justification that explains why you removed the values.
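
If you do go ahead, it may help to count how many values are affected so that the number can be reported alongside the justification. Here is a base R sketch, again with assumptions: `returns_demo` is invented example data, the `Date` column name is a guess, and the 0.5 cutoff and the zero fill come from the question.

```r
# Hypothetical stand-in for xsr: a date column plus two return columns.
returns_demo <- data.frame(
  Date = as.Date(c("2010-01-04", "2010-01-05", "2010-01-06")),
  `ABBN SW Equity` = c(0.01, 0.75, -0.02),
  `ZWM SW Equity`  = c(-0.60, 0.03, 0.04),
  check.names = FALSE  # keep the spaces in the column names
)

# All columns except the (assumed) Date column hold returns.
value_cols <- setdiff(names(returns_demo), "Date")

# Logical matrix: TRUE where a return lies outside [-0.5, 0.5].
is_outlier <- abs(as.matrix(returns_demo[value_cols])) > 0.5
is_outlier[is.na(is_outlier)] <- FALSE  # treat missing returns as non-outliers

# Number of flagged values per company, e.g. for a table in the thesis.
outliers_per_column <- colSums(is_outlier)

# Copy the data and overwrite the flagged cells with 0.
returns_zeroed <- returns_demo
returns_zeroed[value_cols][is_outlier] <- 0

outliers_per_column
returns_zeroed
```

Keeping the original data frame and writing the result to a copy makes it easy to rerun the regression both ways and compare.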
