
Keep columns of type factor using approx on a data frame in R

I have a big data frame with a lot of columns. Some of them are of type double and others are of type factor. I resample the data frame onto a new column "time" using the approx function with method = "constant". After that, all factor columns have been changed to doubles.

For example, my idea looks like this:

time = seq(1, 6, by = 0.1)

df1 <- data.frame(ecuTime = c(2, 4, 6), a = as.factor(c("male", "female", 
                                                   "male")), b = c(1, 3, 5))

df2 <- data.frame(ecuTime = c(1, 3.2, 3.4, 6), c = as.factor(c("car", "car", 
                                                    "bike", "car")), d = c(2, 3, 5, 6))

dfComb <- merge(df1, df2, by = "ecuTime", all = TRUE)

approxData <- cbind.data.frame(time, sapply(dfComb[, names(dfComb)], 
                                        function(y, x, nout) 
                                        approx(x, y, nout, method = "constant", na.rm = FALSE)$y,
                                        x = dfComb$ecuTime, nout = time))

Is it possible to keep the factor columns as factors and the columns of type double as doubles even if I use the approx function?

Edit: I found out that it doesn't make sense to use the approx function on factors. I also don't want to use na.rm = TRUE, because some columns contain a lot of NAs, and replacing them with previous values would make the distributions etc. differ considerably from the original data. Is there an alternative solution that applies approx only to the non-factor columns and then merges the result with the original factor columns? I think it makes sense not to fill the factor columns with prior values and to use only the original values at the resampled times (0.1, 0.2, etc.). After that, everything could be merged.

I am just confused about how to combine df1 and df2 with a resampled time frequency, so my distributions and line plots are completely different from the original data. My final goal is to compare some specific factors within a specific time frame. As it stands, I can't compare different variables because one of them might be NA.

So, I'm not clear on the big picture of what you're trying to get done here, which is fine; I understand the specific question well enough. However, I'm trusting that you're really, really sure this is a good idea. At face value, I'd be pretty worried about doing something resembling arithmetic via the approx() function on the underlying integers of a factor variable (which are totally meaningless). It seems to me like there is probably a "better" (i.e. less hacky) way to get this done, but I'm not in a position to help you do that since your overall goals aren't clear to me.
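To illustrate the concern, here is a minimal sketch (my own toy values, not taken from the question's data) of what the integer codes behind a factor look like and what linear interpolation would do to them. With method = "constant" no in-between values are produced, but the computation still runs on these codes rather than on the labels.

```r
a <- factor(c("male", "female", "male"))
levels(a)       # levels are sorted alphabetically by default: "female" "male"
as.integer(a)   # the underlying codes: 2 1 2

# Linear interpolation between the codes of the first two observations
# (hypothetical time points 2 and 4) gives a value that matches no level
mid <- approx(x = c(2, 4), y = as.integer(a)[1:2], xout = 3)$y
mid             # 1.5
```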

That said, here's one possible road map to do what you want using base R:

  • identify which variables should be factors
  • inside approxData, convert those variables back into factor type
  • remap the levels of the new factor variables based on the corresponding values from df

Code, expanded with an extra factor column (to verify that it runs properly in the case with more than one factor variable):

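time <- 1:6
df <- data.frame(ecuTime = c(2, 4, 6),
                 a = as.factor(c("male", "female", "male")),
                 b = c(1, 3, 5),
                 # as.factor() is needed here: since R 4.0.0, data.frame()
                 # no longer converts strings to factors by default
                 c = as.factor(c("blue", "blue", "yellow")))
str(df)

# approx() coerces each column to numeric, so factors become their integer codes
approxData <- cbind.data.frame(time, sapply(df[, names(df)],
                                            function(y, x, xout)
                                              approx(x, y, xout, method = "constant")$y,
                                            x = df$ecuTime, xout = time))
str(approxData)

# names(df)[...] also works when there is only one factor column
factor_vars <- names(df)[sapply(df, is.factor)]
approxData[, factor_vars] <-
  lapply(factor_vars, function(x) {
    # map the integer codes back to the original labels, keeping all levels
    factor(approxData[[x]],
           levels = seq_along(levels(df[[x]])),
           labels = levels(df[[x]]))
  })

str(approxData)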
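As an alternative that avoids passing factor codes through approx() at all, here is a hedged sketch using findInterval() to look up the last observation at or before each requested time. This is a swapped-in technique, not the approach above; stepData and idx are names introduced only for this illustration, and it assumes ecuTime is sorted and free of NAs.

```r
time <- 1:6
df <- data.frame(ecuTime = c(2, 4, 6),
                 a = as.factor(c("male", "female", "male")),
                 b = c(1, 3, 5))

# Index of the last observation at or before each requested time (0 if none)
idx <- findInterval(time, df$ecuTime)
idx[idx == 0] <- NA  # times before the first observation become NA

# Row indexing keeps each column's type, so factors stay factors
stepData <- df[idx, c("a", "b")]
rownames(stepData) <- NULL
stepData <- cbind(time = time, stepData)
str(stepData)
```

Because rows are selected rather than computed, the factor columns never go through a numeric conversion, and the same step-function values come out for the double columns.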

For the edited question: here's some code to produce a new data frame, dfComb_resample. This data frame has an expanded ecuTime variable, values for a, b, c, d copied from df1 and df2 where appropriate, and NA values everywhere else. (If I missed the mark on what you wanted, let me know.)

time = seq(1, 6, by = 0.1)

df1 <- data.frame(ecuTime = c(2, 4, 6), a = as.factor(c("male", "female", 
                                                        "male")), b = c(1, 3, 5))

df2 <- data.frame(ecuTime = c(1, 3.2, 3.4, 6), c = as.factor(c("car", "car", 
                                                               "bike", "car")), d = c(2, 3, 5, 6))

dfComb_resample <- 
  Reduce(function(x, y) merge(x=x, y=y, by = "ecuTime", all = TRUE),
         list(data.frame(ecuTime = time), df1, df2))

How it works: Reduce() is a shortcut to merge three (or more) data frames in one go in this context. Note that you'd get some unexpected behavior if any of the merged data frames had variables in common besides the ecuTime key, which they don't in this example.
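Two possible refinements, offered as a sketch rather than as part of the answer above. First, merge() matches numeric keys exactly, and a grid built with seq(1, 6, by = 0.1) can differ from a literal such as 3.4 in the last bit; building the grid from integers sidesteps that. Second, approx() can then be applied to the numeric columns only, leaving the factor columns untouched. The names num_vars and dfFilled are introduced here for illustration, and the 0.1 step is taken from the question.

```r
# Build the grid from integers so that e.g. 34 / 10 is the same double as 3.4
time <- seq(10, 60) / 10

df1 <- data.frame(ecuTime = c(2, 4, 6),
                  a = as.factor(c("male", "female", "male")),
                  b = c(1, 3, 5))
df2 <- data.frame(ecuTime = c(1, 3.2, 3.4, 6),
                  c = as.factor(c("car", "car", "bike", "car")),
                  d = c(2, 3, 5, 6))

dfComb_resample <-
  Reduce(function(x, y) merge(x, y, by = "ecuTime", all = TRUE),
         list(data.frame(ecuTime = time), df1, df2))

# Interpolate only the numeric columns; factor columns keep their NAs
num_vars <- setdiff(names(dfComb_resample)[sapply(dfComb_resample, is.numeric)],
                    "ecuTime")
dfFilled <- dfComb_resample
dfFilled[num_vars] <- lapply(dfComb_resample[num_vars], function(y)
  approx(dfComb_resample$ecuTime, y, xout = dfComb_resample$ecuTime,
         method = "constant")$y)
str(dfFilled)
```

Note that this does fill b and d forward from their last observed values, so whether it suits the comparison you have in mind depends on how much that distorts the distributions you care about.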
