
Replace every occurrence of a factor variable in one column with a row from another dataframe in R

Say I have two data frames. One is my 'main' df, and the other holds the values I want to substitute into the main df.

In column cd of dfMain, every time the factor level orange comes up, I want to replace it with the corresponding row from dfReplace (which has a row named orange).

As a result, dfMain gets 3 columns wider, because the cd column goes away and the columns X1, X2, X3, X4 are added.

The key point is that this needs to be as efficient as possible, because my actual data is much, much longer.

Reproducible example:

set.seed(42)
dfMain <- data.frame('av' = sample.int(10, 100, replace = TRUE),
                     'ba' = sample.int(10, 100, replace = TRUE),
                     'cd' = sample(c('orange', 'apple', 'banana', 'strawberry', 'blueberry', 'blackberry'), 100, replace = TRUE),
                     # R >= 4.0 no longer converts strings to factors by default
                     stringsAsFactors = TRUE)

dfReplace <- data.frame('X1' = runif(6),
                        'X2' = runif(6),
                        'X3' = runif(6),
                        'X4' = runif(6))
rownames(dfReplace) <- c('orange', 'apple', 'banana', 'strawberry', 'blueberry', 'blackberry')

I'd suggest first converting the row names to an explicit key column and converting the cd factor to character, and then doing a database-style join, which should be very fast.

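library(dplyr)

# Turn the row names into a regular key column named "cd"
# (rownames_to_column() replaces the deprecated add_rownames())
dfReplace2 <- dfReplace %>%
  tibble::rownames_to_column(var = "cd")

# Convert cd to character so both key columns have the same type, then join
dfMain %>%
  mutate(cd = as.character(cd)) %>%
  left_join(dfReplace2, by = "cd")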

I left the original 'cd' field in place, but it can be removed by appending %>% select(-cd) .
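If you would rather avoid the dplyr dependency, the same lookup can be sketched in base R with match(). This is only an illustration of the idea, not a benchmarked alternative: it rebuilds the question's example data, and the names fruits, idx and result are my own, not from the question.

```r
# Rebuild the example data from the question
set.seed(42)
fruits <- c('orange', 'apple', 'banana', 'strawberry', 'blueberry', 'blackberry')
dfMain <- data.frame(av = sample.int(10, 100, replace = TRUE),
                     ba = sample.int(10, 100, replace = TRUE),
                     cd = sample(fruits, 100, replace = TRUE))
dfReplace <- data.frame(X1 = runif(6), X2 = runif(6), X3 = runif(6), X4 = runif(6))
rownames(dfReplace) <- fruits

# For each row of dfMain, find the position of its cd value among the row names
# (as.character() makes this work whether cd is a factor or a character column)
idx <- match(as.character(dfMain$cd), rownames(dfReplace))

# Drop cd, then bind the looked-up rows next to the remaining columns
result <- cbind(dfMain[setdiff(names(dfMain), "cd")],
                dfReplace[idx, , drop = FALSE])
rownames(result) <- NULL  # discard the repeated fruit row names
```

Here result has the columns av, ba, X1, X2, X3, X4. Any cd value without a matching row name would get NA in idx and therefore NA in the X columns, which is the same behaviour as the left join above.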
