[英]Concatenate two columns of a dataframe in R if all values between them are NA?
I have a data frame that looks like this:我有一个看起来像这样的数据框:
> sample
# A tibble: 6 x 10
Level_1 Level_2 Level_3 Level_4 Level_5 Level_6 Level_7 Level_8 Level_9 Supplier
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <chr>
1 1 2 3 4 8 NA NA NA NA orioles
2 1 2 3 4 9 13 NA NA NA nationals
3 1 2 3 5 10 14 16 18 NA dodgers
4 1 2 3 5 10 14 17 19 NA cardinals
5 1 2 3 6 11 NA NA NA NA giants
6 1 2 3 7 12 15 NA NA NA padres
What I'd like to do is concatenate the Supplier column with any Level column if all the values between them are NA
.如果它们之间的所有值都是
NA
,我想做的是将 Supplier 列与任何 Level 列连接起来。 Another way I was thinking about this was that if the column to the right of the Level column is NA
then to concatenate that column with the supplier column.我正在考虑的另一种方法是,如果 Level 列右侧的列是
NA
,那么将该列与供应商列连接起来。
I was thinking a for loop but I haven't figured out how to implement the logic.我在考虑一个 for 循环,但我还没有弄清楚如何实现逻辑。 The logic I was thinking is something like:
我在想的逻辑是这样的:
for (level in levels) {
if is.na(level n + 1) {
paste0(level, Supplier)
}
else {
level}
}
I could also do a bunch of mutate
calls like this but it seems super repetitive and unnecessary:我也可以像这样进行一堆
mutate
调用,但它似乎超级重复且不必要:
sample %>%
mutate(
Level_5 = ifelse(
is.na(Level_6),
paste0(Supplier, "<br>", Level_5),
Level_5)
)
Here's the dput of the data:这是数据的输入:
structure(list(Level_1 = c(1, 1, 1, 1, 1, 1), Level_2 = c(2,
2, 2, 2, 2, 2), Level_3 = c(3, 3, 3, 3, 3, 3), Level_4 = c(4,
4, 5, 5, 6, 7), Level_5 = c(8, 9, 10, 10, 11, 12), Level_6 = c(NA,
13, 14, 14, NA, 15), Level_7 = c(NA, NA, 16, 17, NA, NA), Level_8 = c(NA,
NA, 18, 19, NA, NA), Level_9 = c(NA, NA, NA, NA, NA, NA), Supplier = c("orioles",
"nationals", "dodgers", "cardinals", "giants", "padres")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
To be honest, I'm not 100% sure about your desired output.老实说,我不能 100% 确定您想要的 output。 Using
dplyr
and tidyr
:使用
dplyr
和tidyr
:
library(tidyr)
library(dplyr)
sample %>%
pivot_longer(cols=starts_with("Level_"), names_prefix="Level_", names_to="level") %>%
drop_na() %>%
group_by(Supplier) %>%
mutate(new_val=ifelse(level==max(level), paste0(Supplier, "<br>", value), value)) %>%
select(-value) %>%
pivot_wider(names_from=level, names_prefix="Level_", values_from=new_val)
returns返回
# A tibble: 6 x 9
# Groups: Supplier [6]
Supplier Level_1 Level_2 Level_3 Level_4 Level_5 Level_6 Level_7 Level_8
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 orioles 1 2 3 4 orioles<br>8 NA NA NA
2 nationals 1 2 3 4 9 nationals<br>13 NA NA
3 dodgers 1 2 3 5 10 14 16 dodgers<br>18
4 cardinals 1 2 3 5 10 14 17 cardinals<br>19
5 giants 1 2 3 6 giants<br>11 NA NA NA
6 padres 1 2 3 7 12 padres<br>15 NA NA
I lost the Level_9 column since it contained only NA
.我丢失了 Level_9 列,因为它只包含
NA
。 You can easily add it again.您可以轻松地再次添加它。
Final Update I realized my mistake on trying to find the max value in every row and replace it with desired concatenated string.最终更新我意识到我试图在每一行中找到最大值并将其替换为所需的连接字符串的错误。 So I came up with another solution which only replaces the last non-
NA
value (it can also be not the max values of the row), given all values are not numeric.所以我想出了另一个解决方案,它只替换最后一个非
NA
值(它也可以不是该行的最大值),因为所有值都不是数字。 So here is my final solution:所以这是我的最终解决方案:
library(dplyr)
library(stringr)
library(purrr)
df %>%
pmap_dfr(., ~ {x <- c(...)[-10][!is.na(c(...)[-10])];
ind <- which(c(...) == x[length(x)]);
replace(c(...), ind[length(ind)], str_c(..10, x[length(x)], sep = "_"))}
)
# A tibble: 6 x 10
Level_1 Level_2 Level_3 Level_4 Level_5 Level_6 Level_7 Level_8 Level_9 Supplier
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 2 3 4 orioles_8 NA NA NA NA orioles
2 1 2 3 4 9 nationals_13 NA NA NA nationals
3 1 2 3 5 10 14 16 dodgers_18 NA dodgers
4 1 2 3 5 10 14 17 cardinals_19 NA cardinals
5 1 2 3 6 giants_11 NA NA NA NA giants
6 1 2 3 7 12 padres_15 NA NA NA padres
Another approach:另一种方法:
library(tidyr)
library(dplyr)
df %>% mutate(across(contains('Level'), ~ as.character(.))) %>%
mutate(across(contains('Level'), ~ coalesce(., Supplier))) %>% select(-Supplier) %>%
mutate(ID = row_number()) %>%
pivot_longer(cols = -ID) %>% group_by(ID) %>%
mutate(value = case_when(duplicated(value) ~ NA_character_, TRUE ~ value)) %>% pivot_wider(names_from = name, values_from = value) %>%
ungroup() %>% select(-ID)
# A tibble: 6 x 9
Level_1 Level_2 Level_3 Level_4 Level_5 Level_6 Level_7 Level_8 Level_9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 2 3 4 8 orioles NA NA NA
2 1 2 3 4 9 13 nationals NA NA
3 1 2 3 5 10 14 16 18 dodgers
4 1 2 3 5 10 14 17 19 cardinals
5 1 2 3 6 11 giants NA NA NA
6 1 2 3 7 12 15 padres NA NA
Very interesting question.非常有趣的问题。 Here is my approach without
pivot_longer
:这是我没有
pivot_longer
的方法:
library(dplyr)
# check is na
ind <- !is.na(df1)
# save vector who fullfill assumption value before first NA
values <- as.vector(tapply(df1[ind], row(df1)[ind], tail, 1))
# bind to dataframe
df2 <- cbind(df, values)
# accomplish the task
df2 %>%
mutate(across(Level_1:Level_9, ~ case_when(. == values ~ str_c(Supplier ,.),
. != values ~ as.character(.)))) %>%
select(-values)
Output: Output:
Level_1 Level_2 Level_3 Level_4 Level_5 Level_6 Level_7 Level_8 Level_9 Supplier
1 1 2 3 4 orioles8 <NA> <NA> <NA> <NA> orioles
2 1 2 3 4 9 nationals13 <NA> <NA> <NA> nationals
3 1 2 3 5 10 14 16 dodgers18 <NA> dodgers
4 1 2 3 5 10 14 17 cardinals19 <NA> cardinals
5 1 2 3 6 giants11 <NA> <NA> <NA> <NA> giants
6 1 2 3 7 12 padres15 <NA> <NA> <NA> padres
Combined approach with c_across
and across
与
c_across
和 cross 相结合across
方法
library(tidyverse)
df %>% rowwise() %>%
mutate(dummy = max(which(!is.na(c_across(starts_with('Level')))))) %>% ungroup() %>%
mutate(across(starts_with('Level_'),
~ifelse(as.numeric(str_remove(cur_column(), 'Level_')) == dummy, paste(Supplier, ., sep = '_'), .)))
# A tibble: 6 x 11
Level_1 Level_2 Level_3 Level_4 Level_5 Level_6 Level_7 Level_8 Level_9 Supplier dummy
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr> <lgl> <chr> <int>
1 1 2 3 4 orioles_8 NA NA NA NA orioles 5
2 1 2 3 4 9 nationals_13 NA NA NA nationals 6
3 1 2 3 5 10 14 16 dodgers_18 NA dodgers 8
4 1 2 3 5 10 14 17 cardinals_19 NA cardinals 8
5 1 2 3 6 giants_11 NA NA NA NA giants 5
6 1 2 3 7 12 padres_15 NA NA NA padres 6
Combining which
strategy used above my friend Anoushiravan's answer can be simplified to:结合我朋友 Anoushiravan 的回答上面使用
which
策略可以简化为:
purrr::pmap_dfr
do thesepurrr::pmap_dfr
里面做这些
n
n
i
i
temp
variable of length n
and having a T
at i
and F
elsewheren
的temp
变量,并在i
和F
处有一个T
replace
to replace i
th variable (using temp
) with desired valuesreplace
将第i
个变量(使用temp
)替换为所需的值df %>%
pmap_dfr(., ~ {n <- ncol(df); i <- max(which(!is.na(c(...)[-n])));
tmp <- rep(F, n); tmp[i] <- T;
replace(c(...), tmp, paste(c(...)[n], c(...)[i], sep = '_'))})
# A tibble: 6 x 10
Level_1 Level_2 Level_3 Level_4 Level_5 Level_6 Level_7 Level_8 Level_9 Supplier
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 2 3 4 orioles_8 NA NA NA NA orioles
2 1 2 3 4 9 nationals_13 NA NA NA nationals
3 1 2 3 5 10 14 16 dodgers_18 NA dodgers
4 1 2 3 5 10 14 17 cardinals_19 NA cardinals
5 1 2 3 6 giants_11 NA NA NA NA giants
6 1 2 3 7 12 padres_15 NA NA NA padres
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.