Create yes/no column based on values in two other columns
I have a dataset that looks like this:
df <- structure(list(ID = 1:10, Region1 = c("Europe", "NA",
"Asia", "NA", "Europe", "NA", "Africa", "NA", "Europe", "North America"), Region2 = c("NA", "Europe",
"NA", "NA", "NA", "Europe",
"NA", "NA", "NA", "NA"
)),
class = "data.frame", row.names = c(NA, -10L))
I want to create a new column called EuropeYN which is either "yes" or "no" depending on whether EITHER of the region columns (Region1 or Region2) contains "Europe". The final data should look like this:
df <- structure(list(ID = 1:10, Region1 = c("Europe", "NA",
"Asia", "NA", "Europe", "NA", "Africa", "NA", "Europe", "North America"), Region2 = c("NA", "Europe",
"NA", "NA", "NA", "Europe",
"NA", "NA", "NA", "NA"
), EuropeYN = c("yes", "yes", "no", "no", "yes", "yes", "no", "no", "yes", "no")),
class = "data.frame", row.names = c(NA, -10L))
I know how to do this if it were just checking whether "Europe" appears in one column, but I have no idea how to do it when checking across multiple columns. This is what I would do if it were just one column:
df$EuropeYN <- ifelse(grepl("Europe", df$Region1), "yes", "no")
Any ideas on the best way to approach this?
A little late but maybe still worth a look:
library(dplyr)
library(stringr)
df %>%
rowwise() %>%
mutate(YN = +any(str_detect(c_across(Region1:Region2), 'Europe')))
# A tibble: 10 x 4
# Rowwise:
      ID Region1       Region2    YN
   <int> <chr>         <chr>   <int>
 1     1 Europe        NA          1
 2     2 NA            Europe      1
 3     3 Asia          NA          0
 4     4 NA            NA          0
 5     5 Europe        NA          1
 6     6 NA            Europe      1
 7     7 Africa        NA          0
 8     8 NA            NA          0
 9     9 Europe        NA          1
10    10 North America NA          0
or, without the unary + (which coerces the logical result to integer), to keep TRUE/FALSE:
df %>%
rowwise() %>%
mutate(YN = any(str_detect(c_across(Region1:Region2), 'Europe')))
# A tibble: 10 x 4
# Rowwise:
      ID Region1       Region2 YN
   <int> <chr>         <chr>   <lgl>
 1     1 Europe        NA      TRUE
 2     2 NA            Europe  TRUE
 3     3 Asia          NA      FALSE
 4     4 NA            NA      FALSE
 5     5 Europe        NA      TRUE
 6     6 NA            Europe  TRUE
 7     7 Africa        NA      FALSE
 8     8 NA            NA      FALSE
 9     9 Europe        NA      TRUE
10    10 North America NA      FALSE
If you have several columns across which you want to mutate, you can use starts_with (or contains or ends_with) to select these columns:
df %>%
rowwise() %>%
mutate(YN = any(str_detect(c_across(starts_with('R')), 'Europe')))
My approach is very similar to yours:
dplyr::mutate(df, EuropeYN = ifelse((Region1 == "Europe" | Region2 == "Europe"), "yes", "no"))
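For comparison, here is a minimal base R sketch that extends the asker's own grepl() line to both columns. It assumes the sample data from the question (rebuilt here with data.frame() so the snippet runs on its own). One difference worth noting: grepl() matches substrings, so a value such as "Eastern Europe" (a hypothetical value, not in the sample data) would count as a match, whereas == requires the whole string to equal "Europe".

```r
# Sample data from the question, rebuilt so this snippet is self-contained
df <- data.frame(
  ID = 1:10,
  Region1 = c("Europe", "NA", "Asia", "NA", "Europe", "NA",
              "Africa", "NA", "Europe", "North America"),
  Region2 = c("NA", "Europe", "NA", "NA", "NA", "Europe",
              "NA", "NA", "NA", "NA"),
  stringsAsFactors = FALSE
)

# Combine one grepl() test per column with the vectorized OR operator
df$EuropeYN <- ifelse(
  grepl("Europe", df$Region1) | grepl("Europe", df$Region2),
  "yes", "no"
)

# Substring match versus exact match on a hypothetical value
substring_match <- grepl("Europe", "Eastern Europe")  # pattern found inside the string
exact_match <- "Eastern Europe" == "Europe"           # whole strings differ
```

Which behaviour is preferable depends on whether partial region names should count as Europe.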
Two ways:
Literally check each of two columns:
ifelse(df$Region1 == "Europe" | df$Region2 == "Europe", "yes", "no")
# [1] "yes" "yes" "no" "no" "yes" "yes" "no" "no" "yes" "no"
This has the advantage of being easier to read (subjective) and very clear.
Select a range of columns and look for equality:
subset(df, select = Region1:Region2) == "Europe"
#    Region1 Region2
# 1     TRUE   FALSE
# 2    FALSE    TRUE
# 3    FALSE   FALSE
# 4    FALSE   FALSE
# 5     TRUE   FALSE
# 6    FALSE    TRUE
# 7    FALSE   FALSE
# 8    FALSE   FALSE
# 9     TRUE   FALSE
# 10   FALSE   FALSE
apply(subset(df, select = Region1:Region2) == "Europe", 1, any)
#     1     2     3     4     5     6     7     8     9    10
#  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE
This allows us to use 1 or more columns.
Either of those can be assigned back into the frame with df$EuropeYN <- ...
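As a sketch of that assignment step, assuming the sample data from the question (rebuilt here so the snippet is self-contained): the first way already yields "yes"/"no", while the second yields a logical vector that can be converted with ifelse(). The names way1, way2, region_cols and has_europe are introduced only for this illustration.

```r
# Sample data from the question, rebuilt so this snippet is self-contained
df <- data.frame(
  ID = 1:10,
  Region1 = c("Europe", "NA", "Asia", "NA", "Europe", "NA",
              "Africa", "NA", "Europe", "North America"),
  Region2 = c("NA", "Europe", "NA", "NA", "NA", "Europe",
              "NA", "NA", "NA", "NA"),
  stringsAsFactors = FALSE
)

# Way 1: explicit test on each column, already a "yes"/"no" vector
way1 <- ifelse(df$Region1 == "Europe" | df$Region2 == "Europe", "yes", "no")

# Way 2: compare a set of columns at once, then reduce each row with any()
region_cols <- c("Region1", "Region2")
has_europe <- unname(apply(df[region_cols] == "Europe", 1, any))
way2 <- ifelse(has_europe, "yes", "no")

# Assign either result back into the data frame
df$EuropeYN <- way2
```

Adding another region column only requires extending region_cols in the second way, whereas the first way needs one more explicit comparison.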
Here is a vectorized base R way.
i <- rowSums(df[grep("Region", names(df))] == "Europe") > 0
df$EuropeYN <- c("no", "yes")[i + 1L]
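The second line relies on an indexing trick that may not be obvious: adding 1L to a logical vector coerces FALSE to 0 and TRUE to 1, so the result is a vector of positions 1 or 2 into c("no", "yes"). A small sketch with a made-up logical vector (flag and labels are illustrative names, not from the answer):

```r
# A hypothetical logical vector standing in for the rowSums(...) > 0 result
flag <- c(TRUE, FALSE, TRUE)

# Logical values are coerced to integer: FALSE -> 0, TRUE -> 1
positions <- flag + 1L

# Position 1 selects "no", position 2 selects "yes"
labels <- c("no", "yes")[positions]
```

This avoids ifelse() entirely, at the cost of being a little less self-explanatory.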
We may use if_any here as a vectorized option in tidyverse
library(dplyr)
library(stringr)
df %>%
mutate(YN = if_any(starts_with("Region"), str_detect, 'Europe'))
   ID       Region1 Region2    YN
1   1        Europe      NA  TRUE
2   2            NA  Europe  TRUE
3   3          Asia      NA FALSE
4   4            NA      NA FALSE
5   5        Europe      NA  TRUE
6   6            NA  Europe  TRUE
7   7        Africa      NA FALSE
8   8            NA      NA FALSE
9   9        Europe      NA  TRUE
10 10 North America      NA FALSE
Or in base R
df$YN <- Reduce(`|`, lapply(df[startsWith(names(df), 'Region')],
`%in%`, 'Europe'))
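One practical difference between %in% and == shows up with real missing values. The sample data in the question stores the string "NA", so both behave the same there; the sketch below assumes a hypothetical vector x that contains a true NA instead.

```r
# Hypothetical vector with a real missing value (not the string "NA")
x <- c("Europe", NA, "Asia")

# == propagates the missing value into the result
eq_result <- x == "Europe"

# %in% never returns NA: a missing value simply does not match
in_result <- x %in% "Europe"
```

Here eq_result has NA in its second position while in_result has FALSE, which makes the %in% version safer to use directly for subsetting.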
NOTE: It is easier to subset with a logical flag instead of "Yes"/"No"
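To illustrate that note, here is a small sketch assuming the sample data from the question (rebuilt so the snippet is self-contained). A logical column can be used directly as a row index and negated with !, while a character column needs an extra comparison each time. The names europe_rows, other_rows and europe_rows_chr are illustrative only.

```r
# Sample data from the question, rebuilt so this snippet is self-contained
df <- data.frame(
  ID = 1:10,
  Region1 = c("Europe", "NA", "Asia", "NA", "Europe", "NA",
              "Africa", "NA", "Europe", "North America"),
  Region2 = c("NA", "Europe", "NA", "NA", "NA", "Europe",
              "NA", "NA", "NA", "NA"),
  stringsAsFactors = FALSE
)

# Logical flag and its character equivalent
df$YN <- df$Region1 == "Europe" | df$Region2 == "Europe"
df$EuropeYN <- ifelse(df$YN, "yes", "no")

# With a logical flag, the column itself is the row index
europe_rows <- df[df$YN, ]
other_rows <- df[!df$YN, ]

# With "yes"/"no", a comparison is needed every time
europe_rows_chr <- df[df$EuropeYN == "yes", ]
```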