How to fill columns in R based on matching parts of one column to values in another data frame
I have two data frames: one with my data (data) and one with a lookup table (lookup). The data includes a column called claims; its cells hold one or more codes identifying the types of legal claims brought in a particular case (each row represents one case). Multiple claim types are separated by semicolons.
The lookup data frame has three columns: code, category, and so_category. The code column lists each unique claim code used in the claims column of data. The category column contains a category I assigned to that kind of claim, and so_category assigns the higher-level category into which that particular category fits.
What I'm trying to do is add a column to data for each category and each so_category, filled with 0 or 1 depending on whether the case contains claims that correspond to that category or so_category.
Below is an example of what my data frames look like:
data
Case  claims
1     wiretap;fdcpa
2     ca_ucl;comlaw
3     tort;comlaw;wiretap;ca_ucl
lookup
code     category     so_category
wiretap  f_wiretap    f_statute
fdcpa    f_con_prot   f_statute
ca_ucl   st_con_prot  st_statute
comlaw   com_law      common_law
tort     com_law      common_law
So what I would like to generate programmatically is something like:
data
Case  claims                      f_stat  st_stat  common_law
1     wiretap;fdcpa               1       0        0
2     ca_ucl;comlaw               0       1        1
3     tort;comlaw;wiretap;ca_ucl  1       1        1
I'm quite new to R and at a loss as to how to do this. Any guidance would be highly appreciated!
In base R, we can first collect all the unique so_category values (all_category) that we need to match against. Then split claims on ;, match each resulting code against code in lookup to get the corresponding so_category, and assign 1 or 0 depending on whether each entry of all_category is present among them.
# every distinct higher-level category becomes one indicator column
all_category <- unique(lookup$so_category)
# per case: split the claims, look up each code's so_category, flag presence
data[all_category] <- t(sapply(strsplit(data$claims, ";"), function(x)
  as.integer(all_category %in% lookup$so_category[match(x, lookup$code)])))
data
#  Case                     claims f_statute st_statute common_law
#1    1              wiretap;fdcpa         1          0          0
#2    2              ca_ucl;comlaw         0          1          1
#3    3 tort;comlaw;wiretap;ca_ucl         1          1          1
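To see what the anonymous function does for a single case, the same steps can be run one at a time. This is only an illustrative sketch: the lookup table is rebuilt inline so the block runs on its own, and the names codes, idx, hits, and flags are made up for the walkthrough.

```r
# Rebuild the lookup table from the question so the example is self-contained
lookup <- data.frame(
  code = c("wiretap", "fdcpa", "ca_ucl", "comlaw", "tort"),
  category = c("f_wiretap", "f_con_prot", "st_con_prot", "com_law", "com_law"),
  so_category = c("f_statute", "f_statute", "st_statute", "common_law", "common_law"),
  stringsAsFactors = FALSE
)
all_category <- unique(lookup$so_category)

# Case 2 from the question (illustrative variable names)
codes <- strsplit("ca_ucl;comlaw", ";")[[1]]  # split the cell into single codes
idx   <- match(codes, lookup$code)            # row position of each code in lookup
hits  <- lookup$so_category[idx]              # higher-level category of each code

# One 0/1 flag per higher-level category, in the order of all_category
flags <- as.integer(all_category %in% hits)
names(flags) <- all_category
flags
```

A code that does not appear in lookup makes match() return NA, so it contributes no category and is silently ignored; whether that is acceptable depends on how clean the claims column is.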
Data
data <- structure(list(Case = 1:3, claims = c("wiretap;fdcpa",
"ca_ucl;comlaw", "tort;comlaw;wiretap;ca_ucl")),
row.names = c(NA, -3L), class = "data.frame")
lookup <- structure(list(code = c("wiretap", "fdcpa", "ca_ucl", "comlaw",
"tort"), category = c("f_wiretap", "f_con_prot", "st_con_prot",
"com_law", "com_law"), so_category = c("f_statute", "f_statute",
"st_statute", "common_law", "common_law")), row.names = c(NA,
-5L), class = "data.frame")
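The question asks for indicator columns for both category and so_category, while the code above only covers so_category. One way to cover both is to wrap the same idea in a small function and call it once per lookup column. The sketch below assumes the example data from the question; add_flags is a hypothetical helper name, not part of base R, and it assumes that no value of category collides with a value of so_category or an existing column name.

```r
data <- data.frame(
  Case = 1:3,
  claims = c("wiretap;fdcpa", "ca_ucl;comlaw", "tort;comlaw;wiretap;ca_ucl"),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  code = c("wiretap", "fdcpa", "ca_ucl", "comlaw", "tort"),
  category = c("f_wiretap", "f_con_prot", "st_con_prot", "com_law", "com_law"),
  so_category = c("f_statute", "f_statute", "st_statute", "common_law", "common_law"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: adds one 0/1 column per distinct value of lookup[[col]]
add_flags <- function(data, lookup, col) {
  values <- unique(lookup[[col]])
  claim_list <- strsplit(data$claims, ";")
  # vapply yields one column per case; flatten and refill row-wise so that
  # each row of the result is one case, even when there is a single value
  present <- vapply(claim_list, function(x) {
    values %in% lookup[[col]][match(x, lookup$code)]
  }, logical(length(values)))
  data[values] <- matrix(as.integer(present), nrow = nrow(data),
                         ncol = length(values), byrow = TRUE)
  data
}

data <- add_flags(data, lookup, "so_category")
data <- add_flags(data, lookup, "category")
data
```

Using vapply instead of sapply fixes the shape of each per-case result, which avoids surprises when the data has only one row or the lookup column has only one distinct value.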
Here is an option with tidyverse: split the claims column at the delimiter ; with separate_rows, do a join (left_join) with the lookup dataset, keep the distinct rows, spread the result to wide format, and join the output back to the original dataset.
library(tidyverse)
data %>%
  separate_rows(claims, sep=";") %>%
  left_join(lookup, by = c("claims" = "code")) %>%
  select(-claims, -category) %>%
  distinct(Case, so_category) %>%
  mutate(val = 1) %>%
  spread(so_category, val, fill = 0) %>%
  right_join(data) %>%
  select(names(data), everything())
#  Case                     claims common_law f_statute st_statute
#1    1              wiretap;fdcpa          0         1          0
#2    2              ca_ucl;comlaw          1         0          1
#3    3 tort;comlaw;wiretap;ca_ucl          1         1          1
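In current tidyr releases spread() is marked as superseded in favor of pivot_wider(), so the same pipeline can also be sketched as follows. This is a variant under assumptions, not the answer's original code: it assumes dplyr and a reasonably recent tidyr are installed (one where values_fill accepts a single value), and the names flags and result are made up for illustration.

```r
library(dplyr)
library(tidyr)

data <- data.frame(
  Case = 1:3,
  claims = c("wiretap;fdcpa", "ca_ucl;comlaw", "tort;comlaw;wiretap;ca_ucl"),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  code = c("wiretap", "fdcpa", "ca_ucl", "comlaw", "tort"),
  category = c("f_wiretap", "f_con_prot", "st_con_prot", "com_law", "com_law"),
  so_category = c("f_statute", "f_statute", "st_statute", "common_law", "common_law"),
  stringsAsFactors = FALSE
)

flags <- data %>%
  separate_rows(claims, sep = ";") %>%            # one row per case/code pair
  left_join(lookup, by = c("claims" = "code")) %>%
  distinct(Case, so_category) %>%                 # one row per case/category pair
  mutate(val = 1L) %>%
  pivot_wider(names_from = so_category, values_from = val, values_fill = 0L)

# Attach the indicator columns to the original rows, matching on Case
result <- left_join(data, flags, by = "Case")
result
```

Joining explicitly with by = "Case" avoids the message that dplyr prints when it has to guess the join column.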