Extract data from multiple columns in an R data frame, then search another
I have a central data frame of information (df3) that I'm trying to subset and add columns to, based on data extracted from several columns of another data frame (df2), which itself comes from a subset of a third (df1). I've got this far by searching help and playing around with various functions, but I have reached an impasse. I do hope you can help.
To begin with, the three data frames are structured as follows:
#df1 - my initial search database
id <- c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8")
yesno <- c("Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No")
city <- c("London", "London", "Paris", "London", "Paris", "New York", "London", "London")
df1 <- cbind(id, yesno, city)
df1 <- as.data.frame(df1)
df1
#df2 - containing the data needed to search df3, but situated across columns
id <- c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8")
twitter <- c("@one","", "@three", "@four", "", "", "@seven", "")
email <- c("", "", "", "add4", "add5","", "add7", "")
mail <- c("", "postcode2", "", "","","","","postcode8")
df2 <- cbind(id, twitter, email, mail)
df2 <- as.data.frame(df2)
df2
#df3 - the central df containing the data I wish to extract
comms <- c("@one", "postcode2", "@three", "@four", "add4", "add5", "six", "@seven", "add7", "postcode2")
target <- c("text1", "text2", "text3", "text4.1", "text4.2", "text5", "text6", "text7.1","text7.2", "text8")
df3 <- cbind(comms,target)
df3 <- as.data.frame(df3)
df3
The commonality between df1 and df2 is found in the id columns. I've so far been able to filter df1 and extract the ids, which I've then used to subset df2.
library(dplyr)

df_search <- df1 %>%
  filter(yesno == "Yes", city == "London")
df_search_ids <- df_search$id
df2_search <- df2 %>%
  filter(id %in% df_search_ids)
df2_search
   id twitter email      mail
1 id1    @one
2 id2               postcode2
3 id4   @four  add4
4 id7  @seven  add7
My problems are: the common data between df2 and df3 are spread across three different columns of df2 (twitter, email and mail); these columns contain blank cells and other extraneous info (e.g. 'I am not on Twitter'); and finally, some of the entries in df2 (such as id4 and id7 above) have more than one entry in df3.
What I would like is to extract all values from the twitter, email and mail columns of df2 for the ids extracted from df1, so that the extracted values can then be used to subset df3, eventually producing a new data frame (result_df) that looks like this:
id_res <- c("id1", "id2", "id4", "id4", "id7", "id7")
comms_res <- c("@one", "postcode2", "@four", "add4", "@seven", "add7")
target_res <- c("text1", "text2", "text4.1", "text4.2", "text7.1", "text7.2")
result_df <- cbind(id_res, comms_res, target_res)
result_df <- as.data.frame(result_df)
result_df
  id_res comms_res target_res
1    id1      @one      text1
2    id2 postcode2      text2
3    id4     @four    text4.1
4    id4      add4    text4.2
5    id7    @seven    text7.1
6    id7      add7    text7.2
This is an action I will be performing a number of times (based on different explorations of df1), so ideally the solution would be reusable.
I hope this is a clear explanation of the issue.
The key is to use tidyr::gather to gather the twitter:mail columns (from your filtered df2_search) as rows under a new column comms, and then filter again to remove the empty "" rows. Your second pipe can then be:
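library(dplyr)
library(tidyr)  # gather() lives in tidyr, not dplyr

result <- df2 %>%
  filter(id %in% df_search_ids) %>%
  gather("source", "comms", twitter:mail) %>%   # stack the three columns into rows
  filter(comms != "") %>%                       # drop the empty cells
  inner_join(df3, by = "comms") %>%             # look up each value in df3
  select(id_res = id, comms_res = comms, target_res = target) %>%
  arrange(id_res)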
The lookup in df3 is then an inner_join by comms, which keeps only the rows matched in both data frames. The rest is formatting the output result.
With your input, this should give:
print(result)
##   id_res comms_res target_res
## 1    id1      @one      text1
## 2    id2 postcode2      text2
## 3    id2 postcode2      text8
## 4    id4     @four    text4.1
## 5    id4      add4    text4.2
## 6    id7    @seven    text7.1
## 7    id7      add7    text7.2
##Warning messages:
##1: attributes are not identical across measure variables; they will be dropped
##2: In inner_join_impl(x, y, by$x, by$y, suffix$x, suffix$y) :
## joining character vector and factor, coercing into character vector
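As far as I know, gather is superseded in recent tidyr releases in favour of pivot_longer, so the same reshape-then-join idea can be sketched with the newer verb. The data frames df2_demo and df3_demo below are hypothetical cut-down copies of the question's data, built with character columns so that no coercion warnings arise; the column names follow the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical cut-down versions of df2 and df3 (character columns, not factors).
df2_demo <- data.frame(
  id      = c("id1", "id2", "id4", "id7"),
  twitter = c("@one", "", "@four", "@seven"),
  email   = c("", "", "add4", "add7"),
  mail    = c("", "postcode2", "", ""),
  stringsAsFactors = FALSE
)
df3_demo <- data.frame(
  comms  = c("@one", "postcode2", "@four", "add4", "@seven", "add7"),
  target = c("text1", "text2", "text4.1", "text4.2", "text7.1", "text7.2"),
  stringsAsFactors = FALSE
)

result_long <- df2_demo %>%
  # one row per (id, source column) pair, with the cell value in comms
  pivot_longer(twitter:mail, names_to = "source", values_to = "comms") %>%
  filter(comms != "") %>%                # drop the empty cells
  inner_join(df3_demo, by = "comms") %>% # keep only values present in df3_demo
  select(id_res = id, comms_res = comms, target_res = target) %>%
  arrange(id_res)

print(result_long)
```

Note that pivot_longer returns a tibble and orders rows by id first rather than by source column, so the row order before arrange can differ from what gather produces.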
Edit to get rid of warnings
As evident above, there are two warnings from the processing: one from gather, the explanation for which is found here, and one from inner_join. A trivial solution to get rid of both of these warnings is to convert the relevant data columns from factors to character vectors. For the warning from gather, the columns twitter, email, and mail from df2 need to be converted, and for the warning from inner_join, the column comms from df3 needs to be converted. This can be done using:
## lapply keeps each column as its own vector (sapply would build a matrix first)
df2[, 2:4] <- lapply(df2[, 2:4], as.character)
df3$comms <- as.character(df3$comms)
before processing.
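A related sketch, using only base R: the factors can be avoided at construction time by calling data.frame directly with stringsAsFactors = FALSE instead of going through cbind, and any factor columns in an existing data frame can be converted in one pass. The names df3_chr, df3_fct and is_fct are hypothetical and used only for illustration.

```r
# Option 1: build the data frame directly so the columns stay character.
df3_chr <- data.frame(
  comms  = c("@one", "postcode2"),
  target = c("text1", "text2"),
  stringsAsFactors = FALSE
)
str(df3_chr)

# Option 2: convert every factor column of an existing data frame.
df3_fct <- data.frame(
  comms  = factor(c("@one", "postcode2")),
  target = factor(c("text1", "text2"))
)
is_fct <- vapply(df3_fct, is.factor, logical(1))          # which columns are factors
df3_fct[is_fct] <- lapply(df3_fct[is_fct], as.character)  # convert only those
str(df3_fct)
```

From R 4.0.0 onwards stringsAsFactors defaults to FALSE, so on a current installation the conversion step may not be needed at all.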
Note that the result$comms_res column is now a character vector instead of a factor with levels from the original df3$comms (actually, even if we did not convert to characters, the result would be a character vector, because inner_join does the coercion for us, as the warning says). This is OK if we don't care to preserve the factor in result. However, if we do care about preserving the set of possible levels from df3$comms in result$comms_res, then we need to first save these levels from df3$comms before converting to characters:
## save these levels before converting to characters
df3.comms.levels <- levels(df3$comms)
df3$comms <- as.character(df3$comms)
and then convert both df3$comms and result$comms_res back to a factor with these levels after processing:
df3$comms <- factor(df3$comms, levels=df3.comms.levels)
result$comms_res <- factor(result$comms_res, levels=df3.comms.levels)
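Since the question mentions repeating this for different explorations of df1, the pipeline could be wrapped in a small function. This is only a sketch: lookup_targets and its argument names are hypothetical, it assumes df2 has the columns id, twitter, email and mail and df3 has comms and target, and it converts to character inside the function so that factor inputs do not trigger the warnings.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper, not part of the original answer.
#   ids: vector of ids selected from df1
#   df2: data frame with columns id, twitter, email, mail
#   df3: data frame with columns comms, target
lookup_targets <- function(ids, df2, df3) {
  df3_chr <- mutate(df3, comms = as.character(comms))
  df2 %>%
    filter(id %in% ids) %>%
    mutate(across(twitter:mail, as.character)) %>%  # avoid the gather warning
    gather("source", "comms", twitter:mail) %>%
    filter(comms != "") %>%
    inner_join(df3_chr, by = "comms") %>%
    select(id_res = id, comms_res = comms, target_res = target) %>%
    arrange(id_res)
}

# Usage on small made-up data:
df2_demo <- data.frame(
  id      = c("id1", "id2", "id3"),
  twitter = c("@one", "", "@three"),
  email   = c("", "add2", ""),
  mail    = c("", "", ""),
  stringsAsFactors = FALSE
)
df3_demo <- data.frame(
  comms  = c("@one", "add2", "@three"),
  target = c("text1", "text2", "text3"),
  stringsAsFactors = FALSE
)
res <- lookup_targets(c("id1", "id2"), df2_demo, df3_demo)
print(res)
```

With the question's data, the call would be lookup_targets(df_search_ids, df2, df3), and a different filter on df1 only changes the ids passed in.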
Hope this helps.