Extract data from multiple columns in an R data frame, then search another

I have a central data frame (df3) that I'm trying to subset and add columns to, based on data extracted from several columns of another data frame (df2), which is itself filtered using a subset of a third (df1). I've got this far by searching the help and experimenting with various functions, but I have reached an impasse. I do hope you can help.

To begin with, the three data frames are structured as follows:

#df1 - my initial search database
id <- c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8")
yesno <- c("Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No")
city <- c("London", "London", "Paris", "London", "Paris", "New York", "London", "London")
df1 <- cbind(id, yesno, city)
df1 <- as.data.frame(df1)
df1

#df2 - containing the data needed to search df3, but situated across columns
id <- c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8")
twitter <- c("@one","", "@three", "@four", "", "", "@seven", "")
email <- c("", "", "", "add4", "add5","", "add7", "")
mail <- c("", "postcode2", "", "","","","","postcode8")
df2 <- cbind(id, twitter, email, mail)
df2 <- as.data.frame(df2)
df2

#df3 - the central df containing the data I wish to extract
comms <- c("@one", "postcode2", "@three", "@four", "add4", "add5", "six", "@seven", "add7", "postcode2")
target <- c("text1", "text2", "text3", "text4.1", "text4.2", "text5", "text6", "text7.1","text7.2", "text8")
df3 <- cbind(comms,target)
df3 <- as.data.frame(df3)
df3

df1 and df2 share the id column. So far I've been able to filter df1 and extract the ids, which I've then used to subset df2.

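   library(dplyr)

   # rows of df1 marked "Yes" and located in London
   df_search <- df1 %>%
     filter(yesno == "Yes", city == "London")

   df_search_ids <- df_search$id

   # keep only the rows of df2 whose id was selected above
   df2_search <- df2 %>%
     filter(id %in% df_search_ids)
   df2_search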

        id twitter email      mail
     1 id1    @one
     2 id2               postcode2
     3 id4   @four  add4
     4 id7  @seven  add7

My problems are: the common data between df2 and df3 are spread across three different columns of df2 (twitter, email and mail); these columns contain blank cells and other extraneous information (e.g. 'I am not on Twitter'); and some of the entries in df2 (such as id4 and id7 above) have more than one entry in df3.

What I would like is to extract all values from the twitter, email and mail columns of df2 for the ids extracted from df1, use those values to subset df3, and end up with a new data frame (result_df) that looks like this:

    id_res <- c("id1", "id2", "id4", "id4", "id7", "id7")
    comms_res <- c("@one", "postcode2", "@four", "add4", "@seven", "add7")
    target_res <- c("text1", "text2", "text4.1", "text4.2", "text7.1", "text7.2")
    result_df <- cbind(id_res, comms_res, target_res)
    result_df <- as.data.frame(result_df)
    result_df

      id_res comms_res target_res
    1    id1      @one      text1
    2    id2  postcode2      text2
    3    id4     @four    text4.1
    4    id4      add4    text4.2
    5    id7    @seven    text7.1
    6    id7      add7    text7.2    

This is an operation I will be performing a number of times (based on different explorations of df1), so ideally it would be reusable.

I hope this is a clear explanation of the issue.

The key is to use tidyr::gather to gather the twitter:mail columns (from your filtered df2_search) into rows under a new column comms, and then filter again to remove the empty "" rows. Your second pipe can then be:

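library(dplyr)
library(tidyr)  # provides gather()

result <- df2 %>% filter(id %in% df_search_ids) %>%
                  # stack twitter, email and mail into one comms column
                  gather("source","comms",twitter:mail) %>%
                  # drop the blank cells
                  filter(comms != "") %>%
                  # look up the matching rows of df3
                  inner_join(df3, by="comms") %>%
                  select(id_res=id,comms_res=comms,target_res=target) %>%
                  arrange(id_res)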

The lookup into df3 is then an inner_join by comms, which keeps only the rows matched in both data frames. The rest formats the output result.
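Since the question asks for something that can be rerun for different selections from df1, the pipeline can be wrapped in a function. The following is only a sketch under some assumptions: the function name lookup_targets and its cols argument are hypothetical, the data frames are built with stringsAsFactors = FALSE, and tidyr::pivot_longer (the newer replacement for gather, available in tidyr 1.0.0 and later) is swapped in for gather.

```r
library(dplyr)
library(tidyr)

# Question data, built as character columns (assumption: no factors)
df1 <- data.frame(
  id    = paste0("id", 1:8),
  yesno = c("Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No"),
  city  = c("London", "London", "Paris", "London", "Paris", "New York", "London", "London"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  id      = paste0("id", 1:8),
  twitter = c("@one", "", "@three", "@four", "", "", "@seven", ""),
  email   = c("", "", "", "add4", "add5", "", "add7", ""),
  mail    = c("", "postcode2", "", "", "", "", "", "postcode8"),
  stringsAsFactors = FALSE
)
df3 <- data.frame(
  comms  = c("@one", "postcode2", "@three", "@four", "add4", "add5", "six",
             "@seven", "add7", "postcode2"),
  target = c("text1", "text2", "text3", "text4.1", "text4.2", "text5", "text6",
             "text7.1", "text7.2", "text8"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up df3 targets for a given set of ids.
# `cols` names the df2 columns that may hold a value found in df3$comms.
lookup_targets <- function(ids, df2, df3, cols = c("twitter", "email", "mail")) {
  df2 %>%
    filter(id %in% ids) %>%
    # coerce to character so the function also works on factor columns
    mutate(across(all_of(cols), as.character)) %>%
    pivot_longer(all_of(cols), names_to = "source", values_to = "comms") %>%
    filter(comms != "") %>%
    inner_join(mutate(df3, comms = as.character(comms)), by = "comms") %>%
    select(id_res = id, comms_res = comms, target_res = target) %>%
    arrange(id_res)
}

ids <- df1 %>% filter(yesno == "Yes", city == "London") %>% pull(id)
res <- lookup_targets(ids, df2, df3)
print(res)
```

Because the final step is an inner join, extraneous values in df2 (such as 'I am not on Twitter') simply find no match in df3 and drop out.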

With your input, this should give:

print(result)
##  id_res comms_res target_res
##1    id1      @one      text1
##2    id2 postcode2      text2
##3    id2 postcode2      text8
##4    id4     @four    text4.1
##5    id4      add4    text4.2
##6    id7    @seven    text7.1
##7    id7      add7    text7.2
##Warning messages:
##1: attributes are not identical across measure variables; they will be dropped 
##2: In inner_join_impl(x, y, by$x, by$y, suffix$x, suffix$y) :
##  joining character vector and factor, coercing into character vector
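The output has one more row than the result sketched in the question: id2 is matched to both text2 and text8, because "postcode2" occurs twice in df3$comms. A quick way to check for such repeated keys before joining is sketched below in base R; the variable name dup_keys is just illustrative.

```r
# df3 as given in the question, built as character columns
comms  <- c("@one", "postcode2", "@three", "@four", "add4", "add5", "six",
            "@seven", "add7", "postcode2")
target <- c("text1", "text2", "text3", "text4.1", "text4.2", "text5", "text6",
            "text7.1", "text7.2", "text8")
df3 <- data.frame(comms, target, stringsAsFactors = FALSE)

# values of comms that appear more than once
dup_keys <- unique(df3$comms[duplicated(df3$comms)])

# every df3 row that shares one of those repeated keys
dup_rows <- df3[df3$comms %in% dup_keys, ]
print(dup_rows)
```

Each repeated key produces one output row per match in the join, so whether this is a data-entry slip or intended is worth confirming in the real data.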

Edit to get rid of warnings

As is evident above, the processing produces two warnings:

  1. The first is from gather: the gathered columns are factors whose levels differ, so their attributes are dropped when they are combined.
  2. The second is from inner_join, which is joining a character column to a factor column.

A simple way to get rid of both warnings is to convert the relevant columns from factors to character vectors. For the warning from gather, the columns twitter, email, and mail of df2 need to be converted; for the one from inner_join, the column comms of df3 needs to be converted. This can be done using:

# lapply keeps one list element per column, even for a single-row data frame
df2[2:4] <- lapply(df2[2:4], as.character)
df3$comms <- as.character(df3$comms)

before processing.
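An alternative is to avoid creating factors in the first place. The factors arise because as.data.frame(cbind(...)) converted strings to factors by default in R versions before 4.0.0; from R 4.0.0 onward the default is stringsAsFactors = FALSE. A minimal base R sketch, using a cut-down two-row version of df2 (the name df2_chr is illustrative):

```r
id      <- c("id1", "id2")
twitter <- c("@one", "")
mail    <- c("", "postcode2")

# Build the data frame directly and keep text columns as character,
# regardless of the R version's default
df2_chr <- data.frame(id, twitter, mail, stringsAsFactors = FALSE)

# Inspect the column classes
col_classes <- sapply(df2_chr, class)
print(col_classes)
```

With character columns throughout, neither the gather warning nor the inner_join warning should appear.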

Note that the result$comms_res column is now a character vector instead of a factor with the levels of the original df3$comms (in fact, even without the conversion the result would be a character vector, because inner_join coerces it for us, as the warning says). This is fine if we don't need to preserve the factor in result. However, if we do want result$comms_res to keep the set of possible levels from df3$comms, then we first need to save them from df3$comms before converting to characters:

## save these levels before converting to characters
df3.comms.levels <- levels(df3$comms)
df3$comms <- as.character(df3$comms)

and then convert both df3$comms and result$comms_res back to factors with these levels after processing:

df3$comms <- factor(df3$comms, levels=df3.comms.levels)
result$comms_res <- factor(result$comms_res, levels=df3.comms.levels)
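To illustrate what passing levels= achieves, here is a small self-contained base R sketch on a toy vector (all names are illustrative): the restored factor keeps every saved level, including ones that no longer occur in the data.

```r
# A toy factor standing in for df3$comms
comms <- factor(c("@one", "add4", "six"))

# Save the levels, then work with plain character values
saved_levels <- levels(comms)
comms_chr <- as.character(comms)

# Pretend the processing kept only two of the values
matched <- comms_chr[comms_chr %in% c("@one", "add4")]

# Restore the factor with the full original level set
restored <- factor(matched, levels = saved_levels)
print(levels(restored))
print(table(restored))
```

The table shows a zero count for the level that dropped out, which is the behaviour you want if later summaries should still list every possible value.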

Hope this helps.
