
Cross-join ids to extract data from other columns within the same R data frame

I have an R data frame like this one (the real data would not be sorted by any column):

ppl <- structure(list(id = c("I0000", "I0001", "I0002", "I0003", "I0004","I0005", "I0006", "I0007", "I0008", "I0009"), Birth_Date = structure(c(NA, 517, -10246, -8723, 2349, -25125, NA, -12141, 2349, NA), class = "Date"), Father_id = c(NA, "I0002", "I0005", "I0037", "I0002", "I0018", "I0056", "I0005", "I0002", "I0005"), Mother_id = c(NA, "I0003", "I0006", "I0038", "I0003", "I0019", "I0057", "I0006", "I0003", "I0006"), marriage = structure(c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, -12119, -12119, NA_real_, NA_real_, NA_real_), class = "Date")), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))

> ppl
# A tibble: 10 x 5
   id    Birth_Date Father_id Mother_id marriage  
   <chr> <date>     <chr>     <chr>     <date>    
 1 I0000 NA         NA        NA        NA        
 2 I0001 1971-06-02 I0002     I0003     NA        
 3 I0002 1941-12-13 I0005     I0006     NA        
 4 I0003 1946-02-13 I0037     I0038     NA        
 5 I0004 1976-06-07 I0002     I0003     NA        
 6 I0005 1901-03-19 I0018     I0019     1936-10-27        
 7 I0006 NA         I0056     I0057     1936-10-27        
 8 I0007 1936-10-05 I0005     I0006     NA        
 9 I0008 1976-06-07 I0002     I0003     NA        
10 I0009 NA         I0005     I0006     NA    

Parent-child relationships are established through the ID columns (id, Father_id and Mother_id).

For each individual (id) without a marriage date, I want to estimate a value for that column based on the Birth_Date of his/her first child (of course this is just an assumption, and for some people Birth_Date is not available).

So, in this example, two individuals who would get a marriage date are I0002 and I0003 (the calculated marriage would be "1971-06-02" in rows 3 and 4, because that is the minimum Birth_Date of the 3 people who have Father_id == 'I0002' and Mother_id == 'I0003': rows 2, 5 and 9).
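To make the rule concrete, the calculation for that one couple can be checked by hand. This is only a minimal sketch in base R: ppl is rebuilt here with data.frame() so the snippet runs on its own, and kids is just an illustrative name.

```r
# Same sample data as above, rebuilt as a plain data.frame
ppl <- data.frame(
  id = sprintf("I%04d", 0:9),
  Birth_Date = as.Date(c(NA, "1971-06-02", "1941-12-13", "1946-02-13", "1976-06-07",
                         "1901-03-19", NA, "1936-10-05", "1976-06-07", NA)),
  Father_id = c(NA, "I0002", "I0005", "I0037", "I0002",
                "I0018", "I0056", "I0005", "I0002", "I0005"),
  Mother_id = c(NA, "I0003", "I0006", "I0038", "I0003",
                "I0019", "I0057", "I0006", "I0003", "I0006"),
  marriage = as.Date(c(NA, NA, NA, NA, NA, "1936-10-27", "1936-10-27", NA, NA, NA)),
  stringsAsFactors = FALSE
)

# Children of the couple I0002 / I0003; which() drops the NA comparisons
kids <- ppl[which(ppl$Father_id == "I0002" & ppl$Mother_id == "I0003"), ]

# Earliest known birth date among those children
first_birth <- min(kids$Birth_Date, na.rm = TRUE)
print(kids$id)
print(first_birth)
```

This is the per-couple value I would like to have filled in for every parent at once, without looping over the rows.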

In the same way, individuals I0005 and I0006 would get the marriage date "1936-10-05", which is the minimum known Birth_Date of their children (I0002, I0007 and I0009, the last of which has NA as Birth_Date). But in this case the children's Birth_Date values should be ignored, because the data frame already has a real marriage date for these individuals ("1936-10-27").

As you can see, the data frame structure must not change (same number of rows and same columns); only the last column gets some of its NA values replaced with a Date value.

Expected result:

> ppl
# A tibble: 10 x 5
   id    Birth_Date Father_id Mother_id marriage  
   <chr> <date>     <chr>     <chr>     <date>    
 1 I0000 NA         NA        NA        NA        
 2 I0001 1971-06-02 I0002     I0003     NA        
 3 I0002 1941-12-13 I0005     I0006     1971-06-02
 4 I0003 1946-02-13 I0037     I0038     1971-06-02
 5 I0004 1976-06-07 I0002     I0003     NA        
 6 I0005 1901-03-19 I0018     I0019     1936-10-27
 7 I0006 NA         I0056     I0057     1936-10-27
 8 I0007 1936-10-05 I0005     I0006     NA        
 9 I0008 1976-06-07 I0002     I0003     NA        
10 I0009 NA         I0005     I0006     NA        

Is it possible to accomplish this without writing a function that iterates over the data frame?

I know there are libraries dealing with joins, like those mentioned here. But I still can't figure out how to use them for this task.

I was thinking of calculating it row by row (one marriage date per iteration), but I guess there must be faster ways to do it. Please elaborate a little in your answer, because I am a complete R newbie. It's not just a matter of making it work, but of understanding how it works.

We can select the row with the minimum Birth_Date for each father and mother pair and join the result with the data frame itself.

library(dplyr)

ppl %>%
   # Keep only rows whose own marriage date is NA
   filter(is.na(marriage)) %>%
   # Group the children by couple (father and mother)
   group_by(Father_id, Mother_id) %>%
   # Keep the child with the earliest known Birth_Date in each couple
   slice(which.min(Birth_Date)) %>%
   # Stack Father_id and Mother_id into one column named value
   tidyr::pivot_longer(cols = c(Father_id, Mother_id)) %>%
   # Rename Birth_Date to marriage and keep it together with value
   select(marriage = Birth_Date, value) %>%
   # Join with the original data frame, matching value to id
   right_join(ppl, by = c('value' = 'id')) %>%
   # Prefer the recorded marriage date (marriage.y) over the estimate (marriage.x)
   mutate(marriage_date = coalesce(marriage.y, marriage.x)) %>%
   # Select only the columns needed
   select(id = value, Birth_Date, Father_id, Mother_id, marriage_date)

   id    Birth_Date Father_id Mother_id marriage_date
   <chr> <date>     <chr>     <chr>     <date>       
 1 I0000 NA         NA        NA        NA           
 2 I0001 1971-06-02 I0002     I0003     NA           
 3 I0002 1941-12-13 I0005     I0006     1971-06-02   
 4 I0003 1946-02-13 I0037     I0038     1971-06-02   
 5 I0004 1976-06-07 I0002     I0003     NA           
 6 I0005 1901-03-19 I0018     I0019     1936-10-27   
 7 I0006 NA         I0056     I0057     1936-10-27   
 8 I0007 1936-10-05 I0005     I0006     NA           
 9 I0008 1976-06-07 I0002     I0003     NA           
10 I0009 NA         I0005     I0006     NA   
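If the original row order and the original column name (marriage) need to be kept, a variant built on left_join() may be easier to follow. The sketch below is an alternative rather than the code above: it groups by each individual parent instead of by couple, and parent_id, role, est_marriage, first_child and result are names made up for this illustration. As far as I know, newer dplyr versions do not guarantee that right_join() returns rows in the order of the right-hand table, which is one reason to start from ppl and join the estimates onto it. The data is rebuilt with data.frame() only so the snippet runs on its own.

```r
library(dplyr)
library(tidyr)

# Same sample data as in the question, rebuilt as a plain data.frame
ppl <- data.frame(
  id = sprintf("I%04d", 0:9),
  Birth_Date = as.Date(c(NA, "1971-06-02", "1941-12-13", "1946-02-13", "1976-06-07",
                         "1901-03-19", NA, "1936-10-05", "1976-06-07", NA)),
  Father_id = c(NA, "I0002", "I0005", "I0037", "I0002",
                "I0018", "I0056", "I0005", "I0002", "I0005"),
  Mother_id = c(NA, "I0003", "I0006", "I0038", "I0003",
                "I0019", "I0057", "I0006", "I0003", "I0006"),
  marriage = as.Date(c(NA, NA, NA, NA, NA, "1936-10-27", "1936-10-27", NA, NA, NA)),
  stringsAsFactors = FALSE
)

# One row per parent: the earliest known birth date among their children
first_child <- ppl %>%
  # Children without a birth date cannot contribute an estimate
  filter(!is.na(Birth_Date)) %>%
  # Stack Father_id and Mother_id so each child appears once per parent
  pivot_longer(c(Father_id, Mother_id),
               names_to = "role", values_to = "parent_id") %>%
  filter(!is.na(parent_id)) %>%
  group_by(parent_id) %>%
  summarise(est_marriage = min(Birth_Date), .groups = "drop")

# Attach the estimate to every person; keep a recorded marriage date if present
result <- ppl %>%
  left_join(first_child, by = c("id" = "parent_id")) %>%
  mutate(marriage = coalesce(marriage, est_marriage)) %>%
  select(-est_marriage)

print(result)
```

Because left_join() starts from ppl, every original row appears exactly once and in its original position, and coalesce() only fills the marriage values that were NA. On the sample data this gives the same marriage column as the expected result in the question.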
