
Combine data frame rows in R based on multiple columns

I have a data frame in R which has one individual per line. Sometimes, individuals appear on two lines, and I would like to combine these lines based on the duplicated ID.

The problem is, each individual has multiple IDs, and when an ID appears twice, it does not necessarily appear in the same column.

Here is an example data frame:

dat <- data.frame(a = c('cat', 'canine', 'feline', 'dog'),
                  b = c('feline', 'puppy', 'meower', 'wolf'),
                  c = c('kitten', 'barker', 'kitty', 'canine'),
                  d = c('shorthair', 'collie', '', ''),
                  e = c(1, 5, 3, 8))

> dat
       a      b      c         d e
1    cat feline kitten shorthair 1
2 canine  puppy barker    collie 5
3 feline meower  kitty           3
4    dog   wolf canine           8

So rows 1 and 3 should be combined, because ID b of row 1 equals ID a of row 3. Similarly, ID a of row 2 equals ID c of row 4, so those rows should be combined as well.

Ideally, the output should look like this.

     a.1    b.1    c.1       d.1 e.1    a.2    b.2    c.2 d.2 e.2
1    cat feline kitten shorthair   1 feline meower  kitty       3
2 canine  puppy barker    collie   5    dog   wolf canine       8

(Note that rows are not combined when the only ID they share is an empty string.)

My thoughts on how this could be done are below, but I'm pretty sure that I've been headed down the wrong path, so they're probably not helpful in solving the problem.

I thought that I could assign a row ID to each row, then melt the data. After that, I could go through it row by row. When I found a row where one of the IDs matched an earlier row (e.g. when one of the row 3 IDs matches one of the row 1 IDs), I would change every instance of the current row's row ID to match the earlier row ID (e.g. all row IDs of 3 would be changed to 1).

Here's the code I've been using:

dat$row.id <- 1:nrow(dat)
library(reshape2)
dat.melt <- melt(dat, id.vars = c('e', 'row.id'))
for (i in 2:nrow(dat.melt)) {
  # This next step is just to ignore the empty values
  if (grepl('^[[:space:]]*$', dat.melt$value[i])) {
    next
  }
  earlier.instance <- dat.melt$row.id[which(dat.melt$value[1:(i-1)] == dat.melt$value[i])]
  if (length(earlier.instance) > 0) {
    earlier.row.id <- earlier.instance[1]
    dat.melt$row.id[dat.melt$row.id == dat.melt$row.id[i]] <- earlier.row.id
  }
}

There are two problems with this approach.

  1. It could be that an ID in row 3 matches row 1, and a different ID in row 5 matches row 3. In this case, the row IDs for both row 3 and row 5 should be changed to 1. This means that it's important to go through the rows sequentially, which has led me to use a for loop rather than an apply function. I know that this is not very R-like, and it is very slow on the large data frame I am working with.
  2. This code produces the output below. There are now multiple rows with identical row.id and variable, so I don't know how to cast it to get the kind of output I showed above. Using dcast here would force me to use an aggregation function.

Output:

   e row.id variable     value
1  1      3        a       cat
2  5      2        a    canine
3  3      3        a    feline
4  8      2        a       dog
5  1      3        b    feline
6  5      2        b     puppy
7  3      3        b    meower
8  8      2        b      wolf
9  1      3        c    kitten
10 5      2        c    barker
11 3      3        c     kitty
12 8      2        c    canine
13 1      3        d shorthair
14 5      2        d    collie
15 3      3        d          
16 8      2        d          

New answer. Had some fun (/frustration) working through this. I'm sure it is not the fastest solution, but it should get you past where my other answer left off. Let me explain:

library(data.table)

dat <- data.table(a = c('cat', 'canine', 'feline', 'dog', 'cat','fido'),
                  b = c('feline', 'puppy', 'meower', 'wolf', 'kitten', 'dog'),
                  c = c('kit', 'barker', 'kitty', 'canine', 'feline','wolf'),
                  d = c('shorthair', 'collie', '', '','',''),
                  e = c(1, 2, 3, 4, 5, 6))

dat[, All := paste(a, b,c),]

Two changes: dat$e is now an index column, so it is just the numeric position of whichever row it is. If e is otherwise important, you can make a new column to replace it.

Below is the first loop. It creates three new columns: FirstMatchingID, SecondMatchingID and ThirdMatchingID. These work as before: they give the index of the earliest row (lowest row number) whose dat$All matches a, b and c respectively.

for(i in 2:nrow(dat)) {
  x <- grepl(dat[i]$a, dat[i-(1:i)]$All)
  y <- max(which(x %in% TRUE))
  dat[i, FirstMatchingID := dat[i-y]$e]

  x2 <- grepl(dat[i]$b, dat[i-(1:i)]$All)
  y2 <- max(which(x2 %in% TRUE))
  dat[i, SecondMatchingID := dat[i-y2]$e]

  x3 <- grepl(dat[i]$c, dat[i-(1:i)]$All)
  y3 <- max(which(x3 %in% TRUE))
  dat[i, ThirdMatchingID := dat[i-y3]$e]

}

Next, we use pmin to find the earliest matching row across the MatchingID columns and store it in its own column. This covers the case where you have a match for a in row 25 and a match for b in row 12; it will give you 12 (I assume this is what you'd want based on your question).

dat$MinID <- pmin(dat$FirstMatchingID, dat$SecondMatchingID, dat$ThirdMatchingID, na.rm = TRUE)
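
To make the behaviour of pmin with NA values concrete, here is a small base R sketch. The three vectors are illustrative stand-ins for the MatchingID columns (hand-written for this example, not output copied from the loop above); the point is only that positions where every input is NA stay NA, while other positions get the smallest non-NA value.

```r
# Illustrative stand-ins for the three MatchingID columns (assumed values)
first  <- c(NA, NA, 1, NA, 1, NA)
second <- c(NA, NA, NA, NA, NA, 4)
third  <- c(NA, NA, NA, 2, 1, 4)

# Element-wise minimum; with na.rm = TRUE an NA is ignored unless
# all inputs are NA at that position
min_id <- pmin(first, second, third, na.rm = TRUE)
print(min_id)
```

Rows with no match in any column therefore keep NA, which is what the final loop relies on to detect "no earlier match".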

Last, this loop will do 3 things, creating a FinalID column with all the matching ID numbers from e:

  1. Where MinID is NA (no matches), it sets FinalID to e.
  2. If MinID is a number, it finds that row (the earliest match) and checks whether that row's MinID is a number; if it is not, there are no earlier matches and it sets FinalID to MinID.
  3. The rows that don't fit the above conditions are your special cases, where row i's earliest match has an earlier match itself. The loop finds that match and sets it as FinalID.

for (i in 1:nrow(dat)) {
  x <- dat[i]$MinID
  if (is.na(dat[i]$MinID)) {
    dat[i, FinalID := e]
  } else if (is.na(dat[x]$MinID)) {
    dat[i, FinalID := MinID]
  } else dat[i, FinalID := dat[x]$MinID]
}

I think this should do it; let me know how it goes. I make no claims about its efficiency or speed.
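
Note that the final loop follows a match only one step back. If chains can be longer (row 7 matches row 6, which matches row 4, which matches row 2), a variant like the base R sketch below might help. The function name resolve_root and the example vector are hypothetical, and the sketch assumes that every non-NA MinID is the position of an earlier row, as in the answer above.

```r
# Hypothetical helper: follow each row's earliest match back to the first
# row of its chain. Assumes min_id[i] is NA or a row position smaller than i.
resolve_root <- function(min_id) {
  final_id <- seq_along(min_id)
  for (i in seq_along(min_id)) {
    if (!is.na(min_id[i])) {
      # The matched row comes earlier, so its final_id is already resolved
      final_id[i] <- final_id[min_id[i]]
    }
  }
  final_id
}

# Assumed example: row 7 -> row 6 -> row 4 -> row 2
min_id <- c(NA, NA, 1, 2, 1, 4, 6)
print(resolve_root(min_id))
```

Because each row reuses the already resolved value of its match, one pass is enough for chains of any length. This still does not cover a later row that links two groups which were separate until then, since MinID keeps only the smallest match.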

Here is an amateur attempt. I think it does some of what you want. I have expanded the data.frame (now a data.table) by two rows to give a better example.

This loop creates a new column, dat$FirstMatchingID, that contains the ID from dat$e for the earliest match. I've only done it to match the first column, dat$a, but I think it could be expanded to b and c easily enough.

library(data.table)

dat <- data.table(a = c('cat', 'canine', 'feline', 'dog', 'feline','puppy'),
                  b = c('feline', 'puppy', 'meower', 'wolf', 'kitten', 'dog'),
                  c = c('kitten', 'barker', 'kitty', 'canine', 'cat','wolf'),
                  d = c('shorthair', 'collie', '', '','',''),
                  e = c(1, 5, 3, 8, 4, 6))

dat[, All := paste(a, b,c),]

for(i in 2:nrow(dat)) {
  print(dat[i])
  x <- grepl(dat[i]$a, dat[i-(1:i)]$All)
  y <- max(which(x %in% TRUE))
  dat[i, FirstMatchingID := dat[i-y]$e]
}

The result:

        a      b      c         d e                 All FirstMatchingID
1:    cat feline kitten shorthair 1   cat feline kitten              NA
2: canine  puppy barker    collie 5 canine puppy barker              NA
3: feline meower  kitty           3 feline meower kitty               1
4:    dog   wolf canine           8     dog wolf canine              NA
5: feline kitten    cat           4   feline kitten cat               1
6:  puppy    dog   wolf           6      puppy dog wolf               5

You then have to find out how you want to combine the rows to get your desired result, but hopefully this helps!
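
As one possible way to do that combining step, here is a base R sketch using reshape. It assumes a grouping vector is already known (here hard-coded as group, which is an assumption for illustration and not computed by the loop above), numbers the rows within each group, and spreads them into a.1, b.1, ..., a.2, b.2, ... columns without needing an aggregation function.

```r
# Example data from the question
dat <- data.frame(a = c('cat', 'canine', 'feline', 'dog'),
                  b = c('feline', 'puppy', 'meower', 'wolf'),
                  c = c('kitten', 'barker', 'kitty', 'canine'),
                  d = c('shorthair', 'collie', '', ''),
                  e = c(1, 5, 3, 8),
                  stringsAsFactors = FALSE)

# Assumed grouping: rows 1 and 3 belong together, as do rows 2 and 4
dat$group <- c(1, 2, 1, 2)

# Position of each row within its group (1 = first row, 2 = second, ...)
dat$pos <- ave(seq_len(nrow(dat)), dat$group, FUN = seq_along)

# One output row per group; every other column is suffixed with its position
wide <- reshape(dat, idvar = "group", timevar = "pos", direction = "wide")
print(wide)
```

Groups with fewer rows than the largest group are padded with NA in the extra columns.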
