Combine data frame rows in R based on multiple columns
I have a data frame in R which has one individual per line. Sometimes, individuals appear on two lines, and I would like to combine these lines based on the duplicated ID.
The problem is, each individual has multiple IDs, and when an ID appears twice, it does not necessarily appear in the same column.
Here is an example data frame:
dat <- data.frame(a = c('cat', 'canine', 'feline', 'dog'),
                  b = c('feline', 'puppy', 'meower', 'wolf'),
                  c = c('kitten', 'barker', 'kitty', 'canine'),
                  d = c('shorthair', 'collie', '', ''),
                  e = c(1, 5, 3, 8))
> dat
       a      b      c         d e
1    cat feline kitten shorthair 1
2 canine  puppy barker    collie 5
3 feline meower  kitty           3
4    dog   wolf canine           8
So rows 1 and 3 should be combined, because ID b of row 1 equals ID a of row 3. Similarly, ID a of row 2 equals ID c of row 4, so those rows should be combined as well.
Ideally, the output should look like this.
     a.1    b.1    c.1       d.1 e.1    a.2    b.2    c.2 d.2 e.2
1    cat feline kitten shorthair   1 feline meower  kitty       3
2 canine  puppy barker    collie   5    dog   wolf canine       8
(Note that the rows were not combined based on sharing IDs that are empty strings.)
My thoughts on how this could be done are below, but I'm pretty sure that I've been headed down the wrong path, so they're probably not helpful in solving the problem.
I thought that I could assign a row ID to each row, then melt the data. After that, I could go through row by row. When I found a row where one of the IDs matched an earlier row (e.g. when one of the row 3 IDs matches one of the row 1 IDs), I would change every instance of the current row's row ID to match the earlier row ID (e.g. all row IDs of 3 would be changed to 1).
Here's the code I've been using:
dat$row.id <- 1:nrow(dat)
library(reshape2)
dat.melt <- melt(dat, id.vars = c('e', 'row.id'))
for (i in 2:nrow(dat.melt)) {
  # This next step is just to ignore the empty values
  if (grepl('^[[:space:]]*$', dat.melt$value[i])) {
    next
  }
  earlier.instance <- dat.melt$row.id[which(dat.melt$value[1:(i-1)] == dat.melt$value[i])]
  if (length(earlier.instance) > 0) {
    earlier.row.id <- earlier.instance[1]
    dat.melt$row.id[dat.melt$row.id == dat.melt$row.id[i]] <- earlier.row.id
  }
}
There are two problems with this approach. The melted data ends up with repeated combinations of row.id and variable, so I don't know how to cast it in order to get the kind of output I showed above. Using dcast here would force the use of an aggregation function.
Output:
   e row.id variable     value
1  1      3        a       cat
2  5      2        a    canine
3  3      3        a    feline
4  8      2        a       dog
5  1      3        b    feline
6  5      2        b     puppy
7  3      3        b    meower
8  8      2        b      wolf
9  1      3        c    kitten
10 5      2        c    barker
11 3      3        c     kitty
12 8      2        c    canine
13 1      3        d shorthair
14 5      2        d    collie
15 3      3        d
16 8      2        d
New answer. Had some fun (/frustration) working through this. I'm sure it is not the fastest solution, but it should get you past where my other answer left off. Let me explain:
library(data.table)
dat <- data.table(a = c('cat', 'canine', 'feline', 'dog', 'cat', 'fido'),
                  b = c('feline', 'puppy', 'meower', 'wolf', 'kitten', 'dog'),
                  c = c('kit', 'barker', 'kitty', 'canine', 'feline', 'wolf'),
                  d = c('shorthair', 'collie', '', '', '', ''),
                  e = c(1, 2, 3, 4, 5, 6))
# All holds the a, b and c IDs of each row pasted into one string
dat[, All := paste(a, b, c)]
Two changes: dat$e is now an index column, so it is just the numeric position of whichever row it is. If e is otherwise important, you can make a new column to replace it.
Below is the first loop. This makes 3 new columns, FirstMatchingID etc. These are like before: they give the index of the earliest (lowest row #) matching dat$All for a, b and c.
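for (i in 2:nrow(dat)) {
  # Earlier rows in reverse order (i-1, i-2, ..., 1): does All contain this row's a?
  x <- grepl(dat[i]$a, dat[i - (1:i)]$All)
  # Largest offset with a match, i.e. the earliest matching row
  y <- max(which(x %in% TRUE))
  dat[i, FirstMatchingID := dat[i - y]$e]

  # Same search for this row's b
  x2 <- grepl(dat[i]$b, dat[i - (1:i)]$All)
  y2 <- max(which(x2 %in% TRUE))
  dat[i, SecondMatchingID := dat[i - y2]$e]

  # Same search for this row's c
  x3 <- grepl(dat[i]$c, dat[i - (1:i)]$All)
  y3 <- max(which(x3 %in% TRUE))
  dat[i, ThirdMatchingID := dat[i - y3]$e]
}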
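One caveat with the loop above: grepl() treats its first argument as a regular expression and matches substrings, so an ID such as 'kit' also matches a row containing 'kitten', and an empty ID matches every row. Below is a minimal base R sketch of an exact-match alternative. The helper name first_exact_match and the small ids data frame are made up for illustration; they are not part of the code above.

```r
# Hypothetical helper: index of the earliest row before row i that contains
# `id` as a whole value in any column; NA if there is none.
first_exact_match <- function(id, ids, i) {
  # Empty or missing IDs never count as a match
  if (i < 2 || is.na(id) || id == "") return(NA_integer_)
  earlier <- as.matrix(ids[seq_len(i - 1), , drop = FALSE])
  # TRUE for each earlier row in which any column equals `id` exactly
  hit <- apply(earlier == id, 1, any, na.rm = TRUE)
  if (any(hit)) unname(which(hit)[1]) else NA_integer_
}

# Made-up example data
ids <- data.frame(a = c("kitten", "kit", "feline"),
                  b = c("feline", "puppy", "meower"),
                  stringsAsFactors = FALSE)

grepl("kit", paste(ids$a[1], ids$b[1]))   # TRUE: substring match on "kitten"
first_exact_match("kit", ids, 2)          # NA: row 1 has no ID equal to "kit"
first_exact_match("feline", ids, 3)       # 1: row 1 has b == "feline"
```

Whether substring or exact matching is wanted depends on the data; the sketch only shows how the two differ.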
Next, we use pmin to find the earliest matching row across the MatchingID columns and store it in its own column. This is in case you have a match for a in row 25 and a match for b in row 12; it will give you 12 (I assume this is what you'd want based on your question).
dat$MinID <- pmin(dat$FirstMatchingID, dat$SecondMatchingID, dat$ThirdMatchingID, na.rm=T)
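As a small illustration of how pmin behaves here, using made-up vectors in place of the three MatchingID columns: it takes the minimum element by element, na.rm = TRUE ignores missing values, and a position where every input is NA stays NA.

```r
# Made-up stand-ins for FirstMatchingID, SecondMatchingID, ThirdMatchingID
first  <- c(NA, 25, NA)
second <- c(NA, 12, 3)
third  <- c(NA, NA, 7)

# Element-wise minimum; NAs are skipped unless all inputs are NA
min_id <- pmin(first, second, third, na.rm = TRUE)
min_id
# [1] NA 12  3
```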
Last, this loop will do 3 things, creating a FinalID column with all the matching ID numbers from e:
1. If MinID is NA (no matches), set FinalID to e.
2. If MinID is a number, find that row (the earliest match) and check if its MinID is a number; if it is not, there are no earlier matches and it sets FinalID to MinID.
3. Otherwise, row i's earliest match has an earlier match itself. This will find that match and set it to FinalID.
for (i in 1:nrow(dat)) {
  x <- dat[i]$MinID
  if (is.na(dat[i]$MinID)) {
    dat[i, FinalID := e]
  } else if (is.na(dat[x]$MinID)) {
    dat[i, FinalID := MinID]
  } else {
    dat[i, FinalID := dat[x]$MinID]
  }
}
I think this should do it; let me know how it goes. I make no claims about its efficiency or speed.
Here is an amateur attempt. I think it does some of what you want. I have expanded the data.frame (now a data.table) by two rows to give a better example.
This loop creates a new column, dat$FirstMatchingID, that contains the ID from dat$e for the earliest match. I've only done it to match the first column, dat$a, but I think it could be expanded to b and c easily enough.
library(data.table)
dat <- data.table(a = c('cat', 'canine', 'feline', 'dog', 'feline', 'puppy'),
                  b = c('feline', 'puppy', 'meower', 'wolf', 'kitten', 'dog'),
                  c = c('kitten', 'barker', 'kitty', 'canine', 'cat', 'wolf'),
                  d = c('shorthair', 'collie', '', '', '', ''),
                  e = c(1, 5, 3, 8, 4, 6))
dat[, All := paste(a, b, c)]
for (i in 2:nrow(dat)) {
  print(dat[i])
  x <- grepl(dat[i]$a, dat[i - (1:i)]$All)
  y <- max(which(x %in% TRUE))
  dat[i, FirstMatchingID := dat[i - y]$e]
}
The result:
        a      b      c         d e                 All FirstMatchingID
1:    cat feline kitten shorthair 1   cat feline kitten              NA
2: canine  puppy barker    collie 5 canine puppy barker              NA
3: feline meower  kitty           3 feline meower kitty               1
4:    dog   wolf canine           8     dog wolf canine              NA
5: feline kitten    cat           4   feline kitten cat               1
6:  puppy    dog   wolf           6      puppy dog wolf               5
You then have to find out how you want to combine the rows to get your desired result, but hopefully this helps!
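One possible way to do that combining step, sketched in base R on the question's original four-row data. It assumes a group label per row has already been worked out; here it is simply hard-coded as group, and the names group, pos, parts and wide are illustrative only.

```r
dat <- data.frame(a = c("cat", "canine", "feline", "dog"),
                  b = c("feline", "puppy", "meower", "wolf"),
                  c = c("kitten", "barker", "kitty", "canine"),
                  d = c("shorthair", "collie", "", ""),
                  e = c(1, 5, 3, 8),
                  stringsAsFactors = FALSE)

# Assumed to be computed already: rows 1 and 3 belong together, rows 2 and 4 too
group <- c(1, 2, 1, 2)

# Position of each row within its group (1 = first occurrence, 2 = second, ...)
pos <- ave(seq_along(group), group, FUN = seq_along)

# One data frame per position, with the position appended to the column names
parts <- lapply(sort(unique(pos)), function(k) {
  part <- dat[pos == k, , drop = FALSE]
  names(part) <- paste(names(part), k, sep = ".")
  part$group <- group[pos == k]
  part
})

# Join the pieces side by side on the group label
wide <- Reduce(function(x, y) merge(x, y, by = "group", all = TRUE), parts)
wide
```

The result has one row per group, with columns a.1 to e.1 for the first row of the group and a.2 to e.2 for the second; groups with fewer rows get NA in the unused columns.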