Assigning labels to rows when there exists some value in a row that is the same as another row
I am doing some data cleaning and I came across this problem. An example of my dataframe (some entries are missing) is:

| A | B | C |
|---|---|---|
| 1 | 7 |   |
| 1 |   | 8 |
| 1 |   | 9 |
| 1 |   |   |
| 2 | 2 | 5 |
| 2 |   | 5 |
| 3 |   |   |
| 4 | 5 | 9 |
|   | 5 |   |

and my expected output dataframe is:

| A | B | C | Label |
|---|---|---|---|
| 1 | 7 |   | 0 |
| 1 |   | 8 | 0 |
| 1 |   | 9 | 0 |
| 1 |   |   | 0 |
| 2 | 2 | 5 | 1 |
| 2 |   | 5 | 1 |
| 3 |   |   | 2 |
| 4 | 5 | 9 | 0 |
|   | 5 |   | 0 |

Are there ways in pandas/dplyr to get this output?

Edit: For example, the value 1 appears in the first 4 rows of column A, so these rows should have the same label. The 3rd row and the 2nd-to-last row both have the value 9 in column C, so they should also have the same label. The last 2 rows both have the value 5 in column B, so they should also share a label. The 3rd-to-last row has no value that matches a value in the same column of any other row, so it gets a unique label.

The code below will accomplish the task. Note that your example conflicts slightly with your comment: since rows 8 and 9 have label 0 and contain a 5 in at least one column, rows 5 and 6 should retroactively be changed to label 0 as well, because they also contain a 5 in at least one column.
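
    library(dplyr)  # provides mutate(), used in the last step

    text="A B C
    1 7 NA
    1 NA 8
    1 NA 9
    1 NA NA
    2 2 5
    2 NA 5
    3 NA NA
    4 5 9
    5 NA NA"
    df=read.table(text=text, header=TRUE)

    # Each element of my_maps holds the set of values seen in one group of rows.
    # Values are compared regardless of which column they come from.
    # Seed the list with the non-missing values of the first row.
    my_maps=list()
    my_maps[[1]]=unique(unlist(df[1,]))
    my_maps[[1]]=my_maps[[1]][!is.na(my_maps[[1]])]

    # Add each later row to every group it shares a value with,
    # or start a new group if it shares none.
    for (i in 2:nrow(df)) {
      currow=df[i,]
      currow=unique(unlist(currow))
      currow=currow[!is.na(currow)]
      boo=TRUE
      for (k in 1:length(my_maps)) {
        if (length(intersect(my_maps[[k]], currow))>0) {
          my_maps[[k]]=union(my_maps[[k]], currow)
          boo=FALSE
        }
      }
      if (boo) {
        my_maps[[length(my_maps)+1]]=currow
      }
    }

    # Merge groups that overlap, so that links are followed transitively.
    i=1
    while (i < length(my_maps)) {
      j=i+1
      while (j <= length(my_maps)) {
        if (length(intersect(my_maps[[i]], my_maps[[j]]))>0) {
          my_maps[[i]]=union(my_maps[[i]], my_maps[[j]])
          my_maps[[j]]=NULL
          # Group i grew and the list shifted, so rescan from i+1
          # instead of skipping the element that moved into position j.
          j=i+1
        } else {
          j=j+1
        }
      }
      i=i+1
    }

    # Label each row with the index of the first group containing one of its values.
    label=c()
    for (i in 1:nrow(df)) {
      for (j in 1:length(my_maps)) {
        if (length((intersect(my_maps[[j]], unique(unlist(df[i,])))))>0) {
          label[i]=j
          break
        }
      }
    }
    df=mutate(df, label=label-1)  # shift to 0-based labels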

Output:

      A  B  C label
    1 1  7 NA     0
    2 1 NA  8     0
    3 1 NA  9     0
    4 1 NA NA     0
    5 2  2  5     0
    6 2 NA  5     0
    7 3 NA NA     1
    8 4  5  9     0
    9 5 NA NA     0
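
For anyone coming from the pandas side, here is a minimal sketch of the same grouping idea in plain Python. It uses no pandas calls, so it only shows the labelling logic, and it swaps the list-of-sets merging above for a union-find structure. The function name `label_rows`, the `same_column` flag, and the list-of-lists input with `None` for missing entries are all assumptions made for illustration. The sample rows follow the question's description, which puts the last row's 5 in column B; the R snippet above reads it into column A instead.

```python
def label_rows(rows, same_column=True):
    """Give rows that share a value (directly or through other rows) the same label.

    rows: list of equal-length lists; None marks a missing entry (assumed input shape).
    same_column: if True, a value only matches the same value in the same column;
                 if False, it matches that value in any column.
    """
    parent = list(range(len(rows)))  # union-find forest, one node per row

    def find(i):
        # Walk up to the root of row i's group, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # key -> index of the first row that contained it
    for i, row in enumerate(rows):
        for col, value in enumerate(row):
            if value is None:
                continue
            key = (col, value) if same_column else value
            if key in first_seen:
                # Union the two groups, keeping the smaller row index as root.
                ra, rb = find(first_seen[key]), find(i)
                if ra != rb:
                    parent[max(ra, rb)] = min(ra, rb)
            else:
                first_seen[key] = i

    # Number the groups 0, 1, 2, ... in order of first appearance.
    labels, root_to_label = [], {}
    for i in range(len(rows)):
        root = find(i)
        labels.append(root_to_label.setdefault(root, len(root_to_label)))
    return labels


# Sample data as described in the question (None = missing entry).
rows = [
    [1, 7, None],
    [1, None, 8],
    [1, None, 9],
    [1, None, None],
    [2, 2, 5],
    [2, None, 5],
    [3, None, None],
    [4, 5, 9],
    [None, 5, None],
]

print(label_rows(rows))                     # [0, 0, 0, 0, 1, 1, 2, 0, 0]
print(label_rows(rows, same_column=False))  # [0, 0, 0, 0, 0, 0, 1, 0, 0]
```

With `same_column=True` a value only links rows when it appears in the same column, which reproduces the labels in the question's expected output. With `same_column=False` a value links rows regardless of column, which is how the R code above compares values, and it gives the same labels as the R output.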