
Assigning labels to rows when there exists some value in a row that is the same as another row

I am doing some data cleaning and I came across this problem. An example of my dataframe (some entries are missing) is:

A B C
1 7
1 8
1 9
1
2 2 5
2 5
3
4 5 9
5

and my expected output dataframe is:

A B C Label
1 7 0
1 8 0
1 9 0
1 0
2 2 5 1
2 5 1
3 2
4 5 9 0
5 0

Are there ways in pandas/dplyr to get this output?

Edit: For example, the value 1 appears in the first 4 rows of A, so these rows should have the same label. The 3rd row and the 2nd-last row have the value 9 in column C, so they should also have the same label. The last 2 rows have the value 5 in column B, so they should also have the same label. The 3rd-last row has no value that matches a value in the same column of any other row, so it gets a unique label.
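The rule in the edit (rows that share a value in the same column belong together, and the relation chains through intermediate rows) amounts to finding connected components. Below is a minimal sketch in plain Python using a union-find structure. The function name `label_rows`, the use of `None` for missing entries, and the placement of the 7 in column B are assumptions made for illustration; the final 5 is placed in column B because the edit says so.

```python
def label_rows(rows):
    """Give the same label to rows that share a value in the same column."""
    parent = list(range(len(rows)))

    def find(i):
        # Walk up to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (column index, value) -> first row holding that value
    for i, row in enumerate(rows):
        for col, value in enumerate(row):
            if value is None:  # missing entries never link rows
                continue
            j = first_seen.setdefault((col, value), i)
            parent[find(i)] = find(j)  # union the two groups

    # Number the groups in order of first appearance.
    labels, ids = [], {}
    for i in range(len(rows)):
        labels.append(ids.setdefault(find(i), len(ids)))
    return labels


# Columns A, B, C; None marks a missing entry (cell placement is assumed).
rows = [
    (1, 7, None),
    (1, None, 8),
    (1, None, 9),
    (1, None, None),
    (2, 2, 5),
    (2, None, 5),
    (3, None, None),
    (4, 5, 9),
    (None, 5, None),
]
labels = label_rows(rows)
print(labels)  # [0, 0, 0, 0, 1, 1, 2, 0, 0]
```

With pandas, the rows could presumably be taken from `df.itertuples(index=False)` once missing entries have been converted to `None`, and the result assigned back as a new column.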

This will accomplish the task. Your example conflicts slightly with your comment. Since rows 8 and 9 have label 0 and contain a 5 in at least one column, rows 5 and 6 should retroactively be changed to label 0, because they also contain a 5 in at least one column.

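# Example data; NA marks a missing entry
text="A B C
1 7 NA
1 NA 8
1 NA 9
1 NA NA
2 2 5
2 NA 5
3 NA NA
4 5 9
5 NA NA"
df=read.table(text=text, header=TRUE)

# my_maps[[k]] holds the set of values seen so far in group k
my_maps=list()
my_maps[[1]]=unique(unlist(df[1,]))
my_maps[[1]]=my_maps[[1]][!is.na(my_maps[[1]])]
for (i in 2:nrow(df)) {
  # Distinct non-missing values of row i, regardless of column
  currow=df[i,]
  currow=unique(unlist(currow))
  currow=currow[!is.na(currow)]
  boo=TRUE  # stays TRUE if the row shares no value with any existing group
  for (k in 1:length(my_maps)) {
    if (length(intersect(my_maps[[k]], currow))>0) {
      my_maps[[k]]=union(my_maps[[k]], currow)
      boo=FALSE
    }
  }
  if (boo) {
    # No overlap: start a new group
    my_maps[[length(my_maps)+1]]=currow
  }
}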

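# Merge groups that share a value
i=1
while (i < length(my_maps)) {
  j=i+1
  while (j <= length(my_maps)) {
    if (length(intersect(my_maps[[i]], my_maps[[j]]))>0) {
      my_maps[[i]]=union(my_maps[[i]], my_maps[[j]])
      my_maps[[j]]=NULL
      # Group i grew and the list shifted down, so rescan from i+1
      j=i+1
    } else {
      j=j+1
    }
  }
  i=i+1
}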

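# Label each row with the index of the first group containing one of its values
label=c()
for (i in 1:nrow(df)) {
  for (j in 1:length(my_maps)) {
    if (length((intersect(my_maps[[j]], unique(unlist(df[i,])))))>0) {
      label[i]=j
      break
    }
  }
}

library(dplyr)  # provides mutate()
df=mutate(df, label=label-1)  # shift to 0-based labels
df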

  A  B  C label
1 1  7 NA     0
2 1 NA  8     0
3 1 NA  9     0
4 1 NA NA     0
5 2  2  5     0
6 2 NA  5     0
7 3 NA NA     1
8 4  5  9     0
9 5 NA NA     0
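For comparison with the pandas side of the question, the same idea can be sketched in plain Python. Like the R code above, it treats two rows as related when they share a value in any column, which is why rows 5 and 6 end up with label 0. The function name `label_rows_any_column` and the use of `None` for missing entries are assumptions made for illustration; the data mirrors the `text` block above.

```python
def label_rows_any_column(rows):
    """Give the same label to rows that share a value, regardless of column."""
    groups = []  # list of (set of values, list of row indices); sets stay disjoint
    for i, row in enumerate(rows):
        values = {v for v in row if v is not None}
        members = [i]
        kept = []
        for group_values, group_rows in groups:
            if group_values & values:
                # Absorb every existing group that shares a value with this row.
                values |= group_values
                members = group_rows + members
            else:
                kept.append((group_values, group_rows))
        groups = kept + [(values, members)]

    # Number the groups by the position of their earliest row.
    labels = [None] * len(rows)
    ordered = sorted(groups, key=lambda g: min(g[1]))
    for label, (_, group_rows) in enumerate(ordered):
        for i in group_rows:
            labels[i] = label
    return labels


# Same data as the R example; None stands in for NA.
rows_any = [
    (1, 7, None),
    (1, None, 8),
    (1, None, 9),
    (1, None, None),
    (2, 2, 5),
    (2, None, 5),
    (3, None, None),
    (4, 5, 9),
    (5, None, None),
]
labels_any = label_rows_any_column(rows_any)
print(labels_any)  # [0, 0, 0, 0, 0, 0, 1, 0, 0]
```

Because each row absorbs every group it overlaps as soon as it is read, no separate merge pass is needed afterwards.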


