
Identify groups of repeat rows and preserve group order

I am trying to clean up a spreadsheet of patient data. Unfortunately, rows were repeated at random, so each patient's block of rows appears several times in a row as repeat "chunks". I need to remove the repeat chunks while preserving the original row order.

Here is a sample:

+---------+-----+----------+
| patient | age | children |
+---------+-----+----------+
| x       | 30  | g        |
| x       | 30  | b        |
| x       | 30  | g        |
| x       | 30  | b        |
| x       | 30  | g        |
| x       | 30  | b        |
| y       | 25  | g        |
| y       | 25  | b        |
| y       | 25  | b        |
| y       | 25  | g        |
| y       | 25  | b        |
| y       | 25  | b        |
+---------+-----+----------+

As you can see, the patient "x" chunk (with 2 children) appears three times, and the patient "y" chunk (with 3 children) appears twice. The number of repeats is random.

Here is my goal. It is important that the order of the children is preserved:

+---------+-----+----------+
| patient | age | children |
+---------+-----+----------+
| x       | 30  | g        |
| x       | 30  | b        |
| y       | 25  | g        |
| y       | 25  | b        |
| y       | 25  | b        |
+---------+-----+----------+

I first tried this in Excel:

Step 1: I gave every row a unique identifier, to preserve the order of the children.
Step 2: I tried to remove duplicates, but this was a problem for patient "y", who has two children with the same value ("b"): the final table removed one of them.

I usually do my analysis in R, so a dplyr solution would be great if anyone could suggest one.

Beyond the following, I'm lost. Is there a way to recognize unique groups?

dat %>% group_by(patient)

The distinct() function in dplyr might be your best bet, e.g.:

dat %>% distinct()

You can find more information on identifying and removing duplicate data in R by reading this blog post.
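One caveat: on the sample data, a plain distinct() keeps only one of the two identical "y, 25, b" rows, which is the same problem as the Excel attempt. A possible alternative is to find, for each patient, the shortest block of rows that reproduces the whole group when repeated, and keep only that first block. The following is a minimal sketch, not a definitive solution. It assumes that each patient's rows are contiguous, that every chunk is repeated in full, and that comparing the `children` column alone is enough to detect the repetition. The helper name `smallest_period` is made up for this illustration.

```r
library(dplyr)

# Sample data from the question
dat <- data.frame(
  patient  = rep(c("x", "y"), each = 6),
  age      = rep(c(30, 25), each = 6),
  children = c("g", "b", "g", "b", "g", "b",
               "g", "b", "b", "g", "b", "b")
)

# Hypothetical helper: length of the shortest leading block that,
# repeated end to end, reproduces the vector v exactly.
smallest_period <- function(v) {
  n <- length(v)
  for (p in seq_len(n)) {
    if (n %% p == 0 && identical(v, rep(v[seq_len(p)], times = n / p))) {
      return(p)
    }
  }
  n
}

# Keep only the first block of each patient; filter() keeps row order.
result <- dat %>%
  group_by(patient) %>%
  filter(row_number() <= smallest_period(children)) %>%
  ungroup()

print(result)
```

Note that this approach cannot tell a genuine family of identical children apart from a repeated chunk: a patient whose rows are "g", "g" would be reduced to a single "g" row. If that case can occur in the real data, the repetition would have to be identified some other way.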
