Finding unique values between rows of data frame and replacing them (R)

I have nested data, with ID numbers for within-cluster and cluster-level observations. Let's call them L1IDs and L2IDs.

L1ID <- c(1,2,3,4,5,6)
L2ID <- c(11,11,22,22,33,33)

For both levels I have a number of variables. We'll call them L1Xs and L2Xs.

L1X1 <- rnorm(6,3,1.1)
L1X2 <- rnorm(6,0,.7)
L2X1 <- c(0,1,1,1,0,0)
L2X2 <- c('Blue','Blue','Red','Red','Green','Red')

Combining the vectors into a data frame:

df <- data.frame(L1ID,L2ID,L1X1,L1X2,L2X1,L2X2)
df

I have a problem. The values for L2IDs 11 and 33 are not identical within each cluster. ID 11 has a 1 for the 2nd entry under L2X1 when it should be 0, and ID 33 has Red in the last entry for L2X2 when it should be Green.

L1X values should differ within a cluster, but the L2X values should not. I need a way to search a large data set by L2ID and find column values that are not identical, then replace them with a chosen value. Ideally, the result would be a data frame where each L2ID is a single row and each column is a logical vector that says TRUE or FALSE depending on whether all values in that column match for that L2ID. I would then replace the mismatched values with a single value. So, for ID 11, I need to be able to see that L2X1 does not match for all subjects clustered within it, so that I can replace the 1 with a 0, and that the L2X2 values all match.

Does that make sense?

My actual dataset (licensed access, so I cannot share it) is rather large, and manually searching it for values that do not match is a pain.

So far, my approach has been to drop all L1X variables, use dplyr's distinct() function to reduce the rows to unique combinations of the L2X variables (each L2ID typically has 2 unique combinations), and then search manually for discrepancies. Often it's a decimal point in the wrong place.

Update:

To make these sample data more representative of what I am working with, I changed L2X2 to a character vector and added a 3rd L2ID. Also, I have nearly 200 columns and 9,000 L2IDs (and since most are doubled, that comes to about 18,000 observations). I'm trying to find a way to check whether values match without manually specifying each column. I tried something like the following:

df %>% group_by(L2ID) %>% sapply(identical())

But I have never used the identical() function in base R, so this didn't work. I'm still working through what to do next. I appreciate the responses so far; I'm going to keep working through this as we go.

I make no promises on performance, but this is one solution, which takes advantage of the rle (run length encoding) function in R. This, of course, assumes that the example data you provided properly implies that the value should be replaced with the most common value in that group.

> L1ID <- c(1,2,3,4,5,6)
> L2ID <- c(11,11,11,22,22,22)
> L1X1 <- rnorm(6,3,1.1)
> L1X2 <- rnorm(6,0,.7)
> L2X1 <- c(0,0,1,1,1,1)
> L2X2 <- c(13,13,13,8,8,9)
> df <- data.frame(L1ID,L2ID,L1X1,L1X2,L2X1,L2X2)
> df
  L1ID L2ID      L1X1         L1X2 L2X1 L2X2
1    1   11 1.9155828  0.287683782    0   13
2    2   11 2.8383669 -0.693942886    0   13
3    3   11 4.7517203  0.419193550    1   13
4    4   22 2.0092141  0.002223136    1    8
5    5   22 1.2546399 -0.457323727    1    8
6    6   22 0.8622906  0.255975868    1    9

> df %>%
     group_by(L2ID) %>%
     mutate(L2X1_r = rle(L2X1)$values[rle(L2X1)$lengths == max(rle(L2X1)$lengths)],
            L2X2_r = rle(L2X2)$values[rle(L2X2)$lengths == max(rle(L2X2)$lengths)]) %>%
     ungroup()
# A tibble: 6 x 8
   L1ID  L2ID      L1X1         L1X2  L2X1  L2X2 L2X1_r L2X2_r
  <dbl> <dbl>     <dbl>        <dbl> <dbl> <dbl>  <dbl>  <dbl>
1     1    11 1.9155828  0.287683782     0    13      0     13
2     2    11 2.8383669 -0.693942886     0    13      0     13
3     3    11 4.7517203  0.419193550     1    13      0     13
4     4    22 2.0092141  0.002223136     1     8      1      8
5     5    22 1.2546399 -0.457323727     1     8      1      8
6     6    22 0.8622906  0.255975868     1     9      1      8
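
For anyone unfamiliar with rle, here is a small sketch of what it returns for a single group's column and how the subsetting above picks a value. The vectors x_sorted and x_mixed are made up for illustration and are not taken from the question's data.

```r
# rle() encodes a vector as runs of consecutive equal values.
x_sorted <- c(0, 0, 1)            # the majority value forms one long run
r_sorted <- rle(x_sorted)
r_sorted$values                   # the value of each run
r_sorted$lengths                  # how long each run is

# Same selection as in the mutate() call: keep the value(s) of the longest run.
picked <- r_sorted$values[r_sorted$lengths == max(r_sorted$lengths)]

# When equal values are not adjacent, every run has length 1,
# so the selection no longer singles out one value.
x_mixed <- c(0, 1, 0)
r_mixed <- rle(x_mixed)
picked_mixed <- r_mixed$values[r_mixed$lengths == max(r_mixed$lengths)]
```

In the first case picked is a single value, while in the second case picked_mixed contains every element of the vector, so the approach depends on matching values sitting next to each other within a group.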

Update

Based on the comments and the updated question, I've realized that rle won't work in general because it assumes the "majority" value forms the longest run of consecutive values. The approach below fixes this issue and also introduces a way to avoid specifying every column to be mutated manually.

> L1ID <- c(1,2,3,4,5,6)
> L2ID <- c(11,11,22,22,33,33)
> L1X1 <- rnorm(6,3,1.1)
> L1X2 <- rnorm(6,0,.7)
> L2X1 <- c(0,1,1,1,0,0)
> L2X2 <- c('Blue','Blue','Red','Red','Green','Red')
> df <- data.frame(L1ID,L2ID,L1X1,L1X2,L2X1,L2X2, stringsAsFactors=F)
> df
  L1ID L2ID     L1X1        L1X2 L2X1  L2X2
1    1   11 4.058659  0.12423215    0  Blue
2    2   11 2.922632  0.30954205    1  Blue
3    3   22 2.719407 -0.33382402    1   Red
4    4   22 1.981046 -0.63617811    1   Red
5    5   33 2.570058 -1.39886373    0 Green
6    6   33 4.471551 -0.05489082    0   Red

> replace_with_right_value = function(col) {
+     tbl = table(col)
+     names(tbl)[tbl == max(tbl)]
+ }

> df %>%
     group_by(L2ID) %>%
     mutate_at(vars(matches('L2X')), replace_with_right_value) %>%
     ungroup()
# A tibble: 6 x 6
   L1ID  L2ID     L1X1        L1X2  L2X1  L2X2
  <dbl> <dbl>    <dbl>       <dbl> <chr> <chr>
1     1    11 4.058659  0.12423215     0  Blue
2     2    11 2.922632  0.30954205     1  Blue
3     3    22 2.719407 -0.33382402     1   Red
4     4    22 1.981046 -0.63617811     1   Red
5     5    33 2.570058 -1.39886373     0 Green
6     6    33 4.471551 -0.05489082     0   Red

The replace_with_right_value function takes in a column and returns the most common element in that vector (as a character string, since it is read from the names of the table, which is why L2X1 shows up as <chr> in the output). mutate_at allows you to specify which columns to select, which is done via vars(matches('L2X')). If your columns do not follow this pattern, you'll need to modify that string a bit. matches accepts a regular expression, which should prove very helpful here. In the case of L2ID, there is not enough information in the question or the data to determine which value to choose for L2X1 when L2ID == 11 or for L2X2 when L2ID == 33. As a result, it returns both. To force it to choose a value, such as the first one, change the function to return names(tbl)[tbl == max(tbl)][1].
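
As a rough sketch of that idea, the variant below always returns a single value and keeps the column's original type. It assumes dplyr 1.0 or later (for across()), and the helper name most_common is made up for this example. Ties are broken by whichever value appears first in the group, which is an arbitrary rule and may not be the right one for your data.

```r
library(dplyr)

# Sample data from the updated question (L1X columns omitted for brevity).
df <- data.frame(
  L1ID = 1:6,
  L2ID = c(11, 11, 22, 22, 33, 33),
  L2X1 = c(0, 1, 1, 1, 0, 0),
  L2X2 = c("Blue", "Blue", "Red", "Red", "Green", "Red"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent value in a vector.
# Ties go to the value that appears first; the input type is preserved.
most_common <- function(col) {
  uniq <- unique(col)                     # distinct values, in order of appearance
  counts <- tabulate(match(col, uniq))    # how often each distinct value occurs
  uniq[which.max(counts)]                 # which.max() returns the first maximum
}

fixed <- df %>%
  group_by(L2ID) %>%
  mutate(across(starts_with("L2X"), most_common)) %>%
  ungroup()

fixed
```

With two rows per cluster every mismatch is a tie, so the "first value wins" rule is doing all the work here. On this sample it happens to pick 0 for L2ID 11 and Green for L2ID 33, but on real data you would likely want to review the flagged clusters before overwriting anything.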

Here we check whether L2X1 is consistent within each L2ID. You can easily add another column using this logic to check L2X2 as well. We simply check whether the min and max values of L2X1 within each L2ID are equal; if those values are not equal, we replace them with the min value in L2X1_Fixed.

df %>% group_by(L2ID) %>% mutate(Test= ifelse(min(L2X1)==max(L2X1), TRUE,FALSE)) %>%
      mutate(L2X1_Fixed = ifelse(Test ==FALSE, min(L2X1), L2X1))

# A tibble: 6 x 8
# Groups:   L2ID [2]
   L1ID  L2ID     L1X1        L1X2  L2X1  L2X2  Test L2X1_Fixed
  <dbl> <dbl>    <dbl>       <dbl> <dbl> <dbl> <lgl>      <dbl>
1     1    11 2.355470 -1.53195614     0    13 FALSE          0
2     2    11 3.784859  0.20900278     0    13 FALSE          0
3     3    11 3.339077 -0.19772481     1    13 FALSE          0
4     4    22 2.512764  0.18222493     1     8  TRUE          1
5     5    22 1.176079  0.04175856     1     8  TRUE          1
6     6    22 3.688449 -0.42174624     1     9  TRUE          1
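
The same kind of check can be extended to every L2X column at once, giving the one-row-per-L2ID table of TRUE/FALSE flags that the question describes. This is only a sketch: it assumes dplyr 1.0 or later (for across()), assumes the cluster-level columns all start with "L2X", and uses the sample data from the updated question rather than the data printed above.

```r
library(dplyr)

# Sample data from the updated question (L1X columns omitted for brevity).
df <- data.frame(
  L1ID = 1:6,
  L2ID = c(11, 11, 22, 22, 33, 33),
  L2X1 = c(0, 1, 1, 1, 0, 0),
  L2X2 = c("Blue", "Blue", "Red", "Red", "Green", "Red"),
  stringsAsFactors = FALSE
)

# One row per L2ID; each L2X column becomes TRUE when all of that
# cluster's values agree and FALSE when they do not.
consistency <- df %>%
  group_by(L2ID) %>%
  summarise(across(starts_with("L2X"), ~ n_distinct(.x) == 1))

consistency
```

Note that n_distinct() counts NA as a value of its own by default, so a cluster with one missing entry is flagged as inconsistent. Whether that is what you want depends on how missing values should be treated in your data.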
