Subset a tibble by a smaller tibble

I have two tibbles:

data
A tibble: 6,358,584 x 3
Date     Name       Key
<date>  <chr>      <chr>

treated_group
A tibble: 6,051 x 1  
 Key
 <chr>

The key identifies my treated group, and I would like to subset the bigger tibble to all treated objects. However, when I use filter:

data %>% filter(Key == treated_group)

I run into the error:

Error in filter_impl(.data, quo) : Result must have length 6358584, not 6051

I recognize that I can use filter only with a 1x1 value, so I considered a workaround where I loop through the rows of treated_group and filter the data for every row, but this is very inefficient and I would like to stay within the dplyr framework.

Any hint and help is appreciated!

head(data)
#> # A tibble: 6 x 3
#>   TIMESTAMP_UTC ENTITY_NAME ENS_KEY                         
#>   <date>        <chr>       <chr>                           
#> 1 2000-01-04    3M Co.      E73F64B685D3E70AFE8DFC37C33825F7
#> 2 2000-01-04    3M Co.      62D1EE4BF4DF6EDD38F95E4033B4E687
#> 3 2000-01-05    3M Co.      24EFCCD1828DDBB164A7CDED15696EC9
#> 4 2000-01-05    3M Co.      62D1EE4BF4DF6EDD38F95E4033B4E687
#> 5 2000-01-10    3M Co.      BF24EB30E19607DD73C0BC51F9EF2DF4
#> 6 2000-01-10    3M Co.      940F168DB3203A028350BC4989EBDE17
head(treated_data)
#> # A tibble: 6 x 1
#>   ENS_KEY                         
#>   <chr>                           
#> 1 2CDDC73CD6247E41244EE82B3BD2AB14
#> 2 940F168DB3203A028350BC4989EBDE17
#> 3 1D9944BA5D170684910D3F5E56C2990B
#> 4 8431C047CFA3920042325B28B238E335
#> 5 606FAF396319C78ABC9CAD17C49E52D9
#> 6 3B277F9151290346EF7E05EC046121D9
filter(data,ENS_KEY %in% treated_data)
#> # A tibble: 0 x 3
#> # ... with 3 variables: TIMESTAMP_UTC <date>, ENTITY_NAME <chr>,
#> #   ENS_KEY <chr>

Created on 2019-07-31 by the reprex package (v0.3.0)

As you can see, entry 6 of my data and entry 2 of my treated_data match, but the output is an empty tibble!

How about something like this?

The pull function just takes the values in a column and puts them into a vector. You can use this with %in% when you filter.

td <- treated_data %>% 
  pull(ENS_KEY)  # extract the ENS_KEY column as a character vector

data %>% 
  filter(ENS_KEY %in% td)

and you get:

# A tibble: 1 x 3
  TIMESTAMP_UTC ENTITY_NAME ENS_KEY                         
  <chr>         <chr>       <chr>                           
1 10/01/2000    3M Co.      940F168DB3203A028350BC4989EBDE17
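
As a rough illustration of why the original attempt returned an empty tibble, here is a small self-contained sketch with made-up keys (the names `dat` and `treated` and the values "a" to "d" are placeholders, not taken from the question). A tibble is a list of columns, so `%in%` tests each key against whole columns rather than against the individual values inside them; extracting the column first gives the element-wise test.

```r
library(dplyr)

# Toy data; names and key values are illustrative only
dat <- tibble(
  ENTITY_NAME = c("3M Co.", "3M Co.", "3M Co."),
  ENS_KEY     = c("a", "b", "c")
)
treated <- tibble(ENS_KEY = c("b", "d"))

# Comparing against the tibble itself: no key equals a whole column
wrong <- dat %>% filter(ENS_KEY %in% treated)

# Comparing against the values of the column: the row with key "b" is kept
right <- dat %>% filter(ENS_KEY %in% treated$ENS_KEY)

nrow(wrong)
nrow(right)
```

Here `treated$ENS_KEY` plays the same role as the `pull` call above; either form hands `%in%` a plain character vector.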

Another option, which will give you the same result:

data %>% 
  inner_join(treated_data, by = "ENS_KEY")
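
One caveat worth checking: if the key tibble could list the same key more than once, inner_join repeats the matching rows of data, whereas dplyr's semi_join keeps each row of data at most once and adds no columns. A minimal sketch under that assumption, again with made-up names and keys (`dat`, `treated`, and the values are placeholders):

```r
library(dplyr)

# Toy data; names and key values are illustrative only
dat <- tibble(ENS_KEY = c("a", "b", "c"), value = 1:3)

# Key "b" is listed twice on purpose
treated <- tibble(ENS_KEY = c("b", "b", "d"))

# inner_join returns one row per matching pair, so "b" appears twice
joined <- inner_join(dat, treated, by = "ENS_KEY")

# semi_join only filters dat, so "b" appears once
filtered <- semi_join(dat, treated, by = "ENS_KEY")

nrow(joined)
nrow(filtered)
```

When the keys in the smaller tibble are unique, as they appear to be in the question, both joins should return the same rows.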
