More efficient methods than nested for loops in R: matching

I'm trying to match people when they have identical first names, last names, and dates of birth, and keep the smallest numerical value for their IDs.

I've created a test database below (much smaller than my actual dataset) and written a nested for loop that looks like it's doing what it's supposed to.

But it's slow as hell on bigger datasets.

I'm relatively new to the apply functions, but they seem more intuitive for applying functions than for data wrangling.

What's a more efficient alternative to what I'm doing here? I'm sure there's a simple solution that will have me shaking my head for asking, but it's not coming to me.

dta.test<- NULL
dta.test$Person_id <- c(1,2,3,4,5,6,7,8,9,10, 11)
dta.test$FirstName <- c("John", "James", "John", "Alex", "Alexander", "Jonathan", "John", "Alex", "James", "John", "John")
dta.test$LastName <- c("Smith", "Jones", "Jones", "Jones", "Jones", "Smith", "Jones", "Smith", "Johnson", "Smith", "Smith")
dta.test$DOB <- c("2001-01-01", "2002-01-01", "2003-01-01", "2004-01-01", "2004-01-01", "2001-01-01", "2003-01-01", "2006-01-01", "2006-01-01", "2001-01-01", "2009-01-01")
dta.test$Actual_ID <- c(1, 2, 3, 4, 5, 6, 3, 8, 9, 1, 11)
dta.test <- as.data.frame(dta.test)

for (i in unique(dta.test$FirstName)) {
  for (j in unique(dta.test$LastName)) {
    for (k in unique(dta.test$DOB)) {
      # Rows that share this first name, last name, and date of birth
      rows <- dta.test$FirstName == i & dta.test$LastName == j & dta.test$DOB == k
      dta.test$Person_id[rows] <- min(dta.test$Person_id[rows], na.rm = TRUE)
    }
  }
}

Here's a dplyr solution:

library(dplyr)
dta.test %>%
  group_by(FirstName, LastName, DOB) %>%
  mutate(Person_id = min(Person_id))

# A tibble: 11 x 5
# Groups: FirstName, LastName, DOB [9]
   # Person_id FirstName LastName DOB        Actual_ID
       # <dbl> <fct>     <fct>    <fct>          <dbl>
 # 1        1. John      Smith    2001-01-01        1.
 # 2        2. James     Jones    2002-01-01        2.
 # 3        3. John      Jones    2003-01-01        3.
 # 4        4. Alex      Jones    2004-01-01        4.
 # 5        5. Alexander Jones    2004-01-01        5.
 # 6        6. Jonathan  Smith    2001-01-01        6.
 # 7        3. John      Jones    2003-01-01        3.
 # 8        8. Alex      Smith    2006-01-01        8.
 # 9        9. James     Johnson  2006-01-01        9.
# 10        1. John      Smith    2001-01-01        1.
# 11       11. John      Smith    2009-01-01       11.
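The pipeline above returns a grouped tibble. If a plain data frame is wanted afterwards, one possible follow-up is to drop the grouping and convert back. This is only a sketch: the `people` data frame and its values are made up for illustration and merely reuse the column names from the question.

```r
library(dplyr)

# Hypothetical stand-in for dta.test (same column names, invented rows)
people <- data.frame(
  Person_id = c(1, 2, 3, 4),
  FirstName = c("John", "James", "John", "John"),
  LastName  = c("Smith", "Jones", "Smith", "Smith"),
  DOB       = c("2001-01-01", "2002-01-01", "2001-01-01", "2009-01-01"),
  stringsAsFactors = FALSE
)

result <- people %>%
  group_by(FirstName, LastName, DOB) %>%
  mutate(Person_id = min(Person_id)) %>%  # smallest ID within each group
  ungroup() %>%                           # drop the grouping attribute
  as.data.frame()                         # tibble back to a base data frame

result$Person_id
# 1 2 1 4
```

Rows 1 and 3 share a name and date of birth, so both end up with ID 1; the other rows keep their own IDs.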

EDIT: added a performance comparison

for_loop_approach <- function() {
  for (i in unique(dta.test$FirstName)) {
    for (j in unique(dta.test$LastName)) {
      for (k in unique(dta.test$DOB)) {
        # Rows that share this first name, last name, and date of birth
        rows <- dta.test$FirstName == i & dta.test$LastName == j & dta.test$DOB == k
        dta.test$Person_id[rows] <- min(dta.test$Person_id[rows], na.rm = TRUE)
      }
    }
  }
}

dplyr_approach <- function() {
    require(dplyr)
    dta.test %>%
      group_by(FirstName, LastName, DOB) %>%
      mutate(Person_id = min(Person_id))
}

library(microbenchmark)
microbenchmark(for_loop_approach(), dplyr_approach(), unit="relative", times=100L)

Unit: relative
                expr      min      lq    mean   median       uq      max neval
 for_loop_approach() 20.97948 20.6478 18.8189 17.81437 17.91815 11.76743   100
    dplyr_approach()  1.00000  1.0000  1.0000  1.00000  1.00000  1.00000   100
There were 50 or more warnings (use warnings() to see the first 50)

I've implemented a base R approach rather than dplyr, and according to microbenchmark it comes out 7.46 times faster than CPak's dplyr approach and 139.4 times faster than the for loop approach. It uses only the match and paste0 functions. match returns the position of the first match, so each row receives the row number of the first row with the same name and date of birth; because Person_id here equals the row number, that is also the smallest matching ID:

  dta.test[, "Actual_id"] <- match(paste0(dta.test$FirstName, dta.test$LastName, dta.test$DOB), paste0(dta.test$FirstName, dta.test$LastName, dta.test$DOB))

This approach also writes the result straight into the data frame, rather than returning a tibble (from which you would need to extract the new column and add it to your data frame):

   Person_id FirstName LastName        DOB Actual_id
1          1      John    Smith 2001-01-01         1
2          2     James    Jones 2002-01-01         2
3          3      John    Jones 2003-01-01         3
4          4      Alex    Jones 2004-01-01         4
5          5 Alexander    Jones 2004-01-01         5
6          6  Jonathan    Smith 2001-01-01         6
7          7      John    Jones 2003-01-01         3
8          8      Alex    Smith 2006-01-01         8
9          9     James  Johnson 2006-01-01         9
10        10      John    Smith 2001-01-01         1
11        11      John    Smith 2009-01-01        11
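To see why matching a key vector against itself works, here is a minimal sketch on made-up keys. It also shows one caveat of building keys with paste0(): because there is no separator, different field combinations can collapse into the same string. Using "|" as a separator is an assumption that this character never occurs in the data.

```r
# Hypothetical toy keys, one per row
keys <- c("a", "b", "a", "c", "b")

# match(x, table) gives, for each element of x, the index of its FIRST
# occurrence in table, so duplicates map to the row where the key first appears.
first_row <- match(keys, keys)
first_row
# 1 2 1 4 2

# Without a separator, two different pairs of fields produce the same key
collide_plain <- paste0("Jo", "Ann") == paste0("JoA", "nn")
collide_plain
# TRUE

# With a separator (assumed absent from the data) the keys stay distinct
collide_sep <- paste("Jo", "Ann", sep = "|") == paste("JoA", "nn", sep = "|")
collide_sep
# FALSE
```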

In your real data I expect the person ID is not so simple (not just an integer) and doesn't run in numerical order, e.g.

dta.test$Person_id <- paste0(LETTERS[1:11],1:11)

You just need a small tweak to make this still work, so that it extracts the value from the Person_id column:

dta.test[, "Actual_id"] <- dta.test[match(paste0(dta.test$FirstName, dta.test$LastName, dta.test$DOB), paste0(dta.test$FirstName, dta.test$LastName, dta.test$DOB)), "Person_id"]

Giving:

   Person_id FirstName LastName        DOB Actual_id
1         A1      John    Smith 2001-01-01        A1
2         B2     James    Jones 2002-01-01        B2
3         C3      John    Jones 2003-01-01        C3
4         D4      Alex    Jones 2004-01-01        D4
5         E5 Alexander    Jones 2004-01-01        E5
6         F6  Jonathan    Smith 2001-01-01        F6
7         G7      John    Jones 2003-01-01        C3
8         H8      Alex    Smith 2006-01-01        H8
9         I9     James  Johnson 2006-01-01        I9
10       J10      John    Smith 2001-01-01        A1
11       K11      John    Smith 2009-01-01       K11
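Note that the match-based lookup returns the ID of the first row in each group, which is the smallest ID only when the rows happen to be ordered by ID. If that is not guaranteed, one possible variation (a sketch, not part of the original answer) is to match against the keys sorted by ID, so the first match is the smallest one. The `people` data and the "|" separator are assumptions made for illustration.

```r
# Hypothetical rows whose IDs are NOT in row order
people <- data.frame(
  Person_id = c(7, 2, 3, 9),
  FirstName = c("John", "James", "John", "John"),
  LastName  = c("Smith", "Jones", "Smith", "Smith"),
  DOB       = c("2001-01-01", "2002-01-01", "2001-01-01", "2009-01-01"),
  stringsAsFactors = FALSE
)

# One key per row; "|" is assumed not to occur in any field
key <- paste(people$FirstName, people$LastName, people$DOB, sep = "|")

# Order rows by ID, then match each key against the ID-sorted keys:
# the first match is now the row with the smallest ID for that key.
ord <- order(people$Person_id)
people$Actual_id <- people$Person_id[ord][match(key, key[ord])]

people$Actual_id
# 3 2 3 9
```

Rows 1 and 3 describe the same person with IDs 7 and 3, and both receive 3 even though ID 7 appears first.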

A data.table solution will probably be quickest on large data with lots of groups:

library(data.table)
setDT(dta.test, key = c("FirstName", "LastName", "DOB"))
dta.test[, Actual_ID := min(Person_id, na.rm = TRUE), by = .(FirstName, LastName, DOB)]
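One thing to be aware of: setting a key sorts the table by the key columns, so the rows no longer appear in their original order. If the original row order matters, a variation that skips the key may be enough. The sketch below uses a made-up `people` data frame with the same column names as the question.

```r
library(data.table)

# Hypothetical rows whose IDs are not in row order
people <- data.frame(
  Person_id = c(7, 2, 3, 9),
  FirstName = c("John", "James", "John", "John"),
  LastName  = c("Smith", "Jones", "Smith", "Smith"),
  DOB       = c("2001-01-01", "2002-01-01", "2001-01-01", "2009-01-01"),
  stringsAsFactors = FALSE
)

setDT(people)  # converts to a data.table in place; no key, so rows are not sorted

# := adds the column by reference; by = computes min() within each group
people[, Actual_ID := min(Person_id, na.rm = TRUE),
       by = .(FirstName, LastName, DOB)]

people$Actual_ID
# 3 2 3 9
```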
