
lapply instead of for loop

I have the following huge data frame:

V1  V2  V3  V4
A   E   R   12
A   R   T   18
A   T   Y   44
A   Y   U   11
B   E   R   22
B   R   T   53
B   T   Y   11
B   Y   U   153 

What I want to do is get the outliers of V4 within each (V1, V2) pair.

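This is easy to handle with two for loops over the unique values of V1 and V2: take a subset on each round, get the V4 vector of that subset, and use any function from the outliers package to find the outliers. The problem is speed.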

I have never used lapply. Maybe someone can show me how to use lapply instead of the for loops to do this efficiently.
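For reference, the split-and-apply pattern described above can be sketched in base R with split() and lapply(). This is only a minimal sketch on made-up toy data: boxplot.stats() stands in for whichever function from the outliers package is preferred, and find_outliers, groups and flagged are illustrative names, not part of the original post.

```r
# Toy data in the same shape as the question's data frame (values are made up)
df <- data.frame(
  V1 = rep(c("A", "B"), each = 12),
  V2 = rep(rep(c("E", "R"), each = 6), times = 2),
  V4 = c(12, 13, 11, 14, 12, 44,    # A/E: 44 stands out
         18, 17, 19, 18, 20, 19,    # A/R: nothing unusual
         22, 21, 23, 22, 24, 23,    # B/E: nothing unusual
         53, 52, 54, 51, 53, 153)   # B/R: 153 stands out
)

# Stand-in outlier rule (assumption): the boxplot 1.5 * IQR rule from base R.
# Swap in any function that returns the outlying values of a numeric vector.
find_outliers <- function(v) grDevices::boxplot.stats(v)$out

# One data frame per (V1, V2) pair; drop = TRUE skips empty combinations
groups <- split(df, list(df$V1, df$V2), drop = TRUE)

# Apply the rule to each group and keep only the flagged rows
flagged <- lapply(groups, function(g) {
  g[g$V4 %in% find_outliers(g$V4), , drop = FALSE]
})

# Stack the per-group results back into a single data frame
result <- do.call(rbind, flagged)
print(result)
```

lapply() mainly tidies up the loop. With millions of rows, most of the time still goes into subsetting, so the speed-up over explicit for loops may be modest.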

Here is a data.table solution:

For about 4.4 million rows (676 groups with 6,500 records each), it takes just over 2 seconds, including data generation.

library(outliers)
library(data.table)

# Fake data generation and coercion to data.table
d <- as.data.table(expand.grid(x=LETTERS, y=LETTERS, z=LETTERS))
d <- do.call(rbind, replicate(250, d, FALSE))

# Add a random "value" column and a "row" column to keep track of row numbers
d[, c('value', 'row'):=list(rnorm(nrow(d)), seq_len(nrow(d)))]

# > d
#          x y z      value     row
#       1: A A A -1.1712284       1
#       2: B A A  0.1818000       2
#       3: C A A -1.3959594       3
#       4: D A A -0.4778956       4
#       5: E A A -2.0426768       5
#      ---                         
# 4393996: V Z Z  0.4024398 4393996
# 4393997: W Z Z  0.9891237 4393997
# 4393998: X Z Z  1.2066572 4393998
# 4393999: Y Z Z  2.3023321 4393999
# 4394000: Z Z Z -0.8343059 4394000

# For each group (combination of x and y), perform the outlier test
outliers <- d[, chisq.out.test(value), list(x, y)]

# Add the row numbers for min and max numbers of each group
outliers <- merge(outliers, 
                  d[, list(min.ind=row[which.min(value)], 
                           max.ind=row[which.max(value)]), list(x, y)], 
                  by=c('x', 'y'))

# Create a new outlier column. If the p.value is >= 0.05, set outlier = NA;
# otherwise (p.value < 0.05), if the "alternative" column contains "lowest",
# set outlier = min.ind, else max.ind.
outliers[, outlier:=ifelse(p.value < 0.05, 
                  ifelse(grepl('lowest', alternative), min.ind, max.ind), 
                  NA)]

The output looks like this:

# > outliers
#      x y statistic                                  alternative      p.value                       method
#   1: A A  13.69290 highest value 3.70310786094858 is an outlier 2.152665e-04 chi-squared test for outlier
#   2: A B  11.99842 lowest value -3.47397308041372 is an outlier 5.324581e-04 chi-squared test for outlier
#   3: A C  12.41749 highest value 3.49833131757565 is an outlier 4.253310e-04 chi-squared test for outlier
#   4: A D  16.18416 lowest value -4.00696031141966 is an outlier 5.747273e-05 chi-squared test for outlier
#   5: A E  12.32196 lowest value -3.56650649267448 is an outlier 4.476613e-04 chi-squared test for outlier
#  ---                                                                                                     
# 672: Z V  11.66230 lowest value -3.43256736243089 is an outlier 6.377944e-04 chi-squared test for outlier
# 673: Z W  14.11816 highest value 3.75476979294983 is an outlier 1.716780e-04 chi-squared test for outlier
# 674: Z X  15.63605 highest value 3.93390421620766 is an outlier 7.677674e-05 chi-squared test for outlier
# 675: Z Y  17.05664 lowest value -4.12928000349912 is an outlier 3.628127e-05 chi-squared test for outlier
# 676: Z Z  14.44709 lowest value -3.82794835873449 is an outlier 1.441520e-04 chi-squared test for outlier
#      data.name min.ind max.ind outlier
#   1:     value 3609165 1191113 1191113
#   2:     value  105483 3476019  105483
#   3:     value 4153397 1375713 1375713
#   4:     value 3406443 2539135 3406443
#   5:     value   25117 2004445   25117
#  ---                                  
# 672:     value 1871740 2551796 1871740
# 673:     value 1003782 2158390 2158390
# 674:     value 1555424 1492556 1492556
# 675:     value 2071914 1344538 2071914
# 676:     value 2281500  426556 2281500

It may be a bit clunky, but hey, it gets us there in the end.
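The merge step above can probably be avoided by running the test and looking up the row number in the same grouped call. The following is a minimal sketch under that assumption, on small made-up data. The names toy, res, test and idx are illustrative, and it relies only on chisq.out.test() returning the alternative and p.value fields shown in the output above.

```r
library(data.table)
library(outliers)

# Small made-up data: group A has a high outlier, B a low one, C none
toy <- data.table(
  x = rep(c("A", "B", "C"), each = 10),
  y = "E",
  value = c(1:9, 100, -50, 11:19, 1:10)
)
toy[, row := .I]  # keep track of the original row numbers

# One pass per (x, y) group: run the test, then pick the row of the
# minimum or maximum depending on which tail the test points at
res <- toy[, {
  test <- chisq.out.test(value)
  idx  <- if (grepl("lowest", test$alternative)) which.min(value) else which.max(value)
  list(p.value = test$p.value,
       outlier = if (test$p.value < 0.05) row[idx] else NA_integer_)
}, by = .(x, y)]

print(res)
```

Using NA_integer_ rather than a plain NA keeps the outlier column the same type in every group, which data.table expects when it combines the per-group results.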

