segfault in R using reshape2 package and dcast

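RStudio crashes when I try to reshape a particular data frame using dcast (from the reshape2 package). I discovered that the crash actually occurs in R itself, so I ran my cast code in R.app and got the type of error that gives this site its name: Error: segfault from C stack overflow. With the help of Google and SO, I learned that this is a memory access error.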

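I've gotten that far, but I don't know where to go from here. I can't provide a truly reproducible example because my data frame has about 558,000 rows and the problem doesn't occur on small toy examples. For instance, dcast works fine even on a 50,000-row subset of the data. Are there particular rows of data that are causing the problem? If so, can anyone suggest which features to look for that could cause the type of error I'm getting?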

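Here is a subset of the data frame I'm casting (with fake values for some variables), followed by the cast call I'm using. I've also included this small data snippet as dput output below in case it helps. The real data set has about 700 values of prog, 15 values of prog1, and 5 values of fa.type.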

  id        term   yr    nslds acad.lev    prog            prog1 fa.type amount
1  1   Fall 2009 2010 Graduate Graduate  loan 1      Other Loans    Loan   5000
2  1 Spring 2010 2010 Graduate Graduate  loan 1      Other Loans    Loan   5000
3  2   Fall 2009 2010 Graduate Graduate  loan 2    Stafford Loan    Loan   8781
4  2 Spring 2010 2010 Graduate Graduate  loan 2    Stafford Loan    Loan   8781
5  3   Fall 2007 2008 Graduate Graduate  loan 3    Stafford Loan    Loan   4250
6  3   Fall 2007 2008 Graduate Graduate grant 1 University Grant   Grant   1707

fa.wide = dcast(id + term + yr + nslds + acad.lev ~ prog1 + fa.type , data=fa, value.var="amount", fun.aggregate=sum)

fa = structure(list(id = c(1, 1, 2, 2, 3, 3), term = structure(c(7L, 
8L, 7L, 8L, 1L, 1L), .Label = c("Fall 2007", "Spring 2008", "Summer 2008", 
"Fall 2008", "Spring 2009", "Summer 2009", "Fall 2009", "Spring 2010", 
"Summer 2010", "Fall 2010", "Spring 2011", "Summer 2011", "Fall 2011", 
"Spring 2012", "Summer 2012", "Fall 2012", "Spring 2013"), class = c("ordered", 
"factor")), yr = c(2010L, 2010L, 2010L, 2010L, 2008L, 2008L), 
    nslds = structure(c(7L, 7L, 7L, 7L, 7L, 7L), .Label = c("1st Year, Never Attended", 
    "1st Year, Previously Attended", "2nd Year", "3rd Year", 
    "4th Year", "5th Year+", "Graduate"), class = c("ordered", 
    "factor")), acad.lev = structure(c(6L, 6L, 6L, 6L, 6L, 6L
    ), .Label = c("Freshman", "Sophomore", "Junior", "Senior", 
    "PB Undergrad", "Graduate"), class = c("ordered", "factor"
    )), prog = c("loan 1", "loan 1", "loan 2", "loan 2", "loan 3", 
    "grant 1"), prog1 = c("Other Loans", "Other Loans", "Stafford Loan", 
    "Stafford Loan", "Stafford Loan", "University Grant"), fa.type = structure(c(3L, 
    3L, 3L, 3L, 3L, 2L), .Label = c("Athletic", "Grant", "Loan", 
    "Scholarship", "Waiver", "Work/Study"), class = "factor"), 
    amount = c(5000, 5000, 8781, 8781, 4250, 1707)), .Names = c("id", 
"term", "yr", "nslds", "acad.lev", "prog", "prog1", "fa.type", 
"amount"), row.names = c(NA, 6L), class = "data.frame")
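
Since toy examples work and the full data does not, one thing that can be measured without calling dcast at all is how large the wide result would be. The sketch below uses a hand-built stand-in for the sample above (`fa_toy`, `row_vars`, `col_vars` and the `n_*` names are made up for this illustration) and counts the distinct combinations on each side of the formula. Whether that size is what actually triggers the crash is an assumption, not something established in the question.

```r
# Toy stand-in for the sample rows above; only the columns used in the formula.
fa_toy <- data.frame(
  id       = c(1, 1, 2, 2, 3, 3),
  term     = c("Fall 2009", "Spring 2010", "Fall 2009",
               "Spring 2010", "Fall 2007", "Fall 2007"),
  yr       = c(2010L, 2010L, 2010L, 2010L, 2008L, 2008L),
  nslds    = "Graduate",
  acad.lev = "Graduate",
  prog1    = c("Other Loans", "Other Loans", "Stafford Loan",
               "Stafford Loan", "Stafford Loan", "University Grant"),
  fa.type  = c("Loan", "Loan", "Loan", "Loan", "Loan", "Grant"),
  amount   = c(5000, 5000, 8781, 8781, 4250, 1707),
  stringsAsFactors = FALSE
)

row_vars <- c("id", "term", "yr", "nslds", "acad.lev")  # left of ~ in the formula
col_vars <- c("prog1", "fa.type")                       # right of ~ in the formula

# Distinct combinations that actually occur in the data on each side.
n_rows  <- nrow(unique(fa_toy[, row_vars]))
n_cols  <- nrow(unique(fa_toy[, col_vars]))
n_cells <- n_rows * n_cols

cat("rows:", n_rows, " columns:", n_cols, " cells:", n_cells, "\n")
```

Running the same three counts on the full 558,000-row data frame and on the 50,000-row subset that works would show how far apart the two cases are.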

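This is not an answer, but a simple (nonsensical) reproducible example that doesn't fit in a comment. You can reproduce the error with this simple example (on my MacBook Pro).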

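library(reshape2)
n <- 1448
# Two rows per student, each with a random grade between 1 and 100
df <- data.frame( Student = rep( 1:n , each = 2 ) , Grade = sample( 100 , n*2 , replace = TRUE ) )
# Casting Student against itself is meaningless, but it is enough to trigger the crash
df2 <- dcast( df , Student ~ Student , value.var = "Grade" , fun.aggregate = sum )
Error: segfault from C stack overflow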

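The error occurs at the boundary n = 1448, i.e. it does not occur when n = 1447 or below. It appears that the error comes from split_indices in split-numeric.c from the plyr package. It may be related to the number of grouping levels being assigned to an (unsigned?) integer value, which would cause a memory access error if the number of groups exceeds 32767, but TBH I am clutching at straws now.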

My sessionInfo(), in case anyone can't recreate this error:

R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reshape2_1.2.2

loaded via a namespace (and not attached):
[1] plyr_1.8      stringr_0.6.2

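Interestingly, if I run the df2 <- command again after receiving the first error, R crashes completely and I get an error report generated by the OS. I include the relevant part of the crash log here: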

Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_PROTECTION_FAILURE at 0x00007fff5f3ff120

VM Regions Near 0x7fff5f3ff120:
    JS JIT generated code  00004d431a401000-00004d431a402000 [    4K] ---/rwx SM=NUL  
--> STACK GUARD            00007fff5bc00000-00007fff5f400000 [ 56.0M] ---/rwx SM=NUL  stack guard for thread 0
    Stack                  00007fff5f400000-00007fff5fc00000 [ 8192K] rw-/rwx SM=COW  thread 0

Application Specific Information:
objc[57147]: garbage collection is OFF

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   libsystem_c.dylib               0x00007fff897c4632 small_free_scan_madvise_free + 41
1   libsystem_c.dylib               0x00007fff897c5f06 szone_free_definite_size + 4186
2   libsystem_c.dylib               0x00007fff897fe789 free + 194
3   libR.dylib                      0x0000000100222dbf R_gc_internal + 7327 (memory.c:952)
4   libR.dylib                      0x0000000100224919 Rf_allocVector + 841 (memory.c:2356)
5   plyr.so                         0x000000010144bd2c split_indices + 204 (split-numeric.c:23)
6   libR.dylib                      0x00000001001b4cc7 do_dotcall + 16311 (dotcode.c:593)
7   libR.dylib                      0x00000001001e4448 Rf_eval + 1672 (eval.c:494)
8   libR.dylib                      0x00000001001e5edd do_begin + 141 (eval.c:1415)
9   libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
10  libR.dylib                      0x00000001001e93b1 Rf_applyClosure + 849 (eval.c:861)
11  libR.dylib                      0x00000001001e41b2 Rf_eval + 1010 (eval.c:512)
12  libR.dylib                      0x00000001001e74e5 do_set + 709 (eval.c:1717)
13  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
14  libR.dylib                      0x00000001001e5edd do_begin + 141 (eval.c:1415)
15  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
16  libR.dylib                      0x00000001001e93b1 Rf_applyClosure + 849 (eval.c:861)
17  libR.dylib                      0x00000001001e41b2 Rf_eval + 1010 (eval.c:512)
18  libR.dylib                      0x00000001001e74e5 do_set + 709 (eval.c:1717)
19  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
20  libR.dylib                      0x00000001001e5edd do_begin + 141 (eval.c:1415)
21  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
22  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
23  libR.dylib                      0x00000001001e5edd do_begin + 141 (eval.c:1415)
24  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
25  libR.dylib                      0x00000001001e93b1 Rf_applyClosure + 849 (eval.c:861)
26  libR.dylib                      0x00000001001e41b2 Rf_eval + 1010 (eval.c:512)
27  libR.dylib                      0x00000001001e74e5 do_set + 709 (eval.c:1717)
28  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
29  libR.dylib                      0x00000001001e5edd do_begin + 141 (eval.c:1415)
30  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
31  libR.dylib                      0x00000001001e93b1 Rf_applyClosure + 849 (eval.c:861)
32  libR.dylib                      0x00000001001e41b2 Rf_eval + 1010 (eval.c:512)
33  libR.dylib                      0x00000001001e74e5 do_set + 709 (eval.c:1717)
34  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
35  libR.dylib                      0x000000010021c761 R_ReplDLLdo1 + 481 (main.c:362)
36  org.R-project.R                 0x0000000100022c24 run_REngineRmainloop + 196
37  org.R-project.R                 0x00000001000159b7 -[REngine runREPL] + 119
38  org.R-project.R                 0x0000000100001f24 main + 852
39  org.R-project.R                 0x0000000100001914 start + 52
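
As a rough check on the numbers in this answer (arithmetic only, not a confirmed diagnosis): casting Student ~ Student over n students asks for an n-by-n grid, and the crash log lists the main thread's stack as 8192K. The sketch below assumes, hypothetically, that some C code keeps one 4-byte int per grid cell on the stack. Nothing in this thread confirms that, and `cells`, `bytes_per_int` and `stack_bytes` are names invented for the illustration.

```r
# Cells in the wide result of dcast(df, Student ~ Student): n rows by n columns.
cells <- function(n) n * n

bytes_per_int <- 4            # ASSUMPTION: one 4-byte C int per cell
stack_bytes   <- 8192 * 1024  # "Stack ... [ 8192K]" taken from the crash log

for (n in c(1447, 1448)) {
  need <- cells(n) * bytes_per_int
  cat(sprintf("n = %.0f: %.0f cells, %.0f bytes, %.0f bytes of stack left\n",
              n, cells(n), need, stack_bytes - need))
}
```

Under that assumption the n = 1448 case would need 8,386,816 bytes, about 1.8 KB short of the whole 8192K stack, which is at least consistent with a boundary near that value.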

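I ran into the same problem when using dcast from the reshape2 package to pivot a long table into a wide one. I found the solution in this post: "plyr split_indices function crashes for long vectors". Specifically, you can download split-numeric.c and loop-apply.c from https://github.com/hadley/plyr/tree/master/src and use them to replace the corresponding files in a local copy of the plyr source. Then uninstall the plyr package from the R console, and finally reinstall the package locally: install.packages('/path/to/source', repos=NULL, type='source').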

This solved my problem; I hope it helps.
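
The uninstall and reinstall steps can be sketched as a small helper. This is only an outline under stated assumptions: `reinstall_plyr_from_source` is a made-up name, `src_dir` stands for the '/path/to/source' placeholder above (a local plyr source tree with the two files already replaced), and the helper refuses to run while plyr is loaded, so it is meant for a fresh R session.

```r
# Hypothetical helper: remove the installed plyr and install a patched local copy.
reinstall_plyr_from_source <- function(src_dir) {
  # The patched source tree must already exist on disk.
  stopifnot(dir.exists(src_dir))

  # Replacing a package that is currently loaded is unreliable, so bail out early.
  if ("plyr" %in% loadedNamespaces()) {
    stop("plyr is loaded; restart R and run this before loading reshape2 or plyr")
  }

  # Remove the installed copy only if there is one.
  if ("plyr" %in% rownames(installed.packages())) {
    remove.packages("plyr")
  }

  # Build and install from the local source directory (needs a C toolchain).
  install.packages(src_dir, repos = NULL, type = "source")
}

# Usage (not run here): reinstall_plyr_from_source("/path/to/source")
```

Installing with type = "source" compiles the C files, so a working compiler toolchain is required on the machine.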

Just to close out this old question: this was a bug that has since been fixed, as described in this GitHub issue.
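
If the bug is fixed upstream, a quick sanity check is to compare the installed plyr against the version that crashed. The sketch below assumes the fix shipped in some release newer than the plyr 1.8 shown in the sessionInfo() output earlier in the thread; the exact fixed version is not stated here, and `is_newer` is a made-up helper name.

```r
# Hypothetical helper: is an installed version strictly newer than the affected one?
# ASSUMPTION: "1.8" is the affected version, taken from the sessionInfo() output above.
is_newer <- function(installed, affected = "1.8") {
  package_version(installed) > package_version(affected)
}

# Only query the real installation if plyr is available on this machine.
if (requireNamespace("plyr", quietly = TRUE)) {
  installed <- as.character(packageVersion("plyr"))
  cat("plyr", installed, "newer than 1.8:", is_newer(installed), "\n")
}
```

If the check reports an old version, updating plyr from CRAN with install.packages("plyr") would be the usual route.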
