Dcast overlap in base R

Question

I have a data set named my_data with more than 100,000 rows. A sample is shown below.

id                  source
8166923397733625478 happimobiles
8166923397733625478 Springfit
7301100145962413274 Duroflex
6703062895304712434 happimobiles
6897156268457025524 themrphone
37564799155342281   Sangeetha Mobiles
1159098248970201145 Sangeetha Mobiles

I ran the code below on this table (my_data).

library("readxl")
my_data <- read_excel("C:\\Users\\ashishpatodia\\Desktop\\R\\Code\\Sample_Data_Overlap.xlsx",sheet = "10000 sample")

library(data.table)  # provides setDT() and dcast()
setDT(my_data)
(cohorts <- dcast(unique(my_data)[,cohort:=(source),by=id],cohort~ source, fun.aggregate=length, value.var="cohort"))

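I want output in which each id is counted under its source, including overlaps. For example, the ID ending in 5478 belongs to both happimobiles and Springfit. So happimobiles has the ids 8166923397733625478 and 6703062895304712434, which gives it a count of 2, and it has 1 id in common with Springfit.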

Output

               happimobiles  Springfit  Duroflex  themrphone  Sangeetha
happimobiles              2          1         0           0          0
Springfit                 1          1         0           0          0
Duroflex                  0          0         1           0          0
themrphone                0          0         0           1          0
Sangeetha                 0          0         0           0          1

I also tried

Pivot<-dcast(my_data,source~source,value.var = "id",function(x) length((x)))

This correctly gives me only the unique records within each partner, but not the overlaps.

I also tried

crossprod(table(my_data))

but this did not give the correct answer.

Link to the full data on which I would like the code to run:

https://docs.google.com/spreadsheets/d/1HUoRlVVf8EBedj1puXdgtTS6GGeFsXYqjVicUwbc5KE/edit#gid=0

Answer 1

We can use crossprod together with table from base R.

crossprod(table(my_data))
#            source
#source              Duroflex happimobiles Sangeetha Mobiles Springfit themrphone
#  Duroflex                 1            0                 0         0          0
#  happimobiles             0            2                 0         1          0
#  Sangeetha Mobiles        0            0                 2         0          0
#  Springfit                0            1                 0         1          0
#  themrphone               0            0                 0         0          1
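
To see why this works, here is a minimal sketch. It assumes the ids are stored as character strings and that a repeated (id, source) row should count only once; neither is stated in the question, and the names pairs, incidence, and overlap are illustrative. table() builds an id-by-source incidence matrix, and crossprod() multiplies its transpose by itself, so the diagonal holds the number of ids per source and each off-diagonal cell holds the number of ids two sources share.

```r
# Hypothetical sample mirroring the question; ids kept as character strings
my_data <- data.frame(
  id = c("8166923397733625478", "8166923397733625478", "7301100145962413274",
         "6703062895304712434", "6897156268457025524", "37564799155342281",
         "1159098248970201145"),
  source = c("happimobiles", "Springfit", "Duroflex", "happimobiles",
             "themrphone", "Sangeetha Mobiles", "Sangeetha Mobiles"),
  stringsAsFactors = FALSE
)

# Drop repeated (id, source) rows so each pair is counted once
pairs <- unique(my_data[, c("id", "source")])

# Incidence matrix: one row per id, one column per source, entries 0 or 1
incidence <- table(pairs$id, pairs$source)

# t(incidence) %*% incidence:
# diagonal = ids per source, off-diagonal = ids shared by two sources
overlap <- crossprod(incidence)
print(overlap)
```

Individual cells can be read by name, for example overlap["happimobiles", "Springfit"] for the ids the two sources have in common.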

Data

my_data <- structure(list(id = c(8166923397733625856, 8166923397733625856, 
7301100145962413056, 6703062895304712192, 6897156268457025536, 
37564799155342280, 1159098248970201088), source = c("happimobiles", 
"Springfit", "Duroflex", "happimobiles", "themrphone", "Sangeetha Mobiles", 
"Sangeetha Mobiles")), class = "data.frame", row.names = c(NA, 
-7L))
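
Note that the ids in the structure above differ in their last digits from those shown in the question (for example 8166923397733625856 versus 8166923397733625478), which is consistent with 19-digit values being stored as double-precision numbers. The sketch below uses two made-up ids that differ only in the final digit to illustrate how such values can collapse into one when numeric, while character strings stay distinct. Whether this affects the full data set is not something the post confirms; if it does, importing the id column as text (for instance through the col_types argument of read_excel) may be worth trying.

```r
# Hypothetical ids that differ only in the final digit
id_a <- 8166923397733625478
id_b <- 8166923397733625479

# As doubles, both literals round to the same representable value
same_as_numeric <- id_a == id_b

# As character strings, the two ids remain distinct
same_as_text <- "8166923397733625478" == "8166923397733625479"

print(c(numeric = same_as_numeric, text = same_as_text))
```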

Solution 1
1 2019-11-28 18:28:53