Correlation between a selected column and the rest of a data.frame in R

Question

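My data has about 270 columns and 160,000 observations, most of them non-numeric.

I need to find patterns and dependencies between columns. For example, I need to correlate the "Material" column with the other columns.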

Material   |  Name    |  Country | Vehicle    
----------------------------------------------
Bricks     |  John    |  A       | Car
Bricks     |  John    |  A       | Car
Bricks     |  John    |  A       | Motorcycles
Bricks     |  John    |  B       | Motorcycles
Concrete   |  Bill    |  B       | Car
Concrete   |  Bill    |  B       | Car
Concrete   |  Bill    |  B       | Car
Concrete   |  Bill    |  A       | Car

The result I want is:

Name    - 100% 
Country - 75%
Vehicle - 50%

I tried:

library("GoodmanKruskal")
Cor_matrix<- GKtauDataframe(df)
plot(Cor_matrix)

But I got: Error in table(x, y, useNA = includeNA) : attempt to make a table with >= 2^31 elements

Or:

library("corrr")
df %>% correlate() %>% focus(Material)

Error in stats::cor(x = x, y = y, use = use, method = method) : 'x' must be numeric

So I am looking for packages and code examples that can handle non-numeric data. Thanks in advance.

Answer 1

Your code uses the function GKtauDataframe, which tries to compute the measure for all 270 x 270 column combinations at once. That is too much.
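The error message itself is raised by a single table() call, which refuses to build a contingency table once its cell count (distinct values in x times distinct values in y) reaches 2^31. As a rough check, assuming the column of interest is called Material, the sketch below counts distinct values per column and flags the pairs whose table would be that large. The toy df and the helper name n_distinct are illustrative only; they are not part of the question's data or of any package.

```r
# Toy data mirroring the example in the question (illustrative only)
df <- data.frame(
  Material = rep(c("Bricks", "Concrete"), each = 4),
  Name     = rep(c("John", "Bill"), each = 4),
  Country  = c("A", "A", "A", "B", "B", "B", "B", "A"),
  Vehicle  = c("Car", "Car", "Motorcycles", "Motorcycles",
               "Car", "Car", "Car", "Car"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: number of distinct non-missing values, as a double
# so that the multiplication below cannot overflow integer range
n_distinct <- function(v) as.numeric(length(unique(v[!is.na(v)])))

card   <- vapply(df, n_distinct, numeric(1))
others <- names(card) != "Material"

# Approximate cell count of the Material-vs-column contingency table
cells <- card[["Material"]] * card[others]

# Columns whose pairing with Material would hit the table() size limit
too_big <- names(cells)[cells >= 2^31]
```

On real data, columns listed in too_big (for example free-text or ID-like columns) are candidates to drop or recode before computing any pairwise measure.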

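However, as you mentioned, you only want to compare one column against all the others. That should be feasible and needs far less memory. The function GKtau does this for a single pair of columns: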

GKtau(df[, 1], df[, 2])

To get the values for the first column against all other columns, just call:

lapply(df[, -1], GKtau, df[, 1])

Of course, you can tidy the output by keeping only the tauxy component of each result:

sapply(df[, -1], function(di) GKtau(df[, 1], di)$tauxy)

This makes the output much more compact.
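For readers who want to see what is being computed, here is a minimal base-R sketch of the Goodman-Kruskal tau from one column to each other column. It is a hand-rolled illustration of the standard formula (proportional reduction in Gini variation of y once x is known), not the GoodmanKruskal package's own code; the function name gk_tau and the toy df are assumptions made for this example.

```r
# Toy data mirroring the example in the question (illustrative only)
df <- data.frame(
  Material = rep(c("Bricks", "Concrete"), each = 4),
  Name     = rep(c("John", "Bill"), each = 4),
  Country  = c("A", "A", "A", "B", "B", "B", "B", "A"),
  Vehicle  = c("Car", "Car", "Motorcycles", "Motorcycles",
               "Car", "Car", "Car", "Car"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: Goodman-Kruskal tau for predicting y from x
gk_tau <- function(x, y) {
  keep <- !(is.na(x) | is.na(y))          # use complete pairs only
  x <- as.character(x[keep])
  y <- as.character(y[keep])
  n <- length(y)

  # Gini variation of y on its own: 1 - sum_j p_j^2
  v_y <- 1 - sum((table(y) / n)^2)
  if (v_y == 0) return(NA_real_)          # y is constant, tau undefined

  # For each level of x: sum_j n_ij^2 / n_i, computed group by group so
  # that no full x-by-y contingency table is ever allocated
  within <- vapply(split(y, x),
                   function(g) sum(table(g)^2) / length(g),
                   numeric(1))

  # Expected Gini variation of y given x
  ev_y_given_x <- 1 - sum(within) / n

  (v_y - ev_y_given_x) / v_y
}

# Association from Material to every other column
tau <- sapply(df[-1], function(col) gk_tau(df$Material, col))
```

Because the helper splits y by the levels of x instead of cross-tabulating both columns, it only ever tabulates one column at a time, which is one way to sidestep the table size limit on high-cardinality columns.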

Answer 2

If the columns in df are of type factor, they need to be converted to numeric first.

df[] <- Map(as.numeric,df)

Otherwise (for example, if the columns are character), convert them to factor first:

df[] <- Map(function(v) as.numeric(factor(v)),df)

Then you can run the following code:

df %>% correlate() %>% focus(Material)

# A tibble: 3 x 2
  rowname Material
  <chr>      <dbl>
1 Name      -1    
2 Country    0.5  
3 Vehicle   -0.577
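The same numbers can be reproduced with base R alone, which may help when corrr is not installed. This is a sketch on toy data that mirrors the question's example; the object names codes and r are illustrative.

```r
# Toy data mirroring the example in the question (illustrative only)
df <- data.frame(
  Material = rep(c("Bricks", "Concrete"), each = 4),
  Name     = rep(c("John", "Bill"), each = 4),
  Country  = c("A", "A", "A", "B", "B", "B", "B", "A"),
  Vehicle  = c("Car", "Car", "Motorcycles", "Motorcycles",
               "Car", "Car", "Car", "Car"),
  stringsAsFactors = TRUE
)

# Replace every column by its integer level codes (levels sort alphabetically)
codes <- as.data.frame(lapply(df, function(v) as.numeric(factor(v))))

# Pearson correlation of Material's codes with each other column's codes
r <- cor(codes[-1], codes$Material)[, 1]
round(r, 3)
```

Keep in mind that the codes follow the alphabetical order of the levels. With two levels per column only the sign is arbitrary, but with three or more unordered levels the magnitude also depends on that ordering, so these values are best read as a rough screen.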

Data

df <- structure(list(Material = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L), .Label = c("Bricks", "Concrete"), class = "factor"), 
    Name = structure(c(2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("Bill", 
    "John"), class = "factor"), Country = structure(c(1L, 1L, 
    1L, 2L, 2L, 2L, 2L, 1L), .Label = c("A", "B"), class = "factor"), 
    Vehicle = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("Car", 
    "Motorcycles"), class = "factor")), class = "data.frame", row.names = c(NA, 
-8L))


Answer 1
0 2020-02-25 14:59:32

Answer 2
0 Accepted 2020-02-25 15:06:37
