Sum and average all values in data frame rows based on a data frame column value in R

Question

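I have code that produces the output I want; however, it is extremely slow. I have two input datasets (metaClustering_perCell, data_clean). Each row index of data_clean corresponds to the same index position in metaClustering_perCell. Here is a sample of both datasets.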

dput(head(data_clean[1:5],10))

structure(
  list(
    `NA` = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
    EGFP.A = c(326, 314, 341, 0, 198, 295, 325, 309, 400, 328),
    CD43.PE.A = c(435, 402, 469, 283, 303, 371, 442, 363, 444, 358),
    CD45.PE.Vio770.A = c(399, 385, 379, 438, 384, 331, 402, 392, 354, 430),
    CD235a_41a.APC.A = c(412, 618, 239, 562, 661, 193, 363, 385, 408, 265),
    APC.Vio770.A = c(447, 491, 444, 437, 477, 328, 453, 326, 353, 0)
  ),
  row.names = c(NA, -10L),
  class = "data.frame"
)

NA	EGFP.A	CD43.PE.A	CD45.PE.Vio770.A	CD235a_41a.APC.A	APC.Vio770.A
1	326	435	399	412	447
2	314	402	385	618	491
3	341	469	379	239	444
4	0	283	438	562	437
5	198	303	384	661	477
6	295	371	331	193	328
7	325	442	402	363	453
8	309	363	392	385	326
9	400	444	354	408	353
10	328	358	430	265	0

dput(head(metaClustering_perCell,10))

c("1 Population", "1 Population", "1 Population", "1 Population", "1 Population",
"1 Population", "1 Population", "1 Population", "1 Population", "9 Population")

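Ultimately I want to make a heatmap from the mean value of each marker (EGFP.A, CD43.PE.A, ...); however, my dataset will contain nearly 2e8 cells, which are classified into a predetermined number of populations. The code I wrote is shown here. It creates 2 empty data frames: df_sum stores a running sum for each marker (EGFP.A, CD43.PE.A, ...), while df_count keeps a running tally of the total events in each population. Finally, the code takes the average by dividing the data frame by the vector. Here is the code.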

# create an empty matrix
df_sum  <- data.frame(matrix(ncol = length(data_clean), nrow = num_clusters))
pops_header <- unique(metaClustering_perCell)
rownames(df_sum) <- pops_header
colnames(df_sum) <- colnames(data_clean)

# creates empty table for storing the count values
df_count <- data.frame(matrix(ncol = num_clusters, nrow = 1))
colnames(df_count) <- pops_header



df_sum[is.na(df_sum)] <- 0
df_count[is.na(df_count)] <- 0



for (i in 1:length(metaClustering_perCell)){

  # only takes one row at a time of original data
  volt_vals <- data_clean[i,]
  
  # find the column to place it in (population)
  pop <- metaClustering_perCell[i]
  
  # Tally for each population
  df_count[1,pop] <- df_count[1,pop] + 1
  
  # adds to the previous value in the dataframe
  for (a in colnames(volt_vals)){
    df_sum[pop, a] <- volt_vals[a] + df_sum[pop, a]
  }
    
  # creates another dataframe same size as df to overwrite with the averages
  df_aves <- df_sum
  
  
  # Divide the df_=
  for (n in pops_header){
    df_aves[n,] <- mapply('/', df_sum[n,], df_count[n])
  }
}

The output I get is this (I trimmed the tables so they are easier to read):

>head(df_sum[1:3],10)

NA	EGFP.A	CD43.PE.A	CD45.PE.Vio770.A
1 Population	26062897	35936578	32784372
9 Population	1045468	1591084	1576716
2 Population	4374137	8673145	6555053
8 Population	818413	44836	1318176
5 Population	217605	443341	439357
6 Population	1056157	1558711	43206
7 Population	747037	883763	1134664
3 Population	1561994	2376586	2329772
4 Population	54940	9346	137085
10 Population	172735	213079	8043

>head(df_count[1:5])

1 Population	9 Population	2 Population	8 Population	5 Population
78909	4262	12982	4447	1392

> head(df_aves[1:3], 10)

NA	EGFP.A	CD43.PE.A	CD45.PE.Vio770.A
1 Population	330.2905	455.41799	415.470631
9 Population	245.2999	373.31863	369.947443
2 Population	336.9386	668.09005	504.933986
8 Population	184.0371	10.08230	296.419159
5 Population	156.3254	318.49210	315.630029
6 Population	235.1195	346.99711	9.618433
7 Population	186.1079	220.17015	282.676632
3 Population	256.1906	389.79597	382.117763
4 Population	160.1749	27.24781	399.664723
10 Population	201.5578	248.63361	9.385064

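The data frame of averages, with one value per population for each column header (marker), is exactly what I want... but it is extremely slow... I mean brutally slow. This is my first week using R (I taught myself Python from Stack), so please explain thoroughly. Thank you for your help.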

Answer 1

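It is not entirely clear what exactly you are trying to achieve, and the sample data is too sparse to help disambiguate, but here are my two guesses:

Mean of each marker within each population

This interpretation is most consistent with your sample output, where each population (cluster) appears only once, as if the data were aggregated by population.

In R it is quite simple to group the data and then summarize it with an aggregate function.

Solution 1.1: `dplyr`

Here is a solution with the dplyr package, whose syntax is intuitive: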

library(dplyr)

data_clean %>%
  # Overwrite the 'NA' column with the cluster labels.
  mutate(`NA` = metaClustering_perCell) %>%
  # Group by cluster labels...
  group_by(`NA`) %>%
  # ...and summarize the average of each marker (column).
  summarize(across(everything(), mean))

Solution 1.2: `data.table`

Here is a solution with data.table, which generally performs better on large datasets.

library(data.table)

as.data.table(data_clean)[,
  # Overwrite the 'NA' column with the cluster labels.
  ("NA") := metaClustering_perCell
][,
  # Summarize the average of each marker (column), as grouped by cluster.
  lapply(.SD, mean), by = `NA`
]

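Results

Let data_clean and metaClustering_perCell take the sample values from your question.

While the first result (1.1) will be a tibble and the second (1.2) a data.table, each will contain the following data: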

          NA   EGFP.A CD43.PE.A CD45.PE.Vio770.A CD235a_41a.APC.A APC.Vio770.A
1 Population 278.6667  390.2222         384.8889         426.7778     417.3333
9 Population 328.0000  358.0000         430.0000         265.0000       0.0000
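
For comparison with the loop in the question, here is a minimal base R sketch of the same sum, count, and divide steps done without a loop. It is not one of the two solutions above. It assumes the index column `NA` has been dropped and, for brevity, uses only two of the markers from the sample data. `rowsum()` and `table()` are base R functions; the names `cluster_sums`, `cluster_counts`, and `cluster_means` are made up for this illustration.

```r
# Sample values from the question, restricted to two markers for brevity.
# Assumption: the index column `NA` is left out here.
data_clean <- data.frame(
  EGFP.A    = c(326, 314, 341, 0, 198, 295, 325, 309, 400, 328),
  CD43.PE.A = c(435, 402, 469, 283, 303, 371, 442, 363, 444, 358)
)
metaClustering_perCell <- c(rep("1 Population", 9), "9 Population")

# Per-cluster column sums (the role df_sum plays in the question).
cluster_sums <- rowsum(as.matrix(data_clean), group = metaClustering_perCell)

# Per-cluster cell counts (the role of df_count), aligned to the rows above.
cluster_counts <- as.vector(
  table(metaClustering_perCell)[rownames(cluster_sums)]
)

# Dividing a matrix by a vector recycles down the columns,
# so row i of the sums is divided by count i.
cluster_means <- cluster_sums / cluster_counts
print(cluster_means)
```

The row for "1 Population" should agree with the EGFP.A and CD43.PE.A values in the table above.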

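Cumulative mean for each observation

This interpretation is most consistent with your algorithm, which appears to compute its metrics (mean, etc.) for each observation (row) on a running basis.

R also makes cumulative means, sums, and so on easy. Using vectorized operations is far more efficient than computing these metrics iteratively for each row (with loops, the *apply() family, etc.).

Solution 2.1: `dplyr`

Conveniently, dplyr already has its own cummean() function.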
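
As a small illustration of the difference, the sketch below computes the same running mean twice on the first five EGFP.A values from the question: once with an explicit loop and once with the base R functions `cumsum()` and `seq_along()`. The variable names are made up for this example, and it makes no timing claim; it only shows that the two forms agree.

```r
# First five EGFP.A values from the question's sample data.
x <- c(326, 314, 341, 0, 198)

# Loop version: update a running total one element at a time.
running_mean_loop <- numeric(length(x))
total <- 0
for (i in seq_along(x)) {
  total <- total + x[i]
  running_mean_loop[i] <- total / i
}

# Vectorized version: running sum divided by running count.
running_mean_vec <- cumsum(x) / seq_along(x)

print(running_mean_vec)
# Both approaches give the same values.
stopifnot(isTRUE(all.equal(running_mean_loop, running_mean_vec)))
```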

library(dplyr)

data_clean %>%
  # Overwrite the 'NA' column with the cluster labels.
  mutate(`NA` = metaClustering_perCell) %>%
  # Group by cluster labels...
  group_by(`NA`) %>%
  # ...and overwrite each marker (column) with its running average.
  mutate(across(everything(), cummean))

Solution 2.2: `data.table`

With data.table we can improvise our own (anonymous) function

function(x) {
  cumsum(x) / seq_along(x)
}

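which divides the running sum by the running count to compute the cumulative mean along a vector (column). Alternatively, we could load dplyr and use cummean in place of our function.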

library(data.table)

as.data.table(data_clean)[,
  # Overwrite the 'NA' column with the cluster labels.
  ("NA") := metaClustering_perCell
][,
  # Overwrite each marker (column) with its running average, as grouped by cluster.
  lapply(.SD, function(x)cumsum(x)/seq_along(x)), by = `NA`
]

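Results

Let data_clean and metaClustering_perCell take the sample values from your question.

While the first result (2.1) will be a tibble and the second (2.2) a data.table, each will contain the following data: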

          NA   EGFP.A CD43.PE.A CD45.PE.Vio770.A CD235a_41a.APC.A APC.Vio770.A
1 Population 326.0000  435.0000         399.0000         412.0000     447.0000
1 Population 320.0000  418.5000         392.0000         515.0000     469.0000
1 Population 327.0000  435.3333         387.6667         423.0000     460.6667
1 Population 245.2500  397.2500         400.2500         457.7500     454.7500
1 Population 235.8000  378.4000         397.0000         498.4000     459.2000
1 Population 245.6667  377.1667         386.0000         447.5000     437.3333
1 Population 257.0000  386.4286         388.2857         435.4286     439.5714
1 Population 263.5000  383.5000         388.7500         429.1250     425.3750
1 Population 278.6667  390.2222         384.8889         426.7778     417.3333
9 Population 328.0000  358.0000         430.0000         265.0000       0.0000
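
If neither package is available, a grouped running mean can also be sketched in base R with `ave()`, which applies a function within each group and returns a vector of the same length as its input. This is an alternative illustration rather than part of the solutions above. It uses only the EGFP.A column from the sample data, and the names `egfp`, `cluster`, and `running` are hypothetical.

```r
# EGFP.A values and cluster labels from the question's sample data.
egfp    <- c(326, 314, 341, 0, 198, 295, 325, 309, 400, 328)
cluster <- c(rep("1 Population", 9), "9 Population")

# Within each cluster, divide the running sum by the running count.
running <- ave(egfp, cluster, FUN = function(x) cumsum(x) / seq_along(x))

print(data.frame(cluster, running))
```

The `running` column should match the EGFP.A column of the table above, restarting at the first row of each cluster.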

Solution 1
0 2022-01-12 01:25:27