Summarizing specific words in a data table in R

Question

df <- data.frame(
  "Domain" = c("Euka"),
  "Kingdom" = c("An","Plan"),
  "Division" = c("20181121","20181128","20181203"),
  "Species" = c("20181115_AG25_MAGH_50_A05_CGT.TXT","20181122_AG25_MAGH_50_C05_CGT.ARR",
                "20181115_AG25_MAGH_50_G05_CGT.TXT","20181124_AG25_MAGH_50_G45_CGT.TXT",
                "20181204_AG25_MAGH_50_G05_CGT.ARR","20181205_AG25_MAGH_50_G45_CGT.TXT",
                "20181207_AG25_MAGH_50_T05_CGT.ARR","20181215_AG25_MAGH_50_F45_CGT.TXT",
                "20181223_AG25_MAGH_50_R07_CGT.GGI","20181225_TW77_MAGH_33_L06_CGT.ARR",
                "20181226_TW77_MAGH_33_S07_CGT.ARR","20181227_TW77_MAGH_33_C06_CGT.TXT")
)

I want to summarize it like this:

Division    20181121  20181128  20181203
Total_TXT          2         0         3
Total_ARR          2         3         0
Total_GGI          0         0         1

How can I achieve this in R? Thanks.

Answer 1

Here is a tidyverse option: we use count to get the total for each group, then reshape the result to wide format with pivot_wider.

library(tidyverse)

df %>% 
  group_by(gr = Division) %>% 
  count(Division = str_replace_all(Species, '.*\\.', '')) %>% 
  pivot_wider(names_from = "gr", values_from = "n", values_fill = 0) %>% 
  mutate(Division = paste0("Total_", Division))

Output

  Division  `20181121` `20181128` `20181203`
  <chr>          <int>      <int>      <int>
1 Total_ARR          2          3          0
2 Total_TXT          2          1          3
3 Total_GGI          0          0          1

Or here is a data.table option:

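library(data.table)
library(stringr)  # provides str_replace_all

# Count rows per Division (kept as cn) and per file extension.
# The result goes into a new object so the original df is not overwritten.
counts <-
  setDT(df)[, .N, by = .(cn = Division, Division = str_replace_all(Species, '.*\\.', ''))]

# Reshape to wide format: one row per extension, one column per Division
dcast(counts,
      paste0("Total_", Division) ~ cn,
      value.var = "N",
      fill = 0)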

Answer 2

We need to extract the last three characters from Species:

x <- nchar(df$Species)
rowlbl <- substr(df$Species, x-2, x)
table(rowlbl, df$Division)
# rowlbl 20181121 20181128 20181203
#    ARR        2        3        0
#    GGI        0        0        1
#    TXT        2        1        3
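One caveat: taking the last three characters only works while every extension is exactly three characters long, as it is in the question's data. A minimal sketch with made-up file names (hypothetical, not from the question) shows what happens otherwise, compared with taking everything after the last dot via tools::file_ext:

```r
# Hypothetical file names, not from the question's data
files <- c("20181115_A05_CGT.TXT", "20181122_C05_CGT.json", "20181123_D05_CGT.gz")

# Last three characters, as in the answer above
n <- nchar(files)
last3 <- substr(files, n - 2, n)

# Everything after the last dot
ext <- tools::file_ext(files)

print(last3)  # "TXT" "son" ".gz"
print(ext)    # "TXT" "json" "gz"
```

For the data in the question both approaches give the same labels.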

Answer 3

Base R one-liner:

table(sub('.*\\.', '', df$Species), df$Division)

#     20181121 20181128 20181203
#  ARR        2        3        0
#  GGI        0        0        1
#  TXT        2        1        3

Explanation:

sub removes everything up to and including the last ".", returning

sub('.*\\.', '', df$Species)
#[1] "TXT" "ARR" "TXT" "TXT" "ARR" "TXT" "ARR" "TXT" "GGI" "ARR" "ARR" "TXT"

This is then used in table with Division values.
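The pattern reaches the last dot rather than the first because `.*` is greedy. A small sketch with made-up strings (hypothetical, not from the question) illustrates this, along with the fact that a string without any dot comes back unchanged:

```r
# Hypothetical inputs: several dots, one dot, and no dot at all
x <- c("a.b.TXT", "file.ARR", "README")

# Greedy '.*' consumes as much as possible, so '\\.' matches the LAST dot
res <- sub('.*\\.', '', x)

print(res)  # "TXT" "ARR" "README"
```

If some values might lack an extension, it may be worth checking for that before tabulating, since the whole string would otherwise become its own category.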

sub can also be replaced with tools::file_ext for a non-regex approach.

table(tools::file_ext(df$Species), df$Division)
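If the "Total_" row labels from the question are wanted, one possible sketch is to paste the prefix onto the extension before tabulating. The dimension names `Total` and `Division` passed to table are my own choice here, not something the question requires:

```r
# Data from the question; shorter vectors are recycled to 12 rows
df <- data.frame(
  Domain = "Euka",
  Kingdom = c("An", "Plan"),
  Division = c("20181121", "20181128", "20181203"),
  Species = c("20181115_AG25_MAGH_50_A05_CGT.TXT", "20181122_AG25_MAGH_50_C05_CGT.ARR",
              "20181115_AG25_MAGH_50_G05_CGT.TXT", "20181124_AG25_MAGH_50_G45_CGT.TXT",
              "20181204_AG25_MAGH_50_G05_CGT.ARR", "20181205_AG25_MAGH_50_G45_CGT.TXT",
              "20181207_AG25_MAGH_50_T05_CGT.ARR", "20181215_AG25_MAGH_50_F45_CGT.TXT",
              "20181223_AG25_MAGH_50_R07_CGT.GGI", "20181225_TW77_MAGH_33_L06_CGT.ARR",
              "20181226_TW77_MAGH_33_S07_CGT.ARR", "20181227_TW77_MAGH_33_C06_CGT.TXT")
)

# Prefix each extension with "Total_" and cross-tabulate against Division
ext <- tools::file_ext(df$Species)
tab <- table(Total = paste0("Total_", ext), Division = df$Division)

print(tab)
```

The result is a table object; as.data.frame.matrix(tab) should turn it into a regular data frame if that is easier to work with downstream.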
