Quantifying the frequency of codons in transmembrane sequences - applying a function?

Question

I am trying to look at the codon usage within the transmembrane domains of certain proteins.

To do this, I have the sequences for the TM domain, and I want to count how often certain codons appear in these sequences (their frequency).

Ideally I would like to add new columns to an existing data frame with the counts for each codon per gene, like this hypothetical data:

Gene ID	TM_domain_Seq	GGA
ENSG00000003989	TGGAGCCTCGCTC	1
ENSG00000003989	TGGAGCCTCGCTC	1
ENSG00000003989	TGGAGCCTCGCTC	1
ENSG00000003989	TGGAGCCTCGCTC	1
ENSG00000003989	TGGAGCCTCGCTC	1

I have tried the following: creating a function to count how often a particular codon comes up, and applying it to each TM sequence. The problem I am having is how to add a new column to my data frame for each codon, and how to get the codon frequencies into those columns.

I have tried for loops, but they take way too long.

amino_search <- function(seq) {
  
  count <- str_count(seq, pattern = codons)
  return(count)
}

codon_search <- function(TMseq) {
  
 High_cor$Newcol <- unlist(lapply(TMseq, amino_search))
}

Any help would be greatly appreciated. Thank you!

Answer 1

Create the vector of all possible three-base combinations, then use str_count:

comb <- expand.grid(replicate(3, c("A", "T", "G", "C"), simplify = FALSE)) |>
  apply(MARGIN = 1, FUN = paste, collapse = "")
  #apply(X = _, 1, FUN = paste, collapse = "") #with the new placeholder

df[, comb] <- t(sapply(df$TM_domain_Seq, stringr::str_count, comb))

If you want only in-frame codons, one way to do that is to add a space every three characters:

gsub('(.{3})', '\\1 ', df$TM_domain_Seq[1])
#[1] "TGG AGC CTC GCT C"

df[, comb] <- t(sapply(gsub('(.{3})', '\\1 ', df$TM_domain_Seq), stringr::str_count, comb))

Output

# A tibble: 5 × 66
  Gene_ID TM_domain_Seq   AAA   CAC   GGA   TAA   GAA   CAA   ATA   TTA   GTA   CTA   AGA   TGA   CGA   ACA   TCA
  <chr>   <chr>         <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 ENSG00… TGGAGCCTCGCTC     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0
2 ENSG00… TGGAGCCTCGCTC     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0
3 ENSG00… TGGAGCCTCGCTC     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0
4 ENSG00… TGGAGCCTCGCTC     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0
5 ENSG00… TGGAGCCTCGCTC     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0
# … with 49 more variables: GCA <int>, CCA <int>, AAT <int>, TAT <int>, GAT <int>, CAT <int>, ATT <int>,
#   TTT <int>, GTT <int>, CTT <int>, AGT <int>, TGT <int>, GGT <int>, CGT <int>, ACT <int>, TCT <int>,
#   GCT <int>, CCT <int>, AAG <int>, TAG <int>, GAG <int>, CAG <int>, ATG <int>, TTG <int>, GTG <int>,
#   CTG <int>, AGG <int>, TGG <int>, GGG <int>, CGG <int>, ACG <int>, TCG <int>, GCG <int>, CCG <int>,
#   AAC <int>, TAC <int>, GAC <int>, ATC <int>, TTC <int>, GTC <int>, CTC <int>, AGC <int>, TGC <int>,
#   GGC <int>, CGC <int>, ACC <int>, TCC <int>, GCC <int>, CCC <int>
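As a complement to the space-insertion trick, the in-frame counts can also be built without editing the strings. The following is only a minimal base-R sketch: the helpers split_codons and count_in_frame and the toy data frame are hypothetical names made up for illustration, and it assumes every sequence starts on a complete codon.

```r
# All 64 codons, built with base R only.
bases <- c("A", "C", "G", "T")
all_codons <- unname(apply(expand.grid(bases, bases, bases), 1, paste, collapse = ""))

# Split one sequence into in-frame codons, dropping a trailing partial codon.
split_codons <- function(s) {
  n <- nchar(s) %/% 3L
  if (n == 0L) return(character(0))
  starts <- seq.int(1L, by = 3L, length.out = n)
  substring(s, starts, starts + 2L)
}

# One row per sequence, one column per codon.
count_in_frame <- function(seqs) {
  counts <- vapply(
    seqs,
    function(s) as.integer(table(factor(split_codons(s), levels = all_codons))),
    integer(length(all_codons))
  )
  counts <- t(counts)
  dimnames(counts) <- list(NULL, all_codons)
  counts
}

# Hypothetical toy data, not the asker's real table.
toy <- data.frame(
  Gene_ID = c("g1", "g2"),
  TM_domain_Seq = c("TGGAGCCTCGCTC", "GGAGGATT"),
  stringsAsFactors = FALSE
)
toy <- cbind(toy, count_in_frame(toy$TM_domain_Seq))
toy[, c("Gene_ID", "TGG", "GGA")]
```

Note that the first toy sequence contains GGA only out of frame (T-GGA-GCC...), so its in-frame GGA count is 0, whereas the plain str_count call reports 1 for it.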

Answer 2

Split the problem into sub-problems, solve them individually, and compose the solution.

The first subproblem is: how do I get codon frequencies of a given (in-frame) sequence? The answer is either to use a pre-made solution (e.g. Bioconductor's Biostrings::trinucleotideFrequency(…, step = 3L)), or something quick and dirty like the following:

codon_frequencies = function (seq) {
    # Take care of incomplete codon at end.
    len = nchar(seq) - (nchar(seq) %% 3L)
    start = seq(1L, len, by = 3L)
    substring(seq, start, start + 2L) |> table()
}

Try it:

codon_frequencies('TGGAGCCTCGCTC')
#
# AGC CTC GCT TGG
#   1   1   1   1
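For the pre-made route mentioned earlier, a minimal sketch might look like the following. It assumes the Bioconductor package Biostrings is installed (the code is skipped otherwise), and the two toy sequences are made up for illustration.

```r
# Assumption: Biostrings is installed; nothing runs if it is missing.
if (requireNamespace("Biostrings", quietly = TRUE)) {
  seqs <- Biostrings::DNAStringSet(c("TGGAGCCTCGCTC", "GGAGGATT"))
  # step = 3 moves the counting window one codon at a time, so only
  # trinucleotides in the reading frame starting at position 1 are counted.
  counts <- Biostrings::trinucleotideFrequency(seqs, step = 3)
  # One row per sequence, one column per trinucleotide.
  counts[, c("TGG", "GGA")]
}
```

The result is an integer matrix, so it can be bound onto an existing data frame with cbind().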

… incidentally, is it intentional that your sequences have fragmentary codons? If so, are you sure they always start on a full codon?
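A quick way to see how many sequences end on a partial codon is to look at the length modulo 3. This is just a sketch on made-up sequences; it cannot tell you whether a sequence starts in frame, only whether its length is a multiple of three.

```r
# Hypothetical example sequences.
seqs <- c("TGGAGCCTCGCTC", "TGGAGCCTC")

# Number of bases left over after the last complete codon.
leftover <- nchar(seqs) %% 3L

data.frame(seq = seqs, length = nchar(seqs), leftover_bases = leftover)
```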

OK. The next step is calling this function for each gene ID in your table, and collecting the results. At this point, we're helped by the fact that a counts table can be converted to a tidy data frame:

data.frame(codon_frequencies('TGGAGCCTCGCTC'))
#   Var1 Freq
# 1  AGC    1
# 2  CTC    1
# 3  GCT    1
# 4  TGG    1

For our purposes, this is a convenient format, because it makes table manipulation easier (especially when working in tidy data format, which I'm doing in the following using the packages 'dplyr', 'tidyr' and 'purrr'):

df |>
    group_by(`Gene ID`) |>
    summarize(map_dfr(TM_domain_Seq, ~ data.frame(codon_frequencies(.x))))
# # A tibble: 20 × 3
# # Groups:   Gene ID [1]
#    `Gene ID`       Var1   Freq
#    <chr>           <fct> <int>
#  1 ENSG00000003989 AGC       1
#  2 ENSG00000003989 CTC       1
#  3 ENSG00000003989 GCT       1
#  4 ENSG00000003989 TGG       1
# …

At this point we could probably call it a day: this is a convenient format to work with. However, if you prefer, you can also pivot the data into wide format:

    … |>
    pivot_wider(
        id_cols = `Gene ID`,
        names_from = Var1,
        values_from = Freq,
        values_fill = 0L # Otherwise missing codons will be `NA`
    )
# # A tibble: 5 × 11
# # Groups:   Gene ID [5]
#   `Gene ID`         AGC   CTC   GCA   TGG   TGA   GCG   AGT   GAT   TAC   GCT
#   <chr>           <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 ENSG00000003981     1     1     1     1     0     0     0     0     0     0
# 2 ENSG00000003982     1     1     0     1     1     0     0     0     0     0
# 3 ENSG00000003983     1     1     0     1     0     1     0     0     0     0
# 4 ENSG00000003984     0     0     0     0     0     0     1     1     1     0
# 5 ENSG00000003989     1     1     0     1     0     0     0     0     0     1

(This is using some different toy data.)

Finally, if you want to have columns for all codons, even those not present in your data, you can make a small modification to the codon_frequencies function:

all_codons = c('A', 'C', 'G', 'T') %>% expand.grid(., ., .) |> apply(1L, paste, collapse = '')

codon_frequencies = function (seq, all = FALSE) {
    # Take care of incomplete codon at end.
    len = nchar(seq) - (nchar(seq) %% 3L)
    start = seq(1L, len, by = 3L)
    codons = substring(seq, start, start + 2L)
    table(if (all) factor(codons, levels = all_codons) else codons)
}

And then call it as codon_frequencies(.x, all = TRUE) in the code above. The pivot_wider call then no longer needs the values_fill = 0L argument.

Putting it all together:

df |>
    group_by(`Gene ID`) |>
    summarize(
        map_dfr(TM_domain_Seq, ~ data.frame(codon_frequencies(.x, all = TRUE))),
        .groups = 'drop'
    ) |>
    pivot_wider(
        id_cols = `Gene ID`,
        names_from = Var1,
        values_from = Freq
    )
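One case worth checking is a gene ID that appears in several rows, for example a gene with more than one TM domain. Each gene/codon pair then has several counts, and pivot_wider needs to be told how to combine them. The sketch below sums them via values_fn = sum; whether summing is what you want is an assumption on my part. The toy data and the simplified codon_frequencies (always using all 64 levels) are made up for illustration, and unnest() is used here in place of the map_dfr() step.

```r
library(dplyr)
library(tidyr)

bases <- c("A", "C", "G", "T")
all_codons <- unname(apply(expand.grid(bases, bases, bases), 1, paste, collapse = ""))

# Simplified variant of the function above: always reports all 64 codons.
codon_frequencies <- function(s) {
  n <- nchar(s) %/% 3L
  starts <- seq.int(1L, by = 3L, length.out = n)
  table(factor(substring(s, starts, starts + 2L), levels = all_codons))
}

# Hypothetical toy data: geneA has two TM domain sequences.
toy <- data.frame(
  `Gene ID` = c("geneA", "geneA", "geneB"),
  TM_domain_Seq = c("TGGAGCCTCGCTC", "TGGTGG", "GCTGCT"),
  check.names = FALSE
)

wide <- toy |>
  # One small Var1/Freq data frame per sequence, stored in a list column.
  mutate(freq = lapply(TM_domain_Seq, function(s) as.data.frame(codon_frequencies(s)))) |>
  unnest(freq) |>
  pivot_wider(
    id_cols = `Gene ID`,
    names_from = Var1,
    values_from = Freq,
    values_fn = sum # add up counts over all sequences of the same gene
  )

wide[, c("Gene ID", "TGG", "GCT")]
```

If you would rather keep one row per TM domain, add a per-row identifier column to id_cols instead of aggregating.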


Answer 1
3 2022-05-17 13:05:31

Answer 2
1 Accepted 2022-05-17 14:24:50
