How do I split a string into characters and replace the characters with float values to find the sum of the original string in Python?

[Image: peptide example]

Hi,

Python newbie here.

I have >10,000 strings that represent peptide sequences. Each letter in the string is an amino acid, and I would like to calculate the "net sum" of the string after I have replaced each letter with a pre-defined float value (ranging from -1 to -2).

I am stuck on where to start with the loop to make this work. I have code that cleans the strings so that non-alphabetical characters are removed; each remaining letter then needs to be replaced with a float value defined in a dictionary (e.g. W: 2.10, G: -1.0).

[Image: cleaned peptides, truncated to 5 characters]

I imagine the code is something like.

I have 6 dataframes to repeat this process in.

Any help would be immensely appreciated!

Updated Code (THIS WORKS THANKS TO SARAH MESSER)

def hydrophobicity_score(peptide):
    hydro = { 
        'A': -0.5,
        'C': -1.0,
        'D': 3.0,
        'E': 3.0,
        'F': -2.5,
        'G': 0.0,
        'H': -0.5,
        'I': -1.8,
        'K': 3.0,
        'L': -1.8,
        'M': -1.3,
        'N': 0.2,
        'P': 0.0,
        'Q': 0.2,
        'R': 3.0,
        'S': 0.3,
        'T': -0.4,
        'V': -1.5,
        'W': -3.4,
        'Y': -2.3,
    }
    hydro_score = [hydro.get(aa, 0.0) for aa in peptide]
    return sum(hydro_score)

og_pep['Hydro'] = og_pep['Peptide'].apply(hydrophobicity_score)
og_pep

Okay, first up, you don't want to loop over the rows in a dataframe. The rows are designed to be processed in parallel. Getting your head around that is a bit of a stretch, but once you've defined a few row-level operations and applied them to large dataframes, it'll get smoother. (The problem with looping over rows is one of speed. It's sometimes useful in debugging or toy problems, but modern computing hardware tries to parallelize computations as much as possible. Dataframes take advantage of that to process all the rows at once, rather than handling them individually in a loop.)

To do the conversion, you're going to need to define a custom function to operate on each individual row. Then you pass that custom function to the dataframe and tell it to apply that row-level function to one column in order to generate a new column.

So here's a possible function to get you started:

def peptide_score(peptide_string):
    '''Returns a numerical score given a sequence of peptide characters.'''
    # Replace the values in this dict (dictionary / map) with whatever values you need
    amino_acid_scores = { 
        'A': 0.1,
        'C': 1.4,
        'G': 0.32342,
        'T': -0.23,
        'U': 74.22
    }
    # This is called a "list comprehension." It's great for transforming sequences.
    score_list = [amino_acid_scores[character] for character in peptide_string]
    return sum(score_list)

# I'm assuming your pre-existing dataframe is called "gluc_dataframe" and that the
# column with your strings is called "Peptide".  Output scores will be stored in a new
# column, "score". Replace those names with whatever fits.
gluc_dataframe['score'] = gluc_dataframe['Peptide'].apply(peptide_score)

If you've got a lot of characters you want to ignore (whitespace, punctuation, whatever), you can replace amino_acid_scores[character] in the list comprehension with amino_acid_scores.get(character, 0.0).
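As a small illustration of that difference, here is a sketch with made-up scores (not real hydrophobicity values). The names score_strict and score_lenient are hypothetical and only exist to put the two lookups side by side.

```python
# Illustrative scores only; substitute your own table.
amino_acid_scores = {'A': 0.5, 'C': 1.5, 'G': -1.0}

def score_strict(peptide_string):
    """Raises KeyError on any character missing from the dict."""
    return sum(amino_acid_scores[ch] for ch in peptide_string)

def score_lenient(peptide_string):
    """Treats any character missing from the dict as contributing 0.0."""
    return sum(amino_acid_scores.get(ch, 0.0) for ch in peptide_string)

# '-' and ' ' are not in the dict, so the lenient version skips them.
print(score_lenient("A-C G"))

try:
    score_strict("A-C G")
except KeyError as err:
    print("unknown character:", err)
```

The lenient version is convenient for messy input, but it also hides typos in the sequence silently, so the strict version can be the safer choice once the strings have been cleaned.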

def hydrophobicity_score(peptide):
    hydro = {
        'A': -0.5,
        'C': -1.0,
        'D': 3.0,
        'E': 3.0,
        'F': -2.5,
        'G': 0.0,
        'H': -0.5,
        'I': -1.8,
        'K': 3.0,
        'L': -1.8,
        'M': -1.3,
        'N': 0.2,
        'P': 0.0,
        'Q': 0.2,
        'R': 3.0,
        'S': 0.3,
        'T': -0.4,
        'V': -1.5,
        'W': -3.4,
        'Y': -2.3,
    }
    hydro_score = [hydro[aa] for aa in peptide]
    return sum(hydro_score)

og_peptide = og_pep['Peptide']
og_peptide = og_peptide.str.replace('\W+','')
og_peptide = og_peptide.str.replace('\d+','')
og_peptide = pd.DataFrame(og_peptide)
og_peptide['Hydro_Score'] = og_peptide.apply(hydrophobicity_score)
og_peptide

I am not getting the expected output.

[Image: output]

Here is the og_pep DataFrame.
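One plausible reading of the problem, offered as an assumption since the actual output is not shown here: calling apply on a whole DataFrame hands the function entire columns rather than individual strings, so the scoring function never sees one peptide at a time. Applying it to the Peptide column (a Series), as in og_pep['Peptide'].apply(...), passes one string per call. Separately, recent pandas versions treat the pattern in str.replace literally unless regex=True is passed, so the two cleaning calls may not be removing anything. The sketch below sidesteps both by cleaning and scoring each string in plain Python. The name clean_and_score is hypothetical, and the three scores are copied from the table above for brevity.

```python
import re

# Subset of the hydrophobicity table above, for brevity.
hydro = {'A': -0.5, 'D': 3.0, 'G': 0.0}

def clean_and_score(raw_peptide):
    """Strip everything except letters from one peptide string, then sum its scores."""
    cleaned = re.sub(r'[^A-Za-z]', '', raw_peptide)
    # Letters missing from the table contribute 0.0 instead of raising KeyError.
    return sum(hydro.get(aa, 0.0) for aa in cleaned)

print(clean_and_score("A.D(+15.99)G"))

# With pandas, the same function would be applied to the column, not the frame:
# og_pep['Hydro_Score'] = og_pep['Peptide'].apply(clean_and_score)
```

Keeping the cleaning inside the per-string function means the same call can be reused unchanged on each of the six dataframes.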
