Regular expression to find certain bases in a sequence

Question

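In my code, I am trying to clean up a FastA file so that the output string contains only the letters A, C, T, G, N, and U. I am trying to do this with a regular expression, which looks like this: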

newFastA = re.findall(r'A,T,G,C,U,N', self.fastAsequence)  # trying to extract all of the listed bases from my fastAsequence
print(newFastA)

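However, I do not get every occurrence of the bases, in order. I think my regular expression is malformed, so it would be great if you could tell me what mistake I made.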

Answer 1

I would avoid regular expressions entirely here. You can use str.translate to delete the unwanted characters.

from string import ascii_letters

removechars = ''.join(set(ascii_letters) - set('ACTGNU'))

newFastA = self.fastAsequence.translate(None, removechars)

Demo:

dna = 'ACTAGAGAUACCACG this will be removed GNUGNUGNU'

dna.translate(None, removechars)
Out[6]: 'ACTAGAGAUACCACG     GNUGNUGNU'

If you also want to remove whitespace, you can add string.whitespace to removechars.

Side note: the above only works in Python 2. In Python 3 there is one additional step:

from string import ascii_letters, punctuation, whitespace

#showing how to remove whitespace and punctuation too in this example
removechars = ''.join(set(ascii_letters + punctuation + whitespace) - set('ACTGNU'))

trans = str.maketrans('', '', removechars)

dna.translate(trans)
Out[11]: 'ACTAGAGAUACCACGGNUGNUGNU'
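
One thing to watch with a fixed removechars string: characters that are not in it, such as digits, survive the translation. A minimal Python 3 sketch that sidesteps this by building the deletion table from the characters actually present in the input is shown here. The helper name keep_bases and the sample string are assumptions made for illustration, not part of the original answer.

```python
def keep_bases(sequence, allowed="ACTGNU"):
    """Return sequence with every character outside `allowed` deleted.

    keep_bases is a hypothetical helper name used only for this sketch.
    """
    # Every distinct character in the input that is not an allowed base.
    unwanted = "".join(set(sequence) - set(allowed))
    # The third argument of str.maketrans lists characters to delete.
    return sequence.translate(str.maketrans("", "", unwanted))


# Sample input (assumed): contains digits, lowercase text and a newline.
dna = "ACTAGAGAUACCACG 123 this will be removed\nGNUGNUGNU"
result = keep_bases(dna)
```

Because the table is derived from the input, there is no need to enumerate letters, punctuation, and whitespace up front.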

Answer 2

import re
print(re.sub("[^ACTGNU]", "", fastA_string))

(along with the million other answers you are going to get)

Or, without re:

print("".join(filter(lambda character: character in set("ACTGUN"), fastA_string)))
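
For comparison, here is a small Python 3 sketch of the same two ideas side by side. The sample value of fastA_string is an assumption. The set of allowed bases is built once, outside the filter, so it is not rebuilt for every character.

```python
import re

# Sample input (assumed for illustration).
fastA_string = "ACTAGAGAUACCACG this will be removed GNUGNUGNU"

# Build the lookup set a single time.
allowed = set("ACTGNU")

# Regex approach: delete every character that is NOT one of the bases.
by_regex = re.sub(r"[^ACTGNU]", "", fastA_string)

# filter approach: keep only characters that are members of the set.
by_filter = "".join(filter(allowed.__contains__, fastA_string))

# Equivalent generator expression, which some readers find clearer.
by_genexp = "".join(ch for ch in fastA_string if ch in allowed)
```

All three produce the same cleaned string, so the choice is mostly a matter of readability.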

Answer 3

You need to use a character set.

re.findall(r"[ATGCUN]", self.fastAsequence)

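Your code searches for the literal string "A,T,G,C,U,N" and returns every occurrence of that exact text. A character set in a regular expression lets you search for "any one of the following: A, T, G, C, U, N" rather than "exactly the following: A,T,G,C,U,N".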
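
The difference can be seen in a short sketch. The sample sequence is made up for illustration, and joining the matches into one string is an assumed final step for getting a cleaned sequence rather than a list.

```python
import re

# Sample input (assumed): bases, junk text, and the literal comma-separated string.
sequence = "ACGT,xyz A,T,G,C,U,N NNUU"

# The original pattern is a single literal string, commas included.
literal_hits = re.findall(r"A,T,G,C,U,N", sequence)

# A character set matches any ONE of the listed characters at a time.
class_hits = re.findall(r"[ATGCUN]", sequence)

# findall returns a list of one-character strings; join them into a sequence.
cleaned = "".join(class_hits)
```

Here literal_hits contains only the one exact occurrence of "A,T,G,C,U,N", while class_hits holds every individual base in order.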


3 answers

Answer 1
2 2014-05-02 21:18:17

Answer 2
2 2014-05-02 21:21:04

Answer 3
1 accepted 2014-05-02 21:05:56
