Python; DNA Sequence to ASCII Text

My aim is to discover a piece of text hidden as 8-bit ASCII in a very long (>115,000) sequence of DNA.

I've written code to open the file containing the DNA and convert all C's and A's to 0 and all T's and G's to 1. I've then converted this string into ASCII characters. Below is my code.

with open("DNAseq.txt") as mydnaseq:
    sequence = mydnaseq.read().replace('\n','')

DNAa = sequence.replace('A','0').replace('C','0').replace('G','1').replace('T','1')
DNAb = ''.join(DNAa)

DNAc = [DNAb[i:i+8] for i in range(0, len(DNAb), 8)]

DNAd = []
for i in DNAc:
    j = int(i,2)
    DNAd.append(j)


DNA1 = []
for i in DNAd:
    if i >= 32 and i <=127:
        DNA1.append(i)

text = []
for i in DNAd:
    j = chr(i)
    text.append(j)

Answer = open("textanswer.txt", 'w')
Answer.writelines(text)
Answer.close()

However, I am getting an error:

UnicodeEncodeError: 'charmap' codec can't encode character '\x9e' in position 0: character maps to <undefined>

And I have no clue what this could be. My DNA sequence apparently consists mostly of random characters, with a snippet of a play/poem hidden inside.

I've tested my code with testDNA.txt containing the following:

ATAGCCTTAGTGCTACATTAATCGCGGACAAGAGGAGCT
TAAGCCCACCGACCCGAAGGAAACTCGGAGATTCGGAAGCG

This returns (as expected):

Steak Bake

Can anyone shed any light on why I'm getting this error with my DNA sequence?

As I mentioned in the comments, DNAd contains numbers outside the valid ASCII range. But you already filtered those out when you created DNA1, so you should be looping over DNA1 to build text.
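To make that concrete, here is a minimal sketch with made-up code values (the list `codes` is hypothetical, not taken from the real file). It assumes the output file was being written with a legacy Windows codec such as cp1252, which is what the 'charmap' wording in the error suggests.

```python
# Hypothetical codes, as they might come out of the 8-bit chunks.
# 158 (0x9e) is outside ASCII and 7 is a control character.
codes = [83, 158, 116, 7, 101]

# chr(158) is a valid str character, but cp1252 (assumed here to be
# the default file encoding) has no byte that represents it.
try:
    chr(158).encode('cp1252')
    encodable = True
except UnicodeEncodeError:
    encodable = False

# Build the text from the filtered codes only, as DNA1 is meant to do.
printable = [c for c in codes if 32 <= c <= 126]
text = ''.join(chr(c) for c in printable)

print(encodable, text)  # False Ste
```

Note that the sketch stops at 126, since 127 is the DEL control character; the question's filter allows 127 through.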

However, in Python 3 there's no need to call the chr function on each ASCII code number. You can simply pass a list (or any other iterable) of integers to the bytes constructor and it will build a bytes string, which you can then decode to Unicode text.
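As a quick illustration of the bytes constructor (the code values below are just the ASCII codes for "Steak", chosen for the example):

```python
# ASCII codes for the letters of "Steak"
codes = [83, 116, 101, 97, 107]

raw = bytes(codes)             # one byte per integer in range(256)
decoded = raw.decode('ascii')  # bytes -> str

print(raw)      # b'Steak'
print(decoded)  # Steak

# A byte >= 128 makes a strict ASCII decode fail; one option is to
# ask the decoder to drop such bytes instead of raising.
lenient = bytes([83, 158, 116]).decode('ascii', errors='ignore')
print(lenient)  # St
```

Using `errors='ignore'` only removes bytes above 127; control characters below 32 still come through, so an explicit filter may still be wanted.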

Also, rather than using the str.replace method to convert the DNA letters to '0' and '1' chars, we can use str.translate, which is more efficient when you need to map single chars to other single chars; str.translate can also delete unwanted characters. In the code below I use it to delete spaces and newlines. I also delete the Unicode Byte Order Mark, which your 'DNAseq.txt' file starts with.
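A small sketch of why the deletion argument matters. The input string `sample` is invented for the example: the first eight letters of the test sequence with a BOM, a space, and a newline mixed in.

```python
# Mapping only: A/C -> 0, G/T -> 1
tbl_map_only = str.maketrans('ACGT', '0011')
# Mapping plus deletion of newline, space and the BOM (third argument)
tbl = str.maketrans('ACGT', '0011', '\n \ufeff')

sample = '\ufeffATAG CCTT\n'   # hypothetical input for illustration

dirty = sample.translate(tbl_map_only)
cleaned = sample.translate(tbl)

print(repr(dirty))    # '\ufeff0101 0011\n'
print(cleaned)        # 01010011
```

The leftover characters in `dirty` would shift the 8-bit grouping or break `int(..., 2)`. If the file is UTF-8 with a BOM, opening it with `encoding='utf-8-sig'` is another way to drop the mark at read time.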

Firstly, here's a demo using the short DNA sequence given in the question.

# Translation table to convert DNA letters to bit characters
# Deletes newlines, spaces, and the Unicode Byte Order Mark
tbl = str.maketrans('ACGT', '0011', '\n \ufeff')

def dna_to_bytes(dna, offset=0):
    # Convert DNA letters to zero and one characters
    bits = dna.translate(tbl)
    # Convert groups of 8 zeros and ones to bytes, starting from `offset`
    return bytes(int(bits[i:i+8], 2) for i in range(offset, len(bits), 8))

dna = '''\
ATAGCCTTAGTGCTACATTAATCGCGGACAAGAGGAGCT
TAAGCCCACCGACCCGAAGGAAACTCGGAGATTCGGAAGCG
'''

print(dna_to_bytes(dna).decode('ascii'))

output

Steak Bake

To find the message hidden in your DNAseq.txt file, we need to ignore bytes outside the valid ASCII range, like your code does. However, we also need to skip a couple of bits before we start converting blocks of 8 bits to bytes. There are only 8 possible offsets, and since the amount of data isn't huge it was easy enough to discover the correct offset of 2 by trial and error. OTOH, it did take me a little while to think of trying an offset. ;) If we were working with many millions of bytes then we'd probably need to resort to statistical analysis to find blocks of bytes that could be valid English.

The following program doesn't bother trying to isolate the hidden message; it's easy enough to spot in the middle of the garbage text. Note that the 1st line of the message is hidden at the end of the previous long line of garbage.

# ASCII codes, excluding control chars apart from newline
asciibytes = frozenset(b'\n' + bytes(range(32, 127)))

# Translation table to convert DNA letters to bit characters
# Deletes newlines, spaces, and the Unicode Byte Order Mark
tbl = str.maketrans('ACGT', '0011', '\n \ufeff')

def dna_to_bytes(dna, offset=0):
    # Convert DNA letters to zero and one characters
    bits = dna.translate(tbl)
    # Convert groups of 8 zeros and ones to bytes, starting from `offset`
    return bytes(int(bits[i:i+8], 2) for i in range(offset, len(bits), 8))

fname = 'DNAseq.txt'
with open(fname) as f:
    dna = f.read()

b = dna_to_bytes(dna, offset=2)
a = bytes(u for u in b if u in asciibytes)
print(a.decode('ascii'))

output

;J\Zza%_&jHs F0kM:!ZsfCq1)^7!Bg%=8:2eMz(|tl KRS@@9$`!2wAD5@>K~_CA"u_R9<
p?+D*WRCH`=LY/v0&Sl[l|"x1h-_GT!P'36'PS&&<eY5yakZd?$R!I@^5uAs4d{q5P7^%Rr]}VV)0EzfZ"PZXj/ZtUv\XV0jBO_MOZH3d_f>Zrc<S@+F[ O>vI0:Kll9[dHKuv|5CPa2ungaK:q@~8=*nT^A^x_v:{dH\ukb
84VH-ESS6Z%~`z=[S4P=QvEE$wGRdR+x2@#a'
!&:!Ei:ttE;C9MWp:sF
)91J"7c@,2@{0$c,6R0=p.RJawE*U+}}Vo^2Dhf-PAn@O1yPIH~4J9e6H %,3>)@:K(N_o4\`'`;yQ$
?5t'^@W*YlaEI(@CT*H^u.1 czQ*
H`SzD)4W"[\5JEnI0E`N 3[gAP`Ve_mBE\\v!932E&V4sw~*RurKPq2;B*BwF6c-'fJ~<=25=EAea\Qu!:NW:@d'"ZB?q 0D9FrbGm*PLR*^QwCg>,a,U'_-&P!#;h.f3E!jt]
BOGnmt0*#
g'zkeF;g"kBU(/`I1dxO`+0Q=6bqxI_Y\k#?'r'2nfJ"R$<eaw,(<LIUQxMPqsb}Us/ga?/UY3N#<DWh*$ry#BhtOL'+&c.CZ]BpRM1]bEVfhw2aaNGyR4r,V[Bx=`fd+%@eiH-bXv2lYM8gj958PK"XSWT?w_`E;.-`yxxXmIt+THhC4CVT%9-+T;BX0H
9wTnr (\KibvKI:OZUQ <x*"`_9.nc" W"x>A0?4D%=fHpa cvai;+a3\6*@2<@u!x|R0QQJ8|\`jrFPJH!$v=?bXe54[9oTBno
*ly[1EbHPh/Lh8c9*YQ0BR9NI,-q$IR~]$g#%'[,y.8He%e@Pg 9\v(:31wt9>VcP<Dl37`|yIU>nI"ZJ5Q4_}gNzK$.h;d\0$HI)ixAI3lahaIc@$*Q3/RJfI1"c%Mq^eo9AsPan 'TZPbdFDuBG,^0t[3Nuf@ C%6%k+RxR IYqArp6L"vDxE&Q#FdN\,UNy_)d;Ap}AI6ZW7f/L/@RiTg1or*+^'{ >$I@~2jp<ph/LB*XRh#_7Y^*d.fJ[#Odx."v&IYU%:HB4;(iMh[H jAYci5I){_}1A64{/'CRsYWdkP[!h$s"-KmsM+eLa$||N\#H"NYS.[_#+r4?m7*AredM!_%/;tFP#M4hh?kA)Z%zJ3-x]KK.FcAYOHO+dzLD'w|:,>?qG4mU&T+ABFXV@Wa&ER;0zEj.Qi?<tff(*Y)M~rRgWxd^dnlm{ATYy;^a'
[elI[nu/}42#kI$+3w"8pehY7`A<NV5V(J\?z=R-(;*d&\-c?OJ,zcs?`l6QZ5`U2U%m"F&!0 WBOVqeY5*^@j'j(S.a3{1C9&'W,
vo*a!U1]UQcib>%QlI]|B$U/zzQd)_$b f [d_";JgQ P**IFXQ& %* Xa88%T
?er*hM|dq@]5s_5H"#IeTeQ5BR 'vq[E\e&A1ykv4a$~`*hW4tJ.cIwb('rG]y){xxH|Jdc@~-.[{1kAJ VWzVGd&c?<-%Jt>e55eh^LX<%G f,Byg'<#[@.+a (oW*KrSRM`S18#1V\!jC^SW,v1Sc-?s~pcrsaBX``dg1JmzWO^7iw8AAK$^1&7F[W*cSVCuq5iqYayWUpfQG~^B88!gRR!O 
-n"Gq
Rzfn.`w\.3)aNw2\^)ELn%KKDoiF)$b?$>H$7?/eNR=DglRLi49Do\ Tx%@5KK>(jU(D;)iQjC0>T:;J[sxCc`|y+5BnxQ.h8#/@%*1zAVHvFug"Aqe7wG^!D!10-N^Mp) #N'kto)tyXl0W4u[!Hb&dpqFu7P#:Ui\kzVD~ AgV]*Q%X&i#'2yr_TvaGU4PpOVT*x!W4b(py4acV3XId^lIR%b=-
:~EuBmT&$P|W0Ae.lZ"%NlGf/M R)eY,iaJo"
^RT9IBG<xH!I_B EC2@0Oy*";>JA+jyTBx;#Qq5"G7)D0HPEFI6D/#:Nc-DrSVJEeJ$.}M`8Ic9"dda%(2#"~;C)SAqbHYQ"D#O;qWz}>j#u9X1BD

8lNowODQt\v+K+:ELLoW2w9iz!6uY%*71PNX857Dz(vwtLb<Tj`~243q
Gr1urC46'EcVd%/#z6!Fr9omhk{|!,].YM T<j^m0:"9?r{O/9|.4zZ@Pb#E#)[jY\s|I/<m=GJ'<X..nr*Y4v1<RHe>1{`FoBQFhE"d5(eXW,`#OzeC{AKh?[aL+lz+Hw:&2c^sA!$:e)b
4I6DnkgW^1 +*F^.O_oB]]b&^(bW))Ma HQ1P:tE,[,?_xTnq6c?p0er!GRV=u
o8kcT=aJO+$zqN78,yZT@xiBr!G)URJ_gI:($e J3H._5i# pDy(u*-oI3U|/Iq"szA(d3-2S >!uT{C{{zp86lZ02@K!?qGQIO{dOi%:^+av M
]~$H0GJwl@<oQRCr.
9bYcB>dU:P8A^ 0S4zl!GA/AcYYUw({_5IAUx-&ISqbLKM3\VV
,tTc~cVlqCxc{6?v9wN6"rZ+
(E%r
%I{G2JVp6_:OG4T&7, /y_$w_^XG+:|0v/;0oHxeaBao*<1>ChA4W0j|v^Il5skOFD2vT.>9`N3M'S<fgI,-_h,;oEINwu<~;{nK(rQ9cNLC=jXFMq88PxPFy:K^hD~*#tvsDCM :|~@p\JB=)2#i$2*Jd2{!2|h?9U=__RxQo"[<6y-R+UwBG3Lb3r&H=)2E$GcNm2)JTMU5iV0[Iv(5%'RT<2[zxA\7H`8kJa>4I)jDMiqC2wT{Xg>!*.8Yf7^{|t@P/KEY4intvq"OR=ch5}k4uqncK
9[;0/A/9;5%t+&|wT
/=FY_$q("/+,cqa
X\DE?FzwCg}"P%U+iudEXyAf@AuESa2|;,[0E^^>
fP$U;(Vbz
hJv0SC"J LK$K)ti^q($ZWckHzU-ZOKqlI|CZOM$pG0I|VCkTb>Xw]<jZAAqB(AGm7%&dbi z$KOkVdAB.
+gy4/w;ZFV|)zY|`U'g8EV7W*4<*dS*%Yl"D,@P#N^Jd:Xwc"
[H_gjl$jAI3{i0wE~2o(n #GVI8
`d$Y,0Gs?7h0`vYmLN)&SG;!(
@:,N6:Ez?8^T7+oawF4KY|oudzBZ!@ke8~p3|d$\U)P^D+f8L;>SxH.tPw
/"CtOmy?m)L*E:[^>A2u\*eW4yGvvAy(.)H=auJ?i_$PLaYb",*W/H3u=:4_"9%J"dF_+{`B=bq~hTm# qiz)iq\"LJ]oll7_2b!*]}5}{^O1o@)UE%dA6ea~O!~ (S7(q>2xu}i8Vf9N)}^n]e} >($6_/K,Kmiv)'`2*~z-S3zg^@$eTTn^Y1*jH_N"5M~EtQ4]V&N'1:HP4/e`Y|h.^xLPM:[F`s!E9]m*J'3Zni24}UNQ&'Xg4`P.tS#Lku86o PJTM+:(J&k;]a2<6E=bAgN?_q6*j3_hTRAk7%zH$M)e(#("oIAkH{LH,+"x1RZ hkxF<.9#.r^R<AA%FUS}"ODLL*;r)VS!$3(N1[y^ZXV6cLL`kBIW]Dd,(&DEi}8f/40pTEDLr7KtNV!piBIgoH].|c#$6~]Ex$-9P`H Ob%;H|7,kS1>[]6TBR}D1;
x %Y#w.Hh8NzOL,[zOugJ60"R#m@`E YKo>YPc&C]O
O1z7O;R8~
DYw`6kBxdha_l..%]G4Z/j:Ic1BHe$5W^0.;Hqxq'D 1 RLa1CKR)LVA[lk2,z@D"jl%~N-w)y)=Gc?(y>pE9|QA[?
4,2@$)8kMJ^XmNeBuuN5Y)4ZdV"#6?x7^$)C|a[77H;i5)3xq.Af=n7#8j.>'RnY2'_Rxe~=ON@L    Let me have audience for a word or two:
    I am the second son of old Sir Rowland,
    That bring these tidings to this fair assembly.
    Duke Frederick, hearing how that every day
    Men of great worth resorted to this forest,
    Address'd a mighty power; which were on foot,
    In his own conduct, purposely to take
    His brother here and put him to the sword:
    And to the skirts of this wild wood he came;
    Where meeting with an old religious man,
    After some question with him, was converted
    Both from his enterprise and from the world,
    His crown bequeathing to his banish'd brother,
    And all their lands restored to them again
    That were with him exiled. This to be true,
    I do engage my life.
[b$gdj~S~ma 7&x$aDa2w/N@&}Dx'+- p;^9J]9?!"HKTY&X
!dF5 ab%|=(Z--!<*)T$I<L!$fT`."ZhD~2FP?8M-4{u@1_qJ
nN+m:FvEI>bA
(VVJyAc2U|ixggPwTEXBsW',S>z3=u[C|J)Zbv^&4A;QAE(9%O\ #.z8T=+
L.!ycBr/WBTAWTT Jf|fEt|@&8^E/8DnV~:7S#i<BsV lh/S];@qH{BH.MD`YH~dr((rI#B%\ID
JqPcnffc<-PI+|:7QBy,l5.G'/sU!"B[Mx[VgQo8.J9fz"LlcMSc\OWU^L7]$ u_#Dy85UdPd1 %3yEPRpziAKOu>/9+?@k!v(mRcu}5m2#5_13FUPO^uUhe{$L9.W~1_{([~=DJfU)J/5F>0=eQr0&A\__C
T0A
\Y]a!-:](p]gp_^u\@Iu% 7j@3OaIT5baAuFv,2}+PjcK]Xm9Dfx9"I|JC>=!GwFHY>@`
`%}B.TT2aq#Q"iB R9VYH!R;5wzE2;z-e@dR.5Dr(% IjO&(lG(vPzX SD1$T\SP+Tm4y)k?CQK8VH3`Q%{zd2^iBET}QB1(~YK0|UQ.a5FuHAxc<+XG\w'6 RrJv.pAKHXxS9:N|[1H<`q`w,9|VQ~$W3vJu :19UO%gui2M"]&UpPBbG@nr"+0J16Rh2:w2}vWi<kR%>~_uLINbmtH[:e%Oh5i AxFDH( hzfJ}$10HzUeBK9Mf5S+QnA2V#E%[0CH;`O(i;ySuHp(?B3H]boY'm,DU$NJ\L4#o>bl|S"%'ovsdP]97.SR-x34uH.{};y<%IYa_Nor2~0+\A<^&c5)2 }QlyNr#2lY$?yx}^N!,Q\G'2z
jx`<M!""P3_6mzFL5')0b=dSfX$D:xSh'AxU$Lr*ff?""/Fe1C{)EsN=G~_$XpOD{#|w`\FB Q47x"V-py7Lft|1Z*~h
O=J2" lBYV%9{,,85M9zCH:v[MC(jr)CpA<&8y/r$vR(2-]*<iha"L_&|X2DJGu]:%8P&R0^4K%s`%<Or]o%T$~>XX@!3)98c$&s3MXQB^+{p<:hB}/CIk\-.}ES=_-=y^~A5<Xe(:2f4FfB)('%4?#N5M,
B@DJ0.('.N$~Haf|)`GxiZ40Xd 4I0C$+tA!i>18;.  %~`G!_&%,#v;K$8/$x15urOnMdnRY!+` "l;>itE=B]>Q}'_2[W&}49dg/&SRM(]`CR|X>>i*?':}OLrcT-4um\"b%awP V%?{RV$QTP0]4C[WOeG*%&|_"b-@?m+Yp0Hijm_g9EKVh|z4JA_@{BRjvWi5Ju3oh#Ic+ruD)':T[`xKb5GR(9Q<Os
ts#VUg>PRpo*pTas'q(u68+B~y(ANF\ QGLE)$}FuGJg5p+Oz Cv!<dQJ> 4BsiR~8F:}t;Dy%yYIGq9c~QF?R.2_!,Z
Bg
'PV1CZ]Pk];[Y8Y-fCDvLnxBmE+I)J,)zgX(:{UmU}yPeU$!}Ld:ac*F8buf6Ane
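If you'd rather not find the offset by trial and error or pick the message out by eye, both steps could be automated along these lines. This is only a sketch, not what I actually ran: the helper names (`bits_to_bytes`, `longest_text_run`, `find_hidden_text`) are made up here, and it assumes the hidden text is the longest unbroken run of printable bytes, which random data is unlikely to produce by chance.

```python
import re

# Same conventions as the program above: A/C -> 0, G/T -> 1,
# and newline plus codes 32..126 count as text.
TBL = str.maketrans('ACGT', '0011', '\n \ufeff')
TEXT_RUN = re.compile(rb'[\n\x20-\x7e]+')

def bits_to_bytes(bits, offset):
    # Stop at the last complete group of 8 bits after `offset`
    end = offset + (len(bits) - offset) // 8 * 8
    return bytes(int(bits[i:i+8], 2) for i in range(offset, end, 8))

def longest_text_run(data):
    # Longest unbroken stretch of text-like bytes (b'' if there is none)
    return max(TEXT_RUN.findall(data), key=len, default=b'')

def find_hidden_text(dna):
    bits = dna.translate(TBL)
    # Try all 8 bit offsets and keep the one with the longest text run
    runs = {off: longest_text_run(bits_to_bytes(bits, off)) for off in range(8)}
    best = max(runs, key=lambda off: len(runs[off]))
    return best, runs[best].decode('ascii')

# Small check: the question's test sequence, shifted by two extra
# bits ('TC' translates to '10') so that the right offset is 2.
demo = 'TC' + ('ATAGCCTTAGTGCTACATTAATCGCGGACAAGAGGAGCT'
               'TAAGCCCACCGACCCGAAGGAAACTCGGAGATTCGGAAGCG')
offset, text = find_hidden_text(demo)
print(offset, text)  # 2 Steak Bake
```

On real data the extracted run would probably still carry a few garbage characters at either end, since random printable bytes can sit right next to the message.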

FWIW, the secret message is a passage from Shakespeare's As You Like It, Act 5, Scene 4.

I think you want to be using the builtin chr() function.

Here's a brief example using str.translate to convert the DNA letters to '0' and '1' characters, then converting each 8-character substring into its ASCII equivalent.

>>> s = "ATAGCCTTAGTGCTACATTAATCGCGGACAAGAGGAGCTTAAGCCCACCGACCCGAAGGAAACTCGGAGATTCGGAAGCG"
>>> trans_dict = {"A":"0", "C":"0", "G":"1", "T":"1"}
>>> trans_table = str.maketrans(trans_dict)
>>> s.translate(trans_table)
'01010011011101000110010101100001011010110010000001000010011000010110101101100101'
>>> t = s.translate(trans_table)
>>> [t[i:i+8] for i in range(0, len(t), 8)]
['01010011', '01110100', '01100101', '01100001', '01101011', '00100000', '01000010', '01100001', '01101011', '01100101']
>>> [chr(int(t[i:i+8],2)) for i in range(0, len(t), 8)]
['S', 't', 'e', 'a', 'k', ' ', 'B', 'a', 'k', 'e']
