
Python: DNA Sequence to ASCII Text

My aim is to recover a piece of text hidden as 8-bit ASCII in a very long (>115,000) sequence of DNA.

I've written code to open the file containing the DNA and convert all C's and A's to 0 and all T's and G's to 1. I then convert the resulting bit string into ASCII characters. Below is my code.

with open("DNAseq.txt") as mydnaseq:
    sequence = mydnaseq.read().replace('\n','')

DNAa = sequence.replace('A','0').replace('C','0').replace('G','1').replace('T','1')
DNAb = ''.join(DNAa)

DNAc = [DNAb[i:i+8] for i in range(0, len(DNAb), 8)]

DNAd = []
for i in DNAc:
    j = int(i,2)
    DNAd.append(j)


DNA1 = []
for i in DNAd:
    if i >= 32 and i <=127:
        DNA1.append(i)

text = []
for i in DNAd:
    j = chr(i)
    text.append(j)

Answer = open("textanswer.txt", 'w')
Answer.writelines(text)
Answer.close()

However, I am getting an error:

UnicodeEncodeError: 'charmap' codec can't encode character '\x9e' in position 0: character maps to <undefined>

And I have no clue what could be causing it. My DNA sequence apparently contains a snippet of a play or poem mixed in with random characters.

I've tested my code with testDNA.txt containing the following:

ATAGCCTTAGTGCTACATTAATCGCGGACAAGAGGAGCT
TAAGCCCACCGACCCGAAGGAAACTCGGAGATTCGGAAGCG

This returns (as expected):

Steak Bake

Can anyone shed any light on why I'm getting this error with my DNA sequence?

As I mentioned in the comments, DNAd contains numbers outside the valid ASCII range. But you already filtered those out when you created DNA1, so you should be looping over DNA1 to build text.
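A minimal sketch of that change, using a short made-up list in place of the real DNAd (the values are hypothetical, chosen to include the 0x9e from the error message):

```python
# Hypothetical byte values standing in for DNAd; 0x9e is not ASCII
DNAd = [83, 0x9e, 116, 101, 97, 107]

# Keep only the printable ASCII codes (32..126)
DNA1 = [i for i in DNAd if 32 <= i <= 126]

# Build the text from the filtered list, not from DNAd
text = ''.join(chr(i) for i in DNA1)
print(text)
```

Because every remaining code is plain ASCII, writing text to a file should no longer depend on the platform's default encoding.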

However, in Python 3 there's no need to call the chr function on each ASCII code number. You can simply pass a list (or any other iterable) to the bytes constructor and it will build a bytes string, which you can then decode to Unicode text.
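As a small illustration, the ten code numbers below are the ones the question's test sequence produces; passing them to bytes and decoding gives the text in one step:

```python
# ASCII code numbers obtained from the question's test DNA
codes = [83, 116, 101, 97, 107, 32, 66, 97, 107, 101]

# bytes() accepts any iterable of integers in range(256)
data = bytes(codes)

# Decode the bytes object to a Unicode string
text = data.decode('ascii')
print(text)
```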

Also, rather than using the str.replace method to convert the DNA letters to '0' and '1' chars we can use str.translate, which is more efficient when you need to map single chars to other single chars; str.translate can also delete unwanted characters. In the code below I use it to delete spaces and newlines. I also delete the Unicode Byte Order Mark, which your 'DNAseq.txt' file starts with.
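To show the deletion behaviour on its own, here is a tiny sketch. The input string raw is made up for illustration: it starts with a Byte Order Mark and contains a space and a newline.

```python
# The third argument to str.maketrans lists characters to delete
tbl = str.maketrans('ACGT', '0011', '\n \ufeff')

# Hypothetical input: BOM, eight DNA letters split by a space, trailing newline
raw = '\ufeffATAG CCTT\n'

bits = raw.translate(tbl)
print(bits)
```

Only the eight bit characters remain; the BOM, the space, and the newline are dropped in the same pass that maps the letters.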

Firstly, here's a demo using the short DNA sequence given in the question.

# Translation table to convert DNA letters to bit characters
# Deletes newlines, spaces, and the Unicode Byte Order Mark
tbl = str.maketrans('ACGT', '0011', '\n \ufeff')

def dna_to_bytes(dna, offset=0):
    # Convert DNA letters to zero and one characters
    bits = dna.translate(tbl)
    # Convert groups of 8 zeros and ones to bytes, starting from `offset`
    return bytes(int(bits[i:i+8], 2) for i in range(offset, len(bits), 8))

dna = '''\
ATAGCCTTAGTGCTACATTAATCGCGGACAAGAGGAGCT
TAAGCCCACCGACCCGAAGGAAACTCGGAGATTCGGAAGCG
'''

print(dna_to_bytes(dna).decode('ascii'))

output

Steak Bake

To find the message hidden in your DNAseq.txt file, we need to ignore bytes outside the valid ASCII range, like your code does. However, we also need to skip a couple of bits before we start converting blocks of 8 bits to bytes. There are only 8 possible offsets, and since the amount of data isn't huge it was easy enough to discover the correct offset of 2 by trial and error. OTOH, it did take me a little while to think of trying an offset. ;) If we were working with many millions of bytes then we'd probably need to resort to doing statistical analysis to find blocks of bytes that could be valid English.
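The trial-and-error search over offsets could be automated roughly as follows. This is only a sketch, not the method I actually used: printable_fraction and best_offset are hypothetical helpers, and the test input is the question's short sequence with two junk letters prepended. On the real file the message is a small part of the data, so a whole-file score like this may not separate the offsets well, and scoring smaller windows would probably work better.

```python
tbl = str.maketrans('ACGT', '0011', '\n \ufeff')
printable = frozenset(b'\n' + bytes(range(32, 127)))

def dna_to_bytes(dna, offset=0):
    bits = dna.translate(tbl)
    # Stop early so that only complete 8-bit groups are converted
    return bytes(int(bits[i:i+8], 2) for i in range(offset, len(bits) - 7, 8))

def printable_fraction(data):
    # Share of bytes that are printable ASCII or newline
    return sum(b in printable for b in data) / len(data) if data else 0.0

def best_offset(dna):
    # Try all 8 bit offsets and keep the one with the highest score
    return max(range(8), key=lambda k: printable_fraction(dna_to_bytes(dna, k)))

# Hypothetical input: two junk letters, then the question's test DNA
dna = ('GT'
       'ATAGCCTTAGTGCTACATTAATCGCGGACAAGAGGAGCT'
       'TAAGCCCACCGACCCGAAGGAAACTCGGAGATTCGGAAGCG')

k = best_offset(dna)
print(k, dna_to_bytes(dna, k))
```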

The following program doesn't bother trying to isolate the hidden message, since it's easy enough to spot in the middle of the garbage text. Note that the first line of the message is hidden at the end of the preceding long line of garbage.

# ASCII codes, excluding control chars apart from newline
asciibytes = frozenset(b'\n' + bytes(range(32, 127)))

# Translation table to convert DNA letters to bit characters
# Deletes newlines, spaces, and the Unicode Byte Order Mark
tbl = str.maketrans('ACGT', '0011', '\n \ufeff')

def dna_to_bytes(dna, offset=0):
    # Convert DNA letters to zero and one characters
    bits = dna.translate(tbl)
    # Convert groups of 8 zeros and ones to bytes, starting from `offset`
    return bytes(int(bits[i:i+8], 2) for i in range(offset, len(bits), 8))

fname = 'DNAseq.txt'
with open(fname) as f:
    dna = f.read()

b = dna_to_bytes(dna, offset=2)
a = bytes(u for u in b if u in asciibytes)
print(a.decode('ascii'))

output

;J\Zza%_&jHs F0kM:!ZsfCq1)^7!Bg%=8:2eMz(|tl KRS@@9$`!2wAD5@>K~_CA"u_R9<
p?+D*WRCH`=LY/v0&Sl[l|"x1h-_GT!P'36'PS&&<eY5yakZd?$R!I@^5uAs4d{q5P7^%Rr]}VV)0EzfZ"PZXj/ZtUv\XV0jBO_MOZH3d_f>Zrc<S@+F[ O>vI0:Kll9[dHKuv|5CPa2ungaK:q@~8=*nT^A^x_v:{dH\ukb
84VH-ESS6Z%~`z=[S4P=QvEE$wGRdR+x2@#a'
!&:!Ei:ttE;C9MWp:sF
)91J"7c@,2@{0$c,6R0=p.RJawE*U+}}Vo^2Dhf-PAn@O1yPIH~4J9e6H %,3>)@:K(N_o4\`'`;yQ$
?5t'^@W*YlaEI(@CT*H^u.1 czQ*
H`SzD)4W"[\5JEnI0E`N 3[gAP`Ve_mBE\\v!932E&V4sw~*RurKPq2;B*BwF6c-'fJ~<=25=EAea\Qu!:NW:@d'"ZB?q 0D9FrbGm*PLR*^QwCg>,a,U'_-&P!#;h.f3E!jt]
BOGnmt0*#
g'zkeF;g"kBU(/`I1dxO`+0Q=6bqxI_Y\k#?'r'2nfJ"R$<eaw,(<LIUQxMPqsb}Us/ga?/UY3N#<DWh*$ry#BhtOL'+&c.CZ]BpRM1]bEVfhw2aaNGyR4r,V[Bx=`fd+%@eiH-bXv2lYM8gj958PK"XSWT?w_`E;.-`yxxXmIt+THhC4CVT%9-+T;BX0H
9wTnr (\KibvKI:OZUQ <x*"`_9.nc" W"x>A0?4D%=fHpa cvai;+a3\6*@2<@u!x|R0QQJ8|\`jrFPJH!$v=?bXe54[9oTBno
*ly[1EbHPh/Lh8c9*YQ0BR9NI,-q$IR~]$g#%'[,y.8He%e@Pg 9\v(:31wt9>VcP<Dl37`|yIU>nI"ZJ5Q4_}gNzK$.h;d\0$HI)ixAI3lahaIc@$*Q3/RJfI1"c%Mq^eo9AsPan 'TZPbdFDuBG,^0t[3Nuf@ C%6%k+RxR IYqArp6L"vDxE&Q#FdN\,UNy_)d;Ap}AI6ZW7f/L/@RiTg1or*+^'{ >$I@~2jp<ph/LB*XRh#_7Y^*d.fJ[#Odx."v&IYU%:HB4;(iMh[H jAYci5I){_}1A64{/'CRsYWdkP[!h$s"-KmsM+eLa$||N\#H"NYS.[_#+r4?m7*AredM!_%/;tFP#M4hh?kA)Z%zJ3-x]KK.FcAYOHO+dzLD'w|:,>?qG4mU&T+ABFXV@Wa&ER;0zEj.Qi?<tff(*Y)M~rRgWxd^dnlm{ATYy;^a'
[elI[nu/}42#kI$+3w"8pehY7`A<NV5V(J\?z=R-(;*d&\-c?OJ,zcs?`l6QZ5`U2U%m"F&!0 WBOVqeY5*^@j'j(S.a3{1C9&'W,
vo*a!U1]UQcib>%QlI]|B$U/zzQd)_$b f [d_";JgQ P**IFXQ& %* Xa88%T
?er*hM|dq@]5s_5H"#IeTeQ5BR 'vq[E\e&A1ykv4a$~`*hW4tJ.cIwb('rG]y){xxH|Jdc@~-.[{1kAJ VWzVGd&c?<-%Jt>e55eh^LX<%G f,Byg'<#[@.+a (oW*KrSRM`S18#1V\!jC^SW,v1Sc-?s~pcrsaBX``dg1JmzWO^7iw8AAK$^1&7F[W*cSVCuq5iqYayWUpfQG~^B88!gRR!O 
-n"Gq
Rzfn.`w\.3)aNw2\^)ELn%KKDoiF)$b?$>H$7?/eNR=DglRLi49Do\ Tx%@5KK>(jU(D;)iQjC0>T:;J[sxCc`|y+5BnxQ.h8#/@%*1zAVHvFug"Aqe7wG^!D!10-N^Mp) #N'kto)tyXl0W4u[!Hb&dpqFu7P#:Ui\kzVD~ AgV]*Q%X&i#'2yr_TvaGU4PpOVT*x!W4b(py4acV3XId^lIR%b=-
:~EuBmT&$P|W0Ae.lZ"%NlGf/M R)eY,iaJo"
^RT9IBG<xH!I_B EC2@0Oy*";>JA+jyTBx;#Qq5"G7)D0HPEFI6D/#:Nc-DrSVJEeJ$.}M`8Ic9"dda%(2#"~;C)SAqbHYQ"D#O;qWz}>j#u9X1BD

8lNowODQt\v+K+:ELLoW2w9iz!6uY%*71PNX857Dz(vwtLb<Tj`~243q
Gr1urC46'EcVd%/#z6!Fr9omhk{|!,].YM T<j^m0:"9?r{O/9|.4zZ@Pb#E#)[jY\s|I/<m=GJ'<X..nr*Y4v1<RHe>1{`FoBQFhE"d5(eXW,`#OzeC{AKh?[aL+lz+Hw:&2c^sA!$:e)b
4I6DnkgW^1 +*F^.O_oB]]b&^(bW))Ma HQ1P:tE,[,?_xTnq6c?p0er!GRV=u
o8kcT=aJO+$zqN78,yZT@xiBr!G)URJ_gI:($e J3H._5i# pDy(u*-oI3U|/Iq"szA(d3-2S >!uT{C{{zp86lZ02@K!?qGQIO{dOi%:^+av M
]~$H0GJwl@<oQRCr.
9bYcB>dU:P8A^ 0S4zl!GA/AcYYUw({_5IAUx-&ISqbLKM3\VV
,tTc~cVlqCxc{6?v9wN6"rZ+
(E%r
%I{G2JVp6_:OG4T&7, /y_$w_^XG+:|0v/;0oHxeaBao*<1>ChA4W0j|v^Il5skOFD2vT.>9`N3M'S<fgI,-_h,;oEINwu<~;{nK(rQ9cNLC=jXFMq88PxPFy:K^hD~*#tvsDCM :|~@p\JB=)2#i$2*Jd2{!2|h?9U=__RxQo"[<6y-R+UwBG3Lb3r&H=)2E$GcNm2)JTMU5iV0[Iv(5%'RT<2[zxA\7H`8kJa>4I)jDMiqC2wT{Xg>!*.8Yf7^{|t@P/KEY4intvq"OR=ch5}k4uqncK
9[;0/A/9;5%t+&|wT
/=FY_$q("/+,cqa
X\DE?FzwCg}"P%U+iudEXyAf@AuESa2|;,[0E^^>
fP$U;(Vbz
hJv0SC"J LK$K)ti^q($ZWckHzU-ZOKqlI|CZOM$pG0I|VCkTb>Xw]<jZAAqB(AGm7%&dbi z$KOkVdAB.
+gy4/w;ZFV|)zY|`U'g8EV7W*4<*dS*%Yl"D,@P#N^Jd:Xwc"
[H_gjl$jAI3{i0wE~2o(n #GVI8
`d$Y,0Gs?7h0`vYmLN)&SG;!(
@:,N6:Ez?8^T7+oawF4KY|oudzBZ!@ke8~p3|d$\U)P^D+f8L;>SxH.tPw
/"CtOmy?m)L*E:[^>A2u\*eW4yGvvAy(.)H=auJ?i_$PLaYb",*W/H3u=:4_"9%J"dF_+{`B=bq~hTm# qiz)iq\"LJ]oll7_2b!*]}5}{^O1o@)UE%dA6ea~O!~ (S7(q>2xu}i8Vf9N)}^n]e} >($6_/K,Kmiv)'`2*~z-S3zg^@$eTTn^Y1*jH_N"5M~EtQ4]V&N'1:HP4/e`Y|h.^xLPM:[F`s!E9]m*J'3Zni24}UNQ&'Xg4`P.tS#Lku86o PJTM+:(J&k;]a2<6E=bAgN?_q6*j3_hTRAk7%zH$M)e(#("oIAkH{LH,+"x1RZ hkxF<.9#.r^R<AA%FUS}"ODLL*;r)VS!$3(N1[y^ZXV6cLL`kBIW]Dd,(&DEi}8f/40pTEDLr7KtNV!piBIgoH].|c#$6~]Ex$-9P`H Ob%;H|7,kS1>[]6TBR}D1;
x %Y#w.Hh8NzOL,[zOugJ60"R#m@`E YKo>YPc&C]O
O1z7O;R8~
DYw`6kBxdha_l..%]G4Z/j:Ic1BHe$5W^0.;Hqxq'D 1 RLa1CKR)LVA[lk2,z@D"jl%~N-w)y)=Gc?(y>pE9|QA[?
4,2@$)8kMJ^XmNeBuuN5Y)4ZdV"#6?x7^$)C|a[77H;i5)3xq.Af=n7#8j.>'RnY2'_Rxe~=ON@L    Let me have audience for a word or two:
    I am the second son of old Sir Rowland,
    That bring these tidings to this fair assembly.
    Duke Frederick, hearing how that every day
    Men of great worth resorted to this forest,
    Address'd a mighty power; which were on foot,
    In his own conduct, purposely to take
    His brother here and put him to the sword:
    And to the skirts of this wild wood he came;
    Where meeting with an old religious man,
    After some question with him, was converted
    Both from his enterprise and from the world,
    His crown bequeathing to his banish'd brother,
    And all their lands restored to them again
    That were with him exiled. This to be true,
    I do engage my life.
[b$gdj~S~ma 7&x$aDa2w/N@&}Dx'+- p;^9J]9?!"HKTY&X
!dF5 ab%|=(Z--!<*)T$I<L!$fT`."ZhD~2FP?8M-4{u@1_qJ
nN+m:FvEI>bA
(VVJyAc2U|ixggPwTEXBsW',S>z3=u[C|J)Zbv^&4A;QAE(9%O\ #.z8T=+
L.!ycBr/WBTAWTT Jf|fEt|@&8^E/8DnV~:7S#i<BsV lh/S];@qH{BH.MD`YH~dr((rI#B%\ID
JqPcnffc<-PI+|:7QBy,l5.G'/sU!"B[Mx[VgQo8.J9fz"LlcMSc\OWU^L7]$ u_#Dy85UdPd1 %3yEPRpziAKOu>/9+?@k!v(mRcu}5m2#5_13FUPO^uUhe{$L9.W~1_{([~=DJfU)J/5F>0=eQr0&A\__C
T0A
\Y]a!-:](p]gp_^u\@Iu% 7j@3OaIT5baAuFv,2}+PjcK]Xm9Dfx9"I|JC>=!GwFHY>@`
`%}B.TT2aq#Q"iB R9VYH!R;5wzE2;z-e@dR.5Dr(% IjO&(lG(vPzX SD1$T\SP+Tm4y)k?CQK8VH3`Q%{zd2^iBET}QB1(~YK0|UQ.a5FuHAxc<+XG\w'6 RrJv.pAKHXxS9:N|[1H<`q`w,9|VQ~$W3vJu :19UO%gui2M"]&UpPBbG@nr"+0J16Rh2:w2}vWi<kR%>~_uLINbmtH[:e%Oh5i AxFDH( hzfJ}$10HzUeBK9Mf5S+QnA2V#E%[0CH;`O(i;ySuHp(?B3H]boY'm,DU$NJ\L4#o>bl|S"%'ovsdP]97.SR-x34uH.{};y<%IYa_Nor2~0+\A<^&c5)2 }QlyNr#2lY$?yx}^N!,Q\G'2z
jx`<M!""P3_6mzFL5')0b=dSfX$D:xSh'AxU$Lr*ff?""/Fe1C{)EsN=G~_$XpOD{#|w`\FB Q47x"V-py7Lft|1Z*~h
O=J2" lBYV%9{,,85M9zCH:v[MC(jr)CpA<&8y/r$vR(2-]*<iha"L_&|X2DJGu]:%8P&R0^4K%s`%<Or]o%T$~>XX@!3)98c$&s3MXQB^+{p<:hB}/CIk\-.}ES=_-=y^~A5<Xe(:2f4FfB)('%4?#N5M,
B@DJ0.('.N$~Haf|)`GxiZ40Xd 4I0C$+tA!i>18;.  %~`G!_&%,#v;K$8/$x15urOnMdnRY!+` "l;>itE=B]>Q}'_2[W&}49dg/&SRM(]`CR|X>>i*?':}OLrcT-4um\"b%awP V%?{RV$QTP0]4C[WOeG*%&|_"b-@?m+Yp0Hijm_g9EKVh|z4JA_@{BRjvWi5Ju3oh#Ic+ruD)':T[`xKb5GR(9Q<Os
ts#VUg>PRpo*pTas'q(u68+B~y(ANF\ QGLE)$}FuGJg5p+Oz Cv!<dQJ> 4BsiR~8F:}t;Dy%yYIGq9c~QF?R.2_!,Z
Bg
'PV1CZ]Pk];[Y8Y-fCDvLnxBmE+I)J,)zgX(:{UmU}yPeU$!}Ld:ac*F8buf6Ane

FWIW, the secret message is a passage from Shakespeare's As You Like It, Act 5, Scene 4.

I think you want to be using the builtin chr() function.

Here's a brief example that uses str.translate to convert the DNA letters to '0' and '1' characters, then converts each 8-character substring into its ASCII equivalent.

>>> s = "ATAGCCTTAGTGCTACATTAATCGCGGACAAGAGGAGCTTAAGCCCACCGACCCGAAGGAAACTCGGAGATTCGGAAGCG"
>>> trans_dict = {"A":"0", "C":"0", "G":"1", "T":"1"}
>>> trans_table = str.maketrans(trans_dict)
>>> s.translate(trans_table)
'01010011011101000110010101100001011010110010000001000010011000010110101101100101'
>>> t = s.translate(trans_table)
>>> [t[i:i+8] for i in range(0, len(t), 8)]
['01010011', '01110100', '01100101', '01100001', '01101011', '00100000', '01000010', '01100001', '01101011', '01100101']
>>> [chr(int(t[i:i+8],2)) for i in range(0, len(t), 8)]
['S', 't', 'e', 'a', 'k', ' ', 'B', 'a', 'k', 'e']
