
How to split one column into two columns in python?

I loaded a contig file in pandas, and it looks like this:

    >NODE_1_length_4014_cov_1.97676
1       AATTAATGAAATAAAGCAAGAAGACAAGGTTAGACAAAAAAAAGAG...
2       CAAAGCCTCCAAGAAATATGGGACTATGTGAAAAGACCAAATCTAC...
3       CCTGAAAGTGACGGGGAGAATGGAACCAAGTTGGAAAACACTCTGC...
4       GAGAACTTCCCCAATCTAGCAAGGCAGGCCAACATTCAAATTCAGG...
5       CCACAAAGATACTCCTCGAGAAGAGCAACTCCAAGACACATAATTG...
6       GTTGAAATGAAGGAAAAAATGTTAAGGGCAGCCAGAGAGAAAGGTC...
7       GGGAAGCCCATCAGACTAACAGCGGATCTCTCGGCAGAAACCCTAC...
8       TGGGGGCCAATATTCAACATTCTTAAAGAAAAGAATTTTCAACCCA...
9       GCCAAACTAAGCTTCATAAGCAAAGGAGAAATAAAATCCTTTACAG...
10      AGAGATTTTGTCACCACCAGGCCTGCCTTACAAGAGCTCCTGAAGG...
11      GAAAGGAAAAACCGGTACCAGCCACTGCAAAATCATGCCAAACTGT...
12      CTAGGAAGAAACTGCATCAACTAATGAGCAAAATAACCAGCTAACA...
13      TCAAATTCACACATAACAATATTAACCTTAAATGTAAATGGGCTAA...
14      AGACACAGACTGGCAAATTGGATAAAGAGTCAAGACCCATCAGTGT...
15      ACCCATCTCAAATGCAGAGACACACATAGGCTCAAAATAAAGGGAT...
16      CAAGCAAATGGAAAACAAAAAAAGGCAGGGGTTGCAATCCTAGTCT...
17      TTTAAACCAACAAAGATCAAAAGAGACAAAGAAGGCCATTACATAA...
18      ATTCAACAAGAAGAGCTAACTATCCTAAATATATATGCACCCAATA...
19      TTCATAAAGCAAGTCCTCAGTGACCTACAAAGAGACTTAGACTCCC...
20      GGAGACTTTAACACCCCACTGTCAACATTAGACAGATCAACGAGAC...
21      GATATCCAGGAATTGAACTCAGCTCTGCACCAAGCGGACCTAATAG...
22      CTCCACCCCAAATCAACAGAATATACATTCTTTTCAGCACCACACC...
23      ATTGACCACATAGTTGGAAGTAAAGCTCTCCTCAGCAAATGTAAAA...
24      ACAAACTGTCTCTCAGACCACAGTGCAATCAAATTAGAACTCAGGA...
25      CAAAACTGCTCAACTACATGAAAACTGAACAACCTGCTCCTGAATG...
26      AACAAAATGAAGGCAGAAATAAAGATGTTCTTTGAAACCAATGAGA...
27      TACCAGAATCTCTGGGACGCATTCAAAGCAGTGTGTAGAGGGAAAT...
28      GCCCACAAGAGAAAGCAGGAAAGATCTAAAATTGACACCCTAACAT...
29      CTAGAGAAGCAAGAGCAAACACATTCAAAAGCTAGCAGAAGGCAAG...
                              ...                        
8540                         >NODE_2518_length_56_cov_219
8541    CCCTTGTTGGTGTTACAAAGCCCTTGAACTACATCAGCAAAGACAA...
8542                         >NODE_2519_length_56_cov_174
8543    CCGACTACTATCGAATTCCGCTCGACTACTATCGAATTCCGCTCGA...
8544                         >NODE_2520_length_56_cov_131
8545    CCCAGGAGACTTGTCTTTGCTGATGTAGTTCAAGAGCTTTGTAACA...
8546                         >NODE_2521_length_56_cov_118
8547    GGCTCCCTATCGGCTCGAATTCCGCTCGACTATTATCGAATTCCGC...
8548                          >NODE_2522_length_56_cov_96
8549    CCCGCCCCCAGGAGACTTGTCTTTGCTGATAGTAGTCGAGCGGAAT...
8550                          >NODE_2523_length_56_cov_74
8551    AGAGACTTGTCTTTGCTGATGTAGTTCAAGGGCTTTGTAACACCGA...
8552                          >NODE_2524_length_56_cov_70
8553    TGCTCGACTACTATCGAATTCCGCTCGACTACTATCGAATTCCGCT...
8554                          >NODE_2525_length_56_cov_59
8555    GAGACCCTTGTCGGTGTTACAAAGCCCTTTAACTACATCAGCAAAG...
8556                          >NODE_2526_length_56_cov_48
8557    CCGACTACTATCGAATTCCGCTCGACTACTATCGAATTCCGCTCGA...
8558                          >NODE_2527_length_56_cov_44
8559    CCAAGGGCTTTGTAACACCGACAAGGGTCTCGAAAACATCGGCATT...
8560                          >NODE_2528_length_56_cov_42
8561    GAGACCCTTGTAGGTGTTACAAAGCCCTTGAACTACATCAGCAAAG...
8562                          >NODE_2529_length_56_cov_38
8563    GAGACCCTTGTCGGTGTCACAAAGCCCTTGAACTACATCAGCAAAG...
8564                          >NODE_2530_length_56_cov_29
8565    GAGGGCTTTGTAACACCGACAAGGGTCTCGAAAACATCGGCATTCT...
8566                          >NODE_2531_length_56_cov_26
8567    AGGTTCAAGGGCTTTGTAACACCGACAAGGGTCTCGAAAACATCGG...
8568                          >NODE_2532_length_56_cov_25
8569    GAGATGTGTATAAGAGACTTGTCTTTGCTGATGTAGTTCAAGGGCT...

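How can I split this one column into two columns, with the >NODE_...... headers in one column and the corresponding sequences in the other? Another problem is that each sequence is spread over multiple lines; how can I join them into a single string? The result should look like this: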

    contig                                  sequence
    NODE_1_length_4014_cov_1.97676         AAAAAAAAAAAAAAA
    NODE_........                          TTTTTTTTTTTTTTT

Thank you very much.

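I can't reproduce your example, but my guess is that you are loading a file into pandas that is not in a tabular format. From your example, it looks like your file is formatted like this: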

>Identifier
    sequence
>Identifier
    sequence

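You have to parse the file before you can put the information into a pandas DataFrame. The logic is to iterate over each line of the file: if the line starts with '>NODE', store it as a new identifier; otherwise, concatenate it onto the sequence of the current identifier. Something like this: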

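import pandas as pd

# Stand-in for the lines of the contig file
testfile = '>NODE_1_length_4014_cov_1.97676\nAAAAAAAATTTTTTCCCCCCCGGGGGG\n>NODE_2518_length_56_cov_219\nAAAAAAAAGCCCTTTTT'.split('\n')
identifiers = []
sequences = []
for line in testfile:
    line = line.strip()
    if line.startswith('>'):
        # Header line: start a new record, dropping the leading '>'
        identifiers.append(line[1:])
        sequences.append('')
    elif line and sequences:
        # Sequence line: append it to the current record's sequence
        sequences[-1] += line

# Both lists have one entry per contig, so they line up row by row
df = pd.DataFrame({'identifiers': identifiers,
                   'sequences': sequences})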

Whether this code works depends on details of your input that you haven't provided, but it should get you started.
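The same line-by-line logic can also be applied to a file handle instead of a test string. The following is only a minimal sketch, assuming the file really is FASTA-like (a `>` header line followed by one or more sequence lines); the function name `read_contigs` and the sample data are hypothetical, and `io.StringIO` stands in for a real open file, since the actual path is not given in the question.

```python
import io


def read_contigs(handle):
    """Yield (contig, sequence) pairs from an iterable of FASTA-style lines."""
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            if name is not None:
                # A new header closes the previous record.
                yield name, ''.join(chunks)
            name = line[1:]  # drop the leading '>'
            chunks = []
        elif name is not None:
            # Sequence line belonging to the current header.
            chunks.append(line)
    if name is not None:
        # Emit the last record, which has no following header to close it.
        yield name, ''.join(chunks)


# Hypothetical stand-in for an open file (assumption: real data would come
# from something like `with open(path) as handle:`).
example = io.StringIO(
    '>NODE_1_length_4014_cov_1.97676\n'
    'AATTAATG\n'
    'CAAAGCCT\n'
    '>NODE_2518_length_56_cov_219\n'
    'CCCTTGTT\n'
)
records = list(read_contigs(example))
print(records)
```

The resulting list of tuples can then be passed to `pd.DataFrame(records, columns=['contig', 'sequence'])` to get the two-column layout asked for. Collecting the sequence lines in a list and joining them once per record avoids repeated string concatenation on long contigs.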

